EACL 2012

Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra) at EACL-2012

Proceedings of the Workshop
© 2012 The Association for Computational Linguistics

ISBN 978-1-937284-19-0

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: +1-570-476-0860
[email protected]

Introduction

Welcome to the Joint EACL Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra). This two-day workshop addresses two specific but related research problems in computational linguistics.

The ESIRMT event (1st day) aims at reducing the gap, both theoretical and practical, between information retrieval and machine translation research and applications. Although both fields have already been contributing to each other instrumentally, much work remains to be done to solidly frame these two fields into a common ground of knowledge from both the procedural and paradigmatic perspectives.

The HyTra event (2nd day) aims at sharing ideas among researchers who develop and apply statistical, example-based, or rule-based machine translation systems and who wish to enhance their systems with elements from the other approaches.

The joint workshop provides participants with the opportunity to discuss research related to technology integration and system combination strategies, at both the general level of cross-language information access and the specific level of machine translation technologies.

This workshop has been supported by the Seventh Framework Programme of the European Commission through the T4ME (METANET) contract (grant agreement no.: 249119), through the TTC contract (grant agreement no.: 248005), through the Marie Curie HyghTra contract, and by the Spanish Ministry of Economy and Competitiveness through the BUCEADOR project (TEC2009-14094-C04-01) and the Juan de la Cierva fellowship program.

We would like to thank all people who in one way or another helped in making this workshop a success. Our special thanks go to our plenary speakers, to the speakers of our invited project session, to our sponsors, to the participants of the panel discussion, to the members of the program committee who did an excellent job in reviewing the submitted papers, and to the EACL organizers, in particular the workshop general chairs Kristiina Jokinen and Alessandro Moschitti. Last but not least we would like to thank our authors and the participants of the workshop.

The Organizers
Avignon, France, April 2012


ESIRMT Organizers:

Marta R. Costa-jussà (Barcelona Media Innovation Center)
Patrik Lambert (University of Le Mans)
Rafael E. Banchs (Institute for Infocomm Research)

HyTra Organizers:

Reinhard Rapp (Universities of Mainz and Leeds)
Bogdan Babych (University of Leeds)
Kurt Eberle (Lingenio GmbH)
Tony Hartley (Toyohashi University of Technology and University of Leeds)
Serge Sharoff (University of Leeds)
Martin Thomas (University of Leeds)

Program Committee:

Jordi Atserias, Yahoo! Research, Barcelona, Spain
Sivaji Bandyopadhyay, Jadavpur University, Kolkata, India
Núria Bel, Universitat Pompeu Fabra, Barcelona, Spain
Pierrette Bouillon, ISSCO/TIM/ETI, University of Geneva, Switzerland
Chris Callison-Burch, Johns Hopkins University, Baltimore, USA
Michael Carl, Copenhagen Business School, Denmark
Oliver Culo, ICSI, University of California, Berkeley, USA
Andreas Eisele, Directorate-General for Translation, European Commission, Luxembourg
Marcello Federico, Fondazione Bruno Kessler, Trento, Italy
José A. R. Fonollosa, Universitat Politècnica de Catalunya, Barcelona, Spain
Mikel Forcada, University of Alicante, Spain
Alexander Fraser, Institute for Natural Language Processing (IMS), Stuttgart, Germany
Johanna Geiß, Lingenio GmbH, Heidelberg, Germany
Mireia Ginestí-Rosell, Lingenio GmbH, Heidelberg, Germany
Silvia Hansen-Schirra, FTSK, University of Mainz, Germany
Gareth Jones, Dublin City University, Ireland
Min-Yen Kan, National University of Singapore
Udo Kruschwitz, University of Essex, UK
Yanjun Ma, Baidu Inc., Beijing, China
Maite Melero, Barcelona Media Innovation Center, Barcelona, Spain
Haizhou Li, Institute for Infocomm Research, Singapore
Paul Schmidt, Institute for Applied Information Science, Saarbrücken, Germany
Uta Seewald-Heeg, Anhalt University of Applied Sciences, Köthen, Germany
Nasredine Semmar, CEA LIST, Fontenay-aux-Roses, France
Wade Shen, Massachusetts Institute of Technology, Cambridge, USA
Fabrizio Silvestri, Istituto di Scienza e Tecnologie dell'Informazione, Pisa, Italy
Harold Somers, CNGL, Dublin City University, Ireland
Anders Søgaard, University of Copenhagen, Denmark
Jörg Tiedemann, University of Uppsala, Sweden
Zygmunt Vetulani, University of Poznań, Poland

Invited Speakers:
Christof Monz (University of Amsterdam)
Philipp Koehn (University of Edinburgh)

Speakers of the Invited Project Session:
Cristina Vertan
George Tambouratzis, Marina Vassiliou and Sokratis Sofianopoulos
John Tinsley, Alexandru Ceausu and Jian Zhang
Svetla Koeva

Table of Contents

Semantic Web based Machine Translation Bettina Harriehausen-Mühlbauer and Timm Heuss ...... 1

Measuring Comparability of Documents in Non-Parallel Corpora for Efficient Extraction of (Semi-)Parallel Translation Equivalents Fangzhong Su and Bogdan Babych ...... 10

Full Machine Translation for Factoid Question Answering Cristina España-Bonet and Pere R. Comas ...... 20

An Empirical Evaluation of Stop Word Removal in Statistical Machine Translation Tze Yuang Chong, Rafael Banchs and Eng Siong Chng ...... 30

Natural Language Descriptions of Visual Scenes: Corpus Generation and Analysis Muhammad Usman Ghani Khan, Rao Muhammad Adeel Nawab and Yoshihiko Gotoh ...... 38

Combining EBMT, SMT, TM and IR Technologies for Quality and Scale Sandipan Dandapat, Sara Morrissey, Andy Way and Josef van Genabith ...... 48

Two approaches for integrating translation and retrieval in real applications Cristina Vertan ...... 59

PRESEMT: Pattern Recognition-based Statistically Enhanced MT George Tambouratzis, Marina Vassiliou and Sokratis Sofianopoulos ...... 65

PLUTO: Automated Solutions for Patent Translation John Tinsley, Alexandru Ceausu and Jian Zhang ...... 69

ATLAS - Human Language Technologies integrated within a Multilingual Web Content Management System Svetla Koeva ...... 72

Tree-based Hybrid Machine Translation Andreas Søeborg Kirkedal ...... 77

Were the clocks striking or surprising? Using WSD to improve MT performance Špela Vintar, Darja Fišer and Aljoša Vrščaj ...... 87

Bootstrapping Method for Chunk Alignment in Phrase Based SMT Santanu Pal and Sivaji Bandyopadhyay ...... 93

Design of a hybrid high quality machine translation system Bogdan Babych, Kurt Eberle, Johanna Geiß, Mireia Ginestí-Rosell, Anthony Hartley, Reinhard Rapp, Serge Sharoff and Martin Thomas ...... 101

Can Machine Learning Algorithms Improve Phrase Selection in Hybrid Machine Translation? Christian Federmann ...... 113

Linguistically-Augmented Bulgarian-to-English Statistical Machine Translation Model Rui Wang, Petya Osenova and Kiril Simov ...... 119

Using Sense-labeled Discourse Connectives for Statistical Machine Translation Thomas Meyer and Andrei Popescu-Belis ...... 129


ESIRMT-HyTra Workshop Program

Monday 23rd

(9:00-9:30) Workshop Presentation

(9:30-10:30) Invited Talk Christof Monz

(10:30-11:00) Coffee break

(11:00-11:30) ESIRMT Morning Session

Semantic Web based Machine Translation Bettina Harriehausen-Mühlbauer and Timm Heuss

(11:30-12:00)

Measuring Comparability of Documents in Non-Parallel Corpora for Efficient Extraction of (Semi-)Parallel Translation Equivalents Fangzhong Su and Bogdan Babych

(12:00-12:30)

Full Machine Translation for Factoid Question Answering Cristina España-Bonet and Pere R. Comas

(12:30-13:00)

An Empirical Evaluation of Stop Word Removal in Statistical Machine Translation Tze Yuang Chong, Rafael Banchs and Eng Siong Chng

Monday 23rd (continued)

(13:00-15:00) Lunch break

(15:00-15:30) ESIRMT Afternoon Session

Natural Language Descriptions of Visual Scenes: Corpus Generation and Analysis Muhammad Usman Ghani Khan, Rao Muhammad Adeel Nawab and Yoshihiko Gotoh

(15:30-16:00)

Combining EBMT, SMT, TM and IR Technologies for Quality and Scale Sandipan Dandapat, Sara Morrissey, Andy Way and Josef van Genabith

(16:00-16:30) Coffee break

(16:30-17:00) Project session

Two approaches for integrating translation and retrieval in real applications Cristina Vertan

(17:00-17:30)

PRESEMT: Pattern Recognition-based Statistically Enhanced MT George Tambouratzis, Marina Vassiliou and Sokratis Sofianopoulos

(17:30-18:00)

PLUTO: Automated Solutions for Patent Translation John Tinsley, Alexandru Ceausu and Jian Zhang

Monday 23rd (continued)

(18:00-18:30)

ATLAS - Human Language Technologies integrated within a Multilingual Web Content Management System Svetla Koeva

Tuesday 24th

(9:00-9:30) HyTra 1st Morning Session

(9:30-10:00)

Tree-based Hybrid Machine Translation Andreas Søeborg Kirkedal

(10:00-10:30) Coffee break

(10:30-11:30) Invited Talk Philipp Koehn

(11:30-12:00) HyTRA 2nd Morning Session

Were the clocks striking or surprising? Using WSD to improve MT performance Špela Vintar, Darja Fišer and Aljoša Vrščaj

(12:00-12:30)

Bootstrapping Method for Chunk Alignment in Phrase Based SMT Santanu Pal and Sivaji Bandyopadhyay

Tuesday 24th (continued)

(12:30-13:00)

Design of a hybrid high quality machine translation system Bogdan Babych, Kurt Eberle, Johanna Geiß, Mireia Ginestí-Rosell, Anthony Hartley, Reinhard Rapp, Serge Sharoff and Martin Thomas

(13:00-15:00) Lunch break

(15:00-15:30) HyTra Afternoon Session

Can Machine Learning Algorithms Improve Phrase Selection in Hybrid Machine Translation? Christian Federmann

(15:30-16:00)

Linguistically-Augmented Bulgarian-to-English Statistical Machine Translation Model Rui Wang, Petya Osenova and Kiril Simov

(16:30-17:00)

Using Sense-labeled Discourse Connectives for Statistical Machine Translation Thomas Meyer and Andrei Popescu-Belis

(17:00-17:30) Coffee break

(17:30-18:30) Panel discussion

Semantic Web based Machine Translation

Bettina Harriehausen-Mühlbauer, University of Applied Sciences, Darmstadt, Germany, [email protected]
Timm Heuss, University of Applied Sciences, Darmstadt, Germany, [email protected]

Abstract

This paper describes the experimental combination of traditional Natural Language Processing (NLP) technology with the Semantic Web building stack in order to extend the expert knowledge required for a Machine Translation (MT) task. Therefore, we first give a short introduction in the state of the art of MT and the Semantic Web, and discuss the problem of disambiguation, one of the common challenges in MT, which can only be solved using world knowledge during the disambiguation process. In the following, we construct a sample sentence which demonstrates the need for world knowledge and design a prototypical program as a successful solution for the outlined translation problem. We conclude with a critical view on the developed approach.

1 Introduction

Over the past decades, Machine Translation (MT) has undergone various changes with regard to the underlying technology. Starting in the middle of the last century with rule-based MT, a first logical step was taken towards the end of the century, when statistical methods in Natural Language Processing (NLP) gained overall importance, as the growing number of online available texts could be used as a basis for statistical computations performed on these texts and translations, which resulted in an enhancement of existing rules, statistics and thus results. The new field of Statistical Machine Translation (SMT) was born, and MT systems became increasingly better as more and more texts and translations were available. In parallel to the developments in MT, the Web has significantly grown and gained importance, especially in the recently defined field of the Semantic Web. After having accepted statistical methods as a promising change in MT, we believe that a next logical step will combine MT with Semantic Web technology, resulting in a new focus which can be called Semantic Web Machine Translation (SWMT).

In this paper, we will develop our ideas step by step and will demonstrate on a sample sentence including a lexical ambiguity that our approach does not involve a costly disambiguation process on the basis of parsing online-available texts. Instead, we believe that modern Information Technology (IT) is aligned and committed to information and its markup, as the W3C Semantic Web technology stack (http://semanticweb.org/wiki/Main_Page, accessed 2011-12-18) demonstrates, and that we can use the contained knowledge in our disambiguation process without additional MT rules or statistics being applied.

2 Development and change of focus in MT: from the rule-based past to the web-based future of MT

Traditionally, most MT systems were rule-based systems built on electronic analysis and generation grammars as well as a language-pair-dependent transfer component. These Rule-based Machine Translation (RBMT) systems always involved a careful and time-consuming development of grammatical rules.

More recent development in MT has started to use the vast amount of texts and knowledge that is available online for translations based on statistics and probabilities, leading to a separate focus in MT, namely SMT.

With the growing size of texts available in the web, it is a logical next step to consider using the available knowledge in these texts to enhance NLP applications, including MT, leading to a yet new focus, which we call SWMT.

In this chapter we will develop our idea by starting with a look at how MT has developed over the past decades, how it has made use of the expanding Web in recent years, and where we see further potential in using existing knowledge for MT technology.

2.1 Statistical Machine Translation

The dream of automatically translating documents from foreign languages into English, or between any two languages, is one of the oldest pursuits of NLP, being a subfield of artificial intelligence research. Traditional MT systems computed translations primarily on the basis of analysis and generation phrase-structure rules, which had to be manually coded in a costly fashion.

One of the leading users of SMT is Google, and engineer Anton Andryeyev explains SMT's essence as follows: "SMT generates translations based on patterns found in large amounts of text. [...] Instead of trying to teach the machine all the rules of a language, SMT effectively lets computers discover the rules for themselves. It works by analysing millions of documents that have already been translated by humans [...]. [...] Key to SMT are the translation dictionaries, patterns and rules that the program develops. It does this by creating many different possible rules based on the previously translated documents and then ranking them probabilistically. Google admits this approach to translation inevitably depends on the amount of texts available in particular languages [...]." (Boothroyd 2011)

Therefore, with the change of available resources and the growing amount of natural language that is available in machine-readable format, as well as the growing number of users inputting corrections to machine translations manually, thus allowing a direct and correct match between source and target texts, we have entered this subfield of MT which focuses on a statistical analysis of texts, in which documents are translated according to a probability distribution p(e|f), which states that a string e in the target language is the translation of a string f in the source language.

Philipp Koehn, being among the most popular SMT researchers and developers, also highlights today's quality of SMT and the relevance of the vast amounts of texts in the web, which provide the basis for SMT translations, by stating "Now, armed with vast amounts of example translations and powerful computers, we can witness significant progress toward achieving that dream." (Koehn et al. 2012)

The research field of statistical machine translation is a rather new one. In his commented bibliography (http://www.statmt.org/book/bibliography/, accessed 2012-01-30), Koehn includes statistics about the distribution of publications in the SMT field across the years 1953 until 2008. It is clearly shown that only a few publications appeared before the millennium change and that SMT clearly became an issue of growing interest in the new millennium, with a peak in 2006. Scientists working in the MT field suddenly became aware of the relevance and potential provided by statistics in machine translation and computational linguistics in general. Still in 2003, Knight & Koehn stated that "the currently best performing statistical machine translation systems are still crawling at the bottom" (Knight & Koehn 2004, p. 10), implying that most of the approaches hadn't gone beyond simple word-to-word translations yet and hadn't included more advanced stages of NLP, like syntax or even semantics. Among those who made essential contributions to the field of SMT was Kevin Knight, who stated in 1999 that "We want to automatically analyse existing human sentence translations, with an eye toward building general translation rules; we will use these rules to translate new texts automatically." (Knight 1999)

The previous statements all point at the vast knowledge included in the just as vast amounts of texts available in digital form in the internet, partly in the form of human sentence translations.

At the same time that MT started clearly moving into using the Web to search for machine-readable texts and translations that could be used in the expanding SMT field, Tim Berners-Lee (Berners-Lee & Hendler 2001) defined the knowledge that is included in the Web content to expand the traditional WWW to become a Semantic Web.
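As background for the probability distribution p(e|f) above: in the textbook noisy-channel formulation of SMT (standard material, added here for illustration rather than taken from this paper), decoding picks the target string that maximizes this distribution:

\hat{e} = \arg\max_{e} p(e \mid f) = \arg\max_{e} p(f \mid e)\, p(e)

where p(f|e) is the translation model estimated from aligned texts and p(e) is a monolingual language model over the target language.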

As we are looking at an expanded view of how to use the Web, and specifically the Semantic Web, for our approach of MT, we would like to draw parallels between what has been said so far about MT and the innovative possibilities that the Semantic Web provides for MT research.

2.2 W3C Semantic Web

The World Wide Web (WWW) was once designed to be as simple, as decentralized and as interoperable as possible (Berners-Lee 1999, 36f.). The Web evolved and became a huge success; however, information was limited to humans. In order to make information available to machines, an extending and complementary set of technologies was introduced in the new millennium by the W3C, the Semantic Web (http://www.w3.org/standards/semanticweb/, accessed 2012-01-25) (Berners-Lee & Hendler 2001).

The base technology of the Semantic Web is the data format Resource Description Framework (RDF). Aligned to the so-called AAA slogan that "Anyone can say Anything about Any topic" (Allemang & Hendler 2008, p. 35), it defines a structure that is meant to "be a natural way to describe the vast majority of the data processed by machines" (Berners-Lee & Hendler 2001). In addition to the AAA slogan, a basic construction paradigm of the Semantic Web is the Open World Assumption - the fact that there is always more knowledge than we currently know; new knowledge can always be added later.

RDF expresses meaning by encoding it in sets of triples (Berners-Lee & Hendler 2001), composed of subject, predicate and object, which are, in the N3 notation format (http://www.w3.org/DesignIssues/Notation3.html, accessed 2012-01-29), likewise written down as triples:

:subject :predicate :object

We see strong connections between MT and the W3C Semantic Web.

A lot of ideas exist on how to augment the Resource Description Framework (RDF) - the base format of the Semantic Web - with natural language. Since the beginning, RDF itself provided capacities for a "human-readable version of a resource's name" (Guha 2004), rdfs:label, with an optional language notation following RFC-3066 (http://www.ietf.org/rfc/rfc3066.txt, accessed 2012-01-25) (Klyne & Carroll 2004). In addition to that, the Simple Knowledge Organization System (SKOS) ontology features a small selection of unicode labels for "creating human-readable representations of a knowledge organization system", skos:prefLabel, skos:altLabel and skos:hiddenLabel - but it also remarks that it "does not necessarily indicate best practice for the provision of labels with different language tags" (Miles 2008).

Some alternatives were developed to the approaches above, to address limitations especially of rdfs:label and to represent natural language within semantic knowledge in a more sophisticated way, like the SKOS eXtension for Labels (SKOS-XL, http://www.w3.org/TR/skos-reference/skos-xl.html, accessed 2012-01-26), Lemon (http://www.w3.org/International/multilingualweb/madrid/slides/declerck.pdf, accessed 2012-01-31) and LexInfo (http://lexinfo.net/, accessed 2012-01-31).

And even in the area of wordnets, which might be considered a more traditional NLP domain, W3C Semantic Web technology plays a role, as approaches were developed to bridge the gap between natural language representations within these wordnets and the design principles of the Semantic Web (Graves & Gutierrez 2005). The conversion of Princeton WordNet (http://wordnet.princeton.edu/, accessed 2012-01-26), for example, to RDF/OWL is covered by a W3C Working Draft (van Assem et al. 2006), and there is an equivalent approach for the GermaNet wordnet (http://www.sfs.uni-tuebingen.de/lsd/, accessed 2012-01-26) (Kunze & Lüngen 2007), adapting the ideas of the Princeton WordNet conversion.

We decided to give a brief overview of the state of the art of SMT and the Semantic Web, as both areas of research are not only very new developments but they share using information in the Web for their applications, and they both offer promising enhancements to traditional, rule-based MT technology. Nevertheless, SMT and Semantic Web technologies have fundamental differences: SMT systems like Moses (a statistical machine translation system developed by the Statistical MT Research group of the University of Edinburgh, http://www.statmt.org/moses/), Babel Fish or Google compute their translations on a pure probability count of n-grams of different length in order to find the best translation by picking the one that gives the highest probability. As these systems have access to a growing text corpus, which is, as in the case of Moses, directly enhanced by collecting manual corrections given by users after the system has computed an inadequate translation, they become better with time. But exactly these statistically based computations are neither possible nor allowed in the Semantic Web because of the Open World Assumption.
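As a small aside: the language-tagged rdfs:label mechanism described above can be exercised with a few lines of Jena, the framework the prototype in section 6 is built on. This is a minimal sketch for illustration only; the namespace and label values are ours, not from the paper, and the package names follow the pre-Apache Jena 2.x layout current in 2012.

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.vocabulary.RDFS;

public class LabelExample {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        // An illustrative resource in an example namespace
        Resource produces = model.createResource("http://www.example.org/#produces");
        // Two rdfs:label literals carrying RFC-3066 style language tags
        produces.addProperty(RDFS.label, model.createLiteral("produces", "en-US"));
        produces.addProperty(RDFS.label, model.createLiteral("produziert", "de-DE"));
        // Serialize the resulting triples in N3 notation
        model.write(System.out, "N3");
    }
}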

3 New idea: Enhancing NLP with Semantic Web technology

With our new approach, we suggest to base MT on a newly defined set of rules, which differ both from rules known from earlier MT approaches and from any rules that are applied in SMT. Our rules follow Tim Berners-Lee's vision, in that knowledge, once defined and formalized, is accessible in arbitrary ways. As mentioned earlier, we believe that modern IT follows the commitment of information and its markup, and the Semantic Web technology stack is a perfect implementation of that paradigm.

To demonstrate our approach, we selected a common and well known issue: the problem in many areas of NLP is the ambiguity of natural language on various levels, from word level to sentence level. In many cases, strings can only be disambiguated on the basis of world or expert knowledge. How else would a machine decide whether the prepositional phrase is modifying the verb or the preceding noun in "He eats fish with a fork." vs. "He eats fish with bones."? Especially with translations, it is often crucial to understand the source text correctly, as otherwise ambiguities may result in incomprehensible target language translations, as the examples below will demonstrate.

The state-of-the-art technology of the World Wide Web to express information, facts and relations for both humans and machines is RDF. So it is not unlikely that nowadays expert knowledge is encoded in that format, too.

Taking care of lacking expert knowledge with Semantic Web technology and thus extending existing MT technology seems to be a promising research area. Instead of just combining RBMT with SMT, we suggest to add the power of the Semantic Web to these existing technologies, as the previous approaches were not able to extract and use knowledge from the Web in their translation algorithms and thus leave ambiguities unsolved. The previously quoted statements made it clear that MT can only be enhanced on the basis of a growing size of text. We claim that the next logical step is to use this growing size of text not only statistically, but in a well-defined way which is offered by Semantic Web technology. The power of our idea is the combination of a strong, proven technology with a popular, open, machine-readable data format.

In order to demonstrate how our approach will enhance existing MT systems, we chose to use a variety of MT systems, some rule-based (e.g. Personal Translator 14, distributed by Linguatec), others statistics-based (e.g. Babel Fish, Google, and Moses), to compare their context-free translation results against our approach. We use those context-free translation results as a starting point for further processing with Semantic Web technology. Traditional MT technology should therefore not be replaced, but enhanced with semantics, to benefit from the advantages provided by the Web.

In our sample scenario, the required world knowledge for the sample sentence "Pages by Apple is better than Word by MS." is modelled as RDF instances. We selected a simple file-based storage, with the actual translations being stored as rdfs:labels (http://www.w3.org/TR/rdf-schema/#ch_label, accessed 2011-12-19) which are localized as defined in Best Common Practice 47 (BCP47, http://www.rfc-editor.org/rfc/bcp/bcp47.txt, accessed 2011-12-19). To take advantage of the powerful Semantic Web tool set, parts of the world knowledge are not directly defined, but can be inferenced by Web Ontology Language (OWL) capacities. The goal is to produce a semantically good translation for the given sentence.

3.1 A sample scenario

The first step is the construction of an expressive sample scenario where world knowledge is critical for the MT. We looked at the results a number of different translation tools computed for our sample sentence: Google Translator (http://translate.google.de/, accessed 2011-12-18), Bing Translator (http://www.microsofttranslator.com/, accessed 2011-12-18), an online demo of Philipp Koehn's Moses (http://demo.statmt.org/index.php, accessed 2012-01-29), Linguatec Personal Translator PT 14 (http://www.linguatec.net/products/tr/pt, accessed 2012-01-29; rule-based) and the reference translation in this paper, Yahoo! Babel Fish (http://de.babelfish.yahoo.com/, accessed 2011-12-18).

Research concluded with the following sentence, requiring the "expert knowledge" that a vendor called Apple produced a product named Pages and a vendor called MS (a very popular short form of Microsoft) a product named Word:

Pages by Apple is better than Word by MS.

One important measure to stress the translation service is to use "indirect" product names (Pages by Apple and not Apple Pages) to prevent it from deriving product names from possible entries. Another "trap" was to abbreviate Microsoft with MS to irritate possible n-gram statistics.

The resulting German translations of the sample sentence were the following:

Google Translator:
Pages von Apple ist besser als Word MS.

Bing Translator:
Seiten von Apple ist besser als MS Word.

Babel Fish:
Seiten durch Apple ist besser als Wort durch Frau.

Moses Machine Translation Demo:
Seiten von Apple ist besser als Word von MS behandelt.

Personal Translator PT 14:
Paginiert von Apple ist besser als durch MS auszudrücken.

All translations failed, because they did not take semantic relations into consideration. This is a systematic issue in MT, demonstrating the necessity of including world knowledge in the computation of the target translation.

4 More examples

As ambiguities are a common MT problem, there are various examples where MT can be enhanced by world knowledge.

Consider, for example, popular persons that have ambiguous last names - like the politicians George W. Bush, Helmut Kohl (the proper name Kohl is also the German word for cabbage) or Joschka Fischer (Fischer means fisherman in German), to name a few. MT systems are likely to translate those names if they are not included in dedicated expert dictionaries. But thanks to projects like DBpedia (http://dbpedia.org/, accessed 2012-03-12), we already have the knowledge available in a Semantic Web accessible format and could just use it.

Another area that might benefit from a Semantic Web Machine Translation is the internationalization of technical documents or handbooks, which usually deal with several termini technici. Once modelled in RDF, the required expert knowledge is universally present and could aid the translation process as well.

5 Analysis

World knowledge is the crucial point for the translation quality of the selected sample sentences. It becomes obvious that in situations like this, with missing expert dictionaries, rule sets or lacking statistical tooling like n-grams, the translation quality is relatively low. And this is not an unrealistic scenario: there will always be uncovered areas in expert dictionaries or missing statistics in a certain domain.

In the given example, if we are looking at the Babel Fish translation, the translation engine was totally mousetrapped as it translated the Apple product Pages with the obviously context-free German translation Seiten.

Furthermore, it interpreted MS as a salutation and Word as the German Wort - all mistakes caused by lexical ambiguities because of the lack of context knowledge.

6 Implementation

In order to prove our idea, we have developed a prototypical application implementing a Semantic Web enhanced SMT. One principal design goal was to keep the program simple, but to apply state-of-the-art Semantic Web technology like RDF and the query language SPARQL, which are both W3C recommendations.

Because of the powerful but easy to use Jena Semantic Web Framework (http://jena.sourceforge.net/, accessed 2011-12-19), a prototype implementation is written in the Java programming language. The involved MT components are:

Trivial word dictionary: Performs a one-by-one word translation. Entries are designed to reflect the translation results of Babel Fish.

Semantic Core: Reads a file-based RDF triple store, executes SPARQL queries and performs reasoning to inference new knowledge. Resulting text phrases may override certain results derived by dictionary entries.

Figure 1: Architectural overview of the involved components and exchanged tokens.

The following sections give more details about the concrete implementation of those components and the overall execution logic.

6.1 Trivial word dictionary

To fake Babel Fish's translation logic, a very simplified dictionary is defined with its content aligned at its online pendant. As figure 2 shows, the context-free translation is reproduced with word-by-word translations.

English    German
Apple      Apfel
Pages      Seiten
Word       Wort
better     besser
...        ...

Figure 2: Simplified dictionary to reproduce Babel Fish's simple and context-free translation results.
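A minimal sketch of what such a trivial dictionary component could look like in Java; the class and method names are our own illustration (the paper shows no code for this component), and the entries mirror figure 2.

import java.util.LinkedHashMap;
import java.util.Map;

public class TrivialDictionary {
    // Entries mirror figure 2; insertion order matters for naive replacement
    private final Map<String, String> entries = new LinkedHashMap<String, String>();

    public TrivialDictionary() {
        entries.put("Pages", "Seiten");
        entries.put("Apple", "Apfel");
        entries.put("Word", "Wort");
        entries.put("better", "besser");
    }

    // Reproduces a context-free, word-by-word translation
    public String translate(String sentence) {
        String result = sentence;
        for (Map.Entry<String, String> e : entries.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }
}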

6.2 Semantic Core

The much more interesting part is modelling the world knowledge with Semantic Web technologies. Thereby, a simple file-based RDF store is used. The notation format is consistently N3 (http://en.wikipedia.org/wiki/Notation3, accessed 2011-12-19), because of its very good human-readability.

As mentioned in previous sections, world knowledge about Apple and Microsoft is crucial in this translation task. So the first statements within the RDF store are about both vendors and the products they produce (for the sake of simplicity, all statements are aligned in the default namespace http://www.example.org/#):

:apple a :vendor, :trigger;
  rdfs:label "Apple";
  :produces :numbers, :pages, :iphone .

In this case, the instance :apple is defined to be of the types :vendor and :trigger. While the former type has no special meaning in this context, the latter is especially important: :trigger instances mark significant keywords, indicating that additional world knowledge should be loaded when they occur in a sentence. So in this example, occurrence of the word Apple (the rdfs:label of :apple) in the source text triggers loading and parsing of the :apple instance and all uses of it within the store.

Furthermore, some products are defined to be produced by :apple.
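The trigger mechanism just described amounts to a simple lookup against the store. A sketch of how it could be phrased as a SPARQL ASK query via Jena follows; the query text, class name and namespace are illustrative, not the authors' actual code.

import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.rdf.model.Model;

public class TriggerLookup {
    // Returns true if some resource of type :trigger carries the word as rdfs:label.
    // Naive string building is acceptable for a sketch, but not injection-safe.
    public static boolean isTrigger(Model model, String word) {
        String q =
            "PREFIX : <http://www.example.org/#> " +
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> " +
            "ASK { ?s a :trigger ; rdfs:label \"" + word + "\" }";
        QueryExecution qe = QueryExecutionFactory.create(QueryFactory.create(q), model);
        try {
            return qe.execAsk();
        } finally {
            qe.close();
        }
    }
}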

The property :produces as well as its opposite :producedby are defined as follows:

:produces rdfs:label "produces"@en-US, "produziert"@de-DE .
:producedby rdfs:label "by"@en-US, "von"@de-DE .

Note that both properties have dual-language labels. This allows the program to express the world knowledge :apple :produces :iphone in simple but natural English as well as in German.

In the next step, both properties are semantically connected as owl:inverseOf each other:

:produces owl:inverseOf :producedby .

These few statements already allow inferencing - reasoning about information that is given implicitly. So it is not only a fact that :apple :produces :iphone, but also, after OWL inferencing, the fact that :iphone :producedby :apple - without having to state that directly.

Finally, the products get their proper names assigned:

:numbers rdfs:label "Numbers" .
:word rdfs:label "Word" .
:windows rdfs:label "Windows" .
:pages rdfs:label "Pages" .

These few lines form the knowledge base which is, thanks to inferencing, sufficient to solve the translation task. The following dictionary entries can directly be read out of the RDF knowledge base:

Microsoft produces Windows
MS produces Windows
Microsoft produces Word
MS produces Word
Apple produces Pages
Apple produces Numbers

By evaluating the predicate :produces and inferencing the :producedby statements, the knowledge base in addition contains the inverted entries:

Word by MS
Word by Microsoft
Word produced by MS
Word produced by Microsoft
Windows by MS
Windows by Microsoft
Windows produced by MS
Windows produced by Microsoft

6.3 Wiring it together

As mentioned before, the semantic world knowledge should enhance traditional MT translations. Therefore, the program technically produces two translations of the sentence "Pages by Apple is better than Word by MS.": The first translation is done by the trivial dictionary, simply by string-replacing English with German words according to figure 2. The second translation first tries to find a better translation by checking trigger keywords, querying the RDF store for knowledge, inferencing relationships and resolving labels for the right language, before it continues with the same word-by-word replacing mechanism as in the first translation.

Figure 3: The two translations produced by the program and their technological foundation.
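The owl:inverseOf entailment used above can be reproduced with Jena's built-in OWL reasoner. The following sketch (file name, class name and namespace are illustrative) wraps the file-based store in an inference model and then reads the inferred :producedby statement.

import com.hp.hpl.jena.rdf.model.InfModel;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.RDFNode;
import com.hp.hpl.jena.rdf.model.StmtIterator;
import com.hp.hpl.jena.reasoner.ReasonerRegistry;

public class InferenceExample {
    public static void main(String[] args) {
        String ns = "http://www.example.org/#";
        Model base = ModelFactory.createDefaultModel();
        base.read("file:world-knowledge.n3", "N3"); // the file-based RDF store
        // Wrap the base model with an OWL rule reasoner
        InfModel inf = ModelFactory.createInfModel(ReasonerRegistry.getOWLReasoner(), base);
        // :iphone :producedby ?x is never stated, only inferred via owl:inverseOf
        StmtIterator it = inf.listStatements(
            inf.getResource(ns + "iphone"), inf.getProperty(ns + "producedby"), (RDFNode) null);
        while (it.hasNext()) {
            System.out.println(it.nextStatement()); // expect: :iphone :producedby :apple
        }
    }
}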

6.4 Program execution

Our prototype simply executes both described translations and prints the result out.

Source sentence:
Pages by Apple is better than Word by MS.

Semantic Web enhanced translation:
Pages von Apple ist besser als Word von MS.

These simple lines, especially the "Semantic Web enhanced translation", involve a lot of processing in the background which is not visible to the user - except for his waiting time. However, a semantically correct translation solution was found.

7 Critical view on the solution

We feel we created something notable here. However, we stand at the very beginning of our research and have encountered corresponding issues.

Surprisingly, implementation of the program logic - especially the query mechanism - turned out to be quite complicated, even for a simple scenario like in this case with a very limited corpus. As a result, the stepwise refinement of a translation (trigger word, query of knowledge, inference of relationships and multi-language label resolution) consists of a lot of SPARQL queries. These queries require some processing time and power, which is already notable in this tiny example. This finally leads to the conclusion that performance might be a major drawback of our approach, at least for the current implementation.

Another issue was connected to the data format: the translation environment, especially the usage of RDF triples consisting of subject, predicate and object, might be regarded as too closely aligned to the very special and constructed problematic of only a number of real-world problems. Sentences have to be somehow split into triples, which is quite an artificial border - not to say a technical limitation - of RDF. Real-world NLP surely does not fit into the tripartite simplifications of RDF, and the question is then how often real-world problems would benefit from this solution.

Another issue is the Open World Assumption, built into each Semantic Web component: there is no gold standard truth in the Semantic Web, and therefore we will never be able to find the "best" translation for a given sentence within SPARQL queries or inferencing results. Probably, our approach does not hold for providing complete translation solutions, but for giving very qualified suggestions. Some SMT tools, like Moses, actually do work with suggestions.

However, some of these issues might be solved by applying more sophisticated NLP/MT technology, like n-grams. Besides these issues, the program works as expected and Semantic Web technology was successfully used to integrate world knowledge into an MT process. Thus, the translation gathered a better quality, and it can be stated that the experiment was successful.

8 Related work

The project Monnet has, according to its mission statement (http://www.monnet-project.eu/Monnet/Monnet/English?init=true, accessed 2012-01-26), a similar idea to combine MT with Semantic Web technology. However, results are still pending or not publicly accessible at this point.

We also acknowledge the work by Elita and Birladeanu (2005), who outlined the combination of the Semantic Web with Example Based Machine Translation (EBMT), which is very much related to our approach. However, there are major differences: Elita and Birladeanu (2005) only applied their technique to certain phrases of official documents - sequences of words they call "fields" (Elita & Birladeanu 2005, p. 14). Our idea is however to aid translation of complete sentences. Another very important difference is the intensiveness of use of W3C technology. Unlike Elita and Birladeanu (2005), we heavily use RDF, SPARQL and - probably the most promising matter of fact - OWL reasoning, and try to follow the Semantic Web standard tooling very strictly.

9 Outlook

At this point in our research, we have not yet combined existing MT technology, especially SMT, with SWMT. The combination of approaches has yet to be explored, but existing MT technologies and SWMT are certainly not mutually exclusive, and we suspect that a combination of MT approaches will lead to yet even better results, especially in cases where the translation quality is based on world or expert knowledge.

10 Conclusion

In the recent past, MT researchers have already discussed the combination of RBMT and SMT (Hutchins 2009, pp. 13-20). We suggest to add yet another possibility in MT to existing MT approaches, namely a Semantic Web based MT (SWMT).

In this paper we have taken a next logical step in MT technology by including not only the vast amounts of texts available in the Web to enhance MT quality, applying statistical computations across online texts and translations, but going one step further by looking at the power of and knowledge contained in the Semantic Web.

By taking advantage of the knowledge in the Web of the future, our approach of combining Semantic Web technology with MT allows this world knowledge to be made available for machine translations, thus addressing challenges in MT, such as lexical ambiguities. In our discussed sample sentences, we have shown that a solution for the disambiguation would traditionally involve a costly disambiguation process or would be left unsolved. Using our SWMT approach, the MT quality benefits from world knowledge extracted from the Semantic Web and its technology.

This combination of MT with Semantic Web technology results in a new focus of MT which we suggest to be called Semantic Web based MT (SWMT).

11 Acknowledgements

We would like to thank Rike Bacher from Linguatec and the reviewers of the ESIRMT-HyTra workshop 2012 for their valuable hints.

References

Allemang, D. & Hendler, J. (2008), Semantic Web for the Working Ontologist: Effective Modeling in RDFS and OWL, Morgan Kaufmann.

Berners-Lee, T. (1999), Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by its Inventor, HarperOne.

Berners-Lee, T. & Hendler, J. (2001), 'The Semantic Web', Scientific American, USA.

Boothroyd, D. (2011), 'Statistical machine translation to enable universal communication?'. URL: http://www.newelectronics.co.uk/electronics-technology/statistical-machine-translation-to-enable-universal-communication/33008/

Elita, N. & Birladeanu, A. (2005), 'A first step in integrating an EBMT into the Semantic Web'. URL: http://www.mt-archive.info/MTS-2005-Elita.pdf

Graves, A. & Gutierrez, C. (2005), 'Data representations for WordNet: A case for RDF'. URL: http://www.dcc.uchile.cl/~cgutierr/papers/wordnet-rdf.pdf

Guha, R. (2004), 'RDF Vocabulary Description Language 1.0: RDF Schema'. URL: http://moodletest.ncnu.edu.tw/file.php/9506/references-2009/RDF_schema_1.pdf

Hutchins, J. (2009), 'Multiple Uses of Machine Translation and Computerised Translation Tools', Machine Translation, pp. 13-20. URL: http://www.hutchinsweb.me.uk/Besancon-2009.pdf

Klyne, G. & Carroll, J. (2004), 'Resource Description Framework (RDF): Concepts and Abstract Syntax', W3C Recommendation.

Knight, K. (1999), A Statistical MT Tutorial Workbook, in 'Prepared for the 1999 JHU Summer Workshop'. URL: http://www.snlp.de/prescher/teaching/2007/StatisticalNLP/bib/1999jhu.knight.pdf

Knight, K. & Koehn, P. (2004), 'What's New in Statistical Machine Translation', Tutorial, HLT/NAACL, pp. 1-89. URL: http://www.auai.org/uai2003/Knight-UAI-03.pdf

Koehn, P., Osborne, M., Haddow, B., Auli, M., Buck, C., Dugast, L., Guillou, L., Hasler, E., Matthews, D., Williams, P., Wilson, O. & Saint-Amand, H. (2012), 'Statistical Machine Translation at the University of Edinburgh'.

Kunze, C. & Lüngen, H. (2007), 'Repräsentation und Verknüpfung allgemeinsprachlicher und terminologischer Wortnetze in OWL' [Representation and linking of general-language and terminological wordnets in OWL], Zeitschrift für Sprachwissenschaft.

van Assem, M., Gangemi, A. & Schreiber, G. (2006), 'RDF/OWL Representation of WordNet', W3C Working Draft. URL: http://www.w3.org/TR/wordnet-rdf/

Miles, A. (2008), 'SKOS Simple Knowledge Organization System Reference', W3C Recommendation.

Measuring Comparability of Documents in Non-Parallel Corpora for Efficient Extraction of (Semi-)Parallel Translation Equivalents

Fangzhong Su, Centre for Translation Studies, University of Leeds, LS2 9JT, Leeds, UK, [email protected]
Bogdan Babych, Centre for Translation Studies, University of Leeds, LS2 9JT, Leeds, UK, [email protected]

Abstract

In this paper we present and evaluate three approaches to measure comparability of documents in non-parallel corpora. We develop a task-oriented definition of comparability, based on the performance of automatic extraction of translation equivalents from the documents aligned by the proposed metrics, which formalises intuitive definitions of comparability for machine translation research. We demonstrate application of our metrics for the task of automatic extraction of parallel and semi-parallel translation equivalents, and discuss how these resources can be used in the frameworks of statistical and rule-based machine translation.

1 Introduction

Parallel corpora have been extensively exploited in different ways in machine translation (MT), both in Statistical (SMT) and, more recently, in Rule-Based (RBMT) architectures: in SMT, aligned parallel resources are used for building translation phrase tables and calculating translation probabilities; and in RBMT, they are used for automatically building bilingual dictionaries of translation equivalents and automatically deriving bilingual mappings for frequent structural patterns. However, large parallel resources are not always available, especially for under-resourced languages or narrow domains. Therefore, in recent years, the use of cross-lingual comparable corpora has attracted considerable attention in the MT community (Sharoff et al., 2006; Fung and Cheung, 2004a; Munteanu and Marcu, 2005; Babych et al., 2008).

Most of the applications of comparable corpora focus on discovering translation equivalents to support machine translation, such as bilingual lexicon extraction (Rapp, 1995; Rapp, 1999; Morin et al., 2007; Yu and Tsujii, 2009; Li and Gaussier, 2010; Prochasson and Fung, 2011), parallel phrase extraction (Munteanu and Marcu, 2006), and parallel sentence extraction (Fung and Cheung, 2004b; Munteanu and Marcu, 2005; Munteanu et al., 2004; Smith et al., 2010).

Comparability between documents is often understood as belonging to the same subject domain, genre or text type, so this definition relies on these vague linguistic concepts. The problem with this definition then is that it cannot be exactly benchmarked, since it becomes hard to relate automated measures of comparability to such inexact and unmeasurable linguistic concepts. Research on comparable corpora needs not only good measures for comparability, but also a clearer, technologically-grounded and quantifiable definition of comparability in the first place.

In this paper we relate comparability to the usefulness of comparable texts for MT. In particular, we propose a performance-based definition of comparability, as the possibility to extract parallel or quasi-parallel translation equivalents - words, phrases and sentences which are translations of each other. This definition directly relates comparability to texts' potential to improve the quality of MT by adding extracted phrases to phrase tables, training corpora or dictionaries. It also can be quantified as the rate of successful extraction of translation equivalents by automated tools, such as those proposed in Munteanu and Marcu (2006).

Still, successful detection of translation equivalents from comparable corpora very much depends on the quality of these corpora, specifically on the degree of their textual equivalence and on successful alignment of various text units.

Therefore, the goal of this work is to provide comparability metrics which can reliably identify cross-lingual comparable documents in raw corpora crawled from the Web, and characterize the degree of their similarity. This enriches comparable corpora with document alignment information, filters out documents that are not useful, and eventually leads to the extraction of good-quality translation equivalents from the corpora.

To achieve this goal, we need to define a scale to assess comparability qualitatively, metrics to measure comparability quantitatively, and the sources to get comparable corpora from. In this work, we directly characterize comparability by how useful comparable corpora are for the task of detecting translation equivalents in them, and ultimately for machine translation. We focus on document-level comparability, and use three categories for the qualitative definition of comparability levels, defined in terms of granularity for possible alignment:

• Parallel: Traditional parallel texts that are translations of each other, or approximate translations with minor variations, which can be aligned on the sentence level.

• Strongly-comparable: Texts that talk about the same event or subject, but in different languages. For example, international news about the oil spill in the Gulf of Mexico, or linked articles in Wikipedia about the same topic. These documents can be aligned on the document level on the basis of their origin.

• Weakly-comparable: Texts in the same subject domain which describe different events. For example, customer reviews about hotels and restaurants in London. These documents do not have an independent alignment across languages, but sets of texts can be aligned on the basis of belonging to the same subject domain or sub-domain.

In this paper, we present three different approaches to measure the comparability of cross-lingual (especially under-resourced language) comparable documents: a lexical mapping based approach, a keyword based approach, and a machine translation based approach. The experimental results show that all of them can effectively predict the comparability levels of the compared document pairs. We then further investigate the applicability of the proposed metrics by measuring their impact on the task of parallel phrase extraction from comparable corpora. It turns out that higher comparability levels predicted by the metrics consistently lead to larger numbers of parallel phrases extracted from comparable documents. Thus, the metrics can help select more comparable document pairs to improve the performance of parallel phrase extraction.

The remainder of this paper is organized as follows. Section 2 discusses previous work. Section 3 introduces our comparability metrics. Section 4 presents the experimental results and evaluation. Section 5 describes the application of the metrics. Section 6 discusses the pros and cons of the proposed metrics, followed by conclusions and future work in Section 7.

2 Related Work

The term "comparability", which is the key concept in this work, applies to the level of corpora, documents and sub-document units. However, so far there is no widely accepted definition of comparability. For example, there is no agreement on the degree of similarity that documents in comparable corpora should have, or on the criteria for measuring comparability. Also, most of the work that performs translation equivalent extraction from comparable corpora usually assumes that the corpora used are reliably comparable and focuses on the design of efficient extraction algorithms. Therefore, there has been very little literature discussing the characteristics of comparable corpora (Maia, 2003). In this section, we introduce some representative work which tackles comparability metrics.

Some studies (Sharoff, 2007; Maia, 2003; McEnery and Xiao, 2007) analyse comparability by assessing corpus composition, such as structural criteria (e.g., format and size) and linguistic criteria (e.g., topic, domain, and genre). Kilgarriff and Rose (1998) measure similarity and homogeneity between monolingual corpora. They generate a word frequency list from each corpus and then apply the χ2 statistic to the most frequent n (e.g., 500) words of the compared corpora.

The work which deals with comparability measures in cross-lingual comparable corpora is closer to our work. Saralegi et al. (2008) measure the degree of comparability of comparable corpora (English and Basque) according to the distribution of topics and publication dates of documents. They compute content similarity for all the document pairs between two corpora. These similarity scores are then input as parameters for the EMD (Earth Mover's Distance) measure, which is employed to calculate the global compatibility of the corpora. Munteanu and Marcu (2005; 2006) select more comparable document pairs in a cross-lingual information retrieval based manner by using a toolkit called Lemur (available at http://www.lemurproject.org/). The retrieved document pairs then serve as input for the tasks of parallel sentence and sub-sentence extraction. Smith et al. (2010) treat Wikipedia as a comparable corpus and use "interwiki" links to identify aligned comparable document pairs for the task of parallel sentence extraction. Li and Gaussier (2010) propose a comparability metric which can be applied at both document level and corpus level, and use it as a measure to select more comparable texts from external sources into the original corpora for bilingual lexicon extraction. The metric measures the proportion of words in the source language corpus translated in the target language corpus by looking up a bilingual dictionary. They evaluate the metric on the rich-resourced English-French language pair, for which good dictionary resources are available. However, this is not the case for under-resourced languages, for which reliable language resources such as machine-readable bilingual dictionaries with broad word coverage or word lemmatizers might not be publicly available.

3 Comparability Metrics

To measure the comparability degree of document pairs in different languages, we need to translate the texts or map lexical items from the source language into the target language so that we can compare them within the same language. Usually this can be done by using bilingual dictionaries (Rapp, 1999; Li and Gaussier, 2010; Prochasson and Fung, 2011) or existing machine translation tools. Based on this process, in this section we present three different approaches to measure the comparability of comparable documents.

3.1 Lexical mapping based metric

It is straightforward to expect that a bilingual dictionary can be used for lexical mapping between a language pair. However, unlike the language pairs in which both languages are rich-resourced (e.g., English-French or English-Spanish) and dictionary resources are relatively easy to obtain, it is likely that bilingual dictionaries with good word coverage are not publicly available for under-resourced languages (e.g., English-Slovenian or English-Lithuanian). In order to address this problem, we automatically construct dictionaries by using word alignment on large-scale parallel corpora (e.g., Europarl and JRC-Acquis; the JRC-Acquis covers 22 European languages and provides large-scale parallel corpora for all the 231 language pairs).

Specifically, the GIZA++ toolkit (Och and Ney, 2000) with its default setting is used for word alignment on the JRC-Acquis parallel corpora (Steinberger et al., 2006). The aligned word pairs together with the alignment probabilities are then converted into dictionary entries. For example, in the Estonian-English language pair, the alignment entry "kompanii company 0.625" in the word alignment table means the Estonian word "kompanii" can be translated as (or aligned with) the English candidate word "company" with a probability of 0.625. In the dictionary, the translation candidates are ranked by translation probability in descending order. Note that the dictionary collects inflectional forms of words, not only base forms. This is because the dictionary is directly generated from the word alignment results and no further word lemmatization is applied.

Using the resulting dictionary, we then perform lexical mapping in a word-for-word mapping strategy. We scan each word in the source language text to check if it occurs in the dictionary entries. If so, the first translation candidate is recorded as the corresponding mapping word. If there is more than one translation candidate, the second candidate will also be kept as a mapping result if its translation probability is higher than 0.3 (from manual inspection of the word alignment results, we find that alignments with probability higher than 0.3 are more reliable).
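To make the mapping procedure concrete, here is a small sketch in Java; the paper's own implementation is not shown, so the data structures and names are our assumption.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class LexicalMapper {
    public static class Candidate {
        final String translation;
        final double probability;
        Candidate(String t, double p) { translation = t; probability = p; }
    }

    // word -> candidates, already sorted by alignment probability (descending)
    private final Map<String, List<Candidate>> dictionary;

    public LexicalMapper(Map<String, List<Candidate>> dictionary) {
        this.dictionary = dictionary;
    }

    public List<String> map(List<String> sourceWords) {
        List<String> mapped = new ArrayList<String>();
        for (String w : sourceWords) {
            List<Candidate> cands = dictionary.get(w);
            if (cands == null || cands.isEmpty()) continue; // out-of-dictionary words are dropped
            mapped.add(cands.get(0).translation);            // always keep the top candidate
            if (cands.size() > 1 && cands.get(1).probability > 0.3) {
                mapped.add(cands.get(1).translation);        // keep a reliable second candidate
            }
        }
        return mapped;
    }
}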

3.2 Keyword based metric

The lexical mapping based metric takes all the words in the text into account for the comparability measure, but if we only retain a small number of representative words (keywords) and discard all the other, less informative words in each document, can we judge the comparability of a document pair by comparing these words? Our intuition is that if two documents share more keywords, they should be more comparable. To validate this, we perform keyword extraction using a simple TFIDF based approach, which has been shown effective for keyword or keyphrase extraction from texts (Frank et al., 1999; Hulth, 2003; Liu et al., 2009).

More specifically, the keyword based metric can be described as follows. First, similar to the lexical mapping based metric, bilingual dictionaries are used to map non-English texts into English. Thus, only the English resources are applied for stop-word filtering and word lemmatization, which are useful text preprocessing steps for keyword extraction. We then use TFIDF to measure the weight of words in the document and rank the words by their TFIDF weights in descending order. The top n (e.g., 30) words are extracted as keywords to represent the document. Finally, the comparability of each document pair is determined by applying cosine similarity to their keyword lists, as sketched below.
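The following sketch shows one plausible plain-Python rendering of the TFIDF ranking and the cosine comparison; the paper fixes n = 30 in the experiments, and the helper names are illustrative.

# TFIDF keyword extraction and cosine comparison (sketch).
import math
from collections import Counter

def top_keywords(doc_tokens, corpus, n=30):
    """Rank the tokens of one document by TFIDF against a corpus of token lists."""
    tf = Counter(doc_tokens)
    num_docs = len(corpus)
    def idf(word):
        df = sum(1 for d in corpus if word in d)
        return math.log(num_docs / (1 + df))
    ranked = sorted(tf, key=lambda w: tf[w] * idf(w), reverse=True)
    return ranked[:n]

def cosine(words_a, words_b):
    """Cosine similarity between two bags of words."""
    va, vb = Counter(words_a), Counter(words_b)
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0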

3.3 Machine translation based metrics

A bilingual dictionary is used for word-for-word translation in the lexical mapping based metric, and words which do not occur in the dictionary are omitted. Thus, the mapping result is like a list of isolated words, and information such as word order, syntactic structure and named entities cannot be preserved. Therefore, in order to improve the text translation quality, we turn to state-of-the-art SMT systems.

In practice, we use the Microsoft translation API (available at http://code.google.com/p/microsoft-translator-java-api/) to translate texts in under-resourced languages (e.g., Lithuanian and Slovenian) into English, and then explore several features for the comparability metric design, which are listed below.

- Lexical feature: Lemmatized bag-of-words representation of each document after stop-word filtering. The lexical similarity (denoted by W_L) of each document pair is then obtained by applying the cosine measure to the lexical feature.

- Structure feature: We approximate it by the number of content words (adjectives, adverbs, nouns, verbs and proper nouns) and the number of sentences in each document, denoted by C_D and S_D respectively. The intuition is that if two documents are highly comparable, their numbers of content words and their document lengths should be similar. The structure similarity (denoted by W_S) of two documents D1 and D2 is defined as

  W_S = 0.5 \cdot (C_{D1}/C_{D2}) + 0.5 \cdot (S_{D1}/S_{D2}),

  assuming that C_{D1} <= C_{D2} and S_{D1} <= S_{D2}.

- Keyword feature: The top-20 words (ranked by TFIDF weight) of each document. The keyword similarity (denoted by W_K) of two documents is also measured by cosine.

- Named entity feature: The named entities of each document. If more named entities co-occur in two documents, they are very likely to talk about the same event or subject and thus should be more comparable. We use the Stanford named entity recognizer (available at http://nlp.stanford.edu/software/CRF-NER.shtml) to extract named entities from the texts (Finkel et al., 2005). Again, cosine is applied to measure the similarity of the named entities (denoted by W_N) between a document pair.

We then combine these four different types of score in an ensemble manner. Specifically, a weighted average strategy is applied: each individual score is associated with a constant weight, indicating the relative confidence (importance) of the corresponding type of score. The overall comparability score (denoted by SC) of a document pair is thus computed as

  SC = \alpha \cdot W_L + \beta \cdot W_S + \gamma \cdot W_K + \delta \cdot W_N,

where \alpha, \beta, \gamma, \delta \in [0, 1] and \alpha + \beta + \gamma + \delta = 1. SC is a value between 0 and 1, and a larger SC value indicates a higher comparability level; a sketch of this combination is given below.
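A direct transcription of the ensemble score and of the structure similarity, with the weight defaults taken from the settings reported in Section 4; the function names are our own.

# Weighted ensemble of the four feature similarities (sketch).
def structure_similarity(content_1, content_2, sents_1, sents_2):
    """W_S = 0.5*(C_D1/C_D2) + 0.5*(S_D1/S_D2), with the smaller count on top."""
    c1, c2 = sorted((content_1, content_2))
    s1, s2 = sorted((sents_1, sents_2))
    return 0.5 * (c1 / c2) + 0.5 * (s1 / s2)

def comparability_score(wl, ws, wk, wn,
                        alpha=0.5, beta=0.2, gamma=0.2, delta=0.1):
    """Weighted average of the four similarities; defaults follow Section 4.2."""
    assert abs(alpha + beta + gamma + delta - 1.0) < 1e-9
    return alpha * wl + beta * ws + gamma * wk + delta * wn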
4 Experiment and Evaluation

4.1 Data source

To investigate the reliability of the proposed comparability metrics, we perform experiments on 6 language pairs which contain under-resourced languages: German-English (DE-EN), Estonian-English (ET-EN), Lithuanian-English (LT-EN), Latvian-English (LV-EN), Slovenian-English (SL-EN) and Greek-Romanian (EL-RO). A comparable corpus is collected for each language pair. Based on the definition of comparability levels (see Section 1), human annotators fluent in both languages then manually annotated the comparability degree (parallel, strongly-comparable, and weakly-comparable) at the document level. Hence, these bilingual comparable corpora are used as a gold standard for the experiments. The data distribution for each language pair, i.e., the number of document pairs at each comparability level, is given in Table 1.

Language pair   #document pairs   parallel   strongly-comparable   weakly-comparable
DE-EN           1286              531        715                   40
ET-EN           1648              182        987                   479
LT-EN           1177              347        509                   321
LV-EN           1252              184        558                   510
SL-EN           1795              532        302                   961
EL-RO           485               38         365                   82
Table 1: Data distribution of the gold standard corpora.

4.2 Experimental results

We adopt a simple method for evaluation. For each language pair, we compute the average scores for all the document pairs at the same comparability level, and compare them to the gold standard comparability labels. In addition, in order to better reveal the relation between the scores obtained from the proposed metrics and the comparability levels, we also measure the Pearson correlation between them: the comparability degrees "parallel", "strongly-comparable" and "weakly-comparable" are numerically calibrated as 3, 2, and 1, respectively, and the correlation is computed between these numerical levels and the corresponding average comparability scores derived from the metrics; a sketch is given after Table 2. For the keyword based metric, the top 30 keywords are extracted from each text for the experiment. For the machine translation based metric, we empirically set \alpha = 0.5, \beta = \gamma = 0.2, and \delta = 0.1. This is based on the assumption that the lexical feature can best characterize the comparability given the good translation quality provided by the powerful MT system, while the keyword and named entity features are also better indicators of comparability than the simple document length information.

The results for the lexical mapping based metric, the keyword based metric and the machine translation based metric are listed in Tables 2, 3, and 4, respectively.

Language pair   parallel   strongly-comparable   weakly-comparable   correlation
DE-EN           0.545      0.476                 0.182               0.941
ET-EN           0.553      0.381                 0.228               0.999
LT-EN           0.545      0.461                 0.225               0.964
LV-EN           0.625      0.494                 0.179               0.973
SL-EN           0.535      0.456                 0.314               0.987
EL-RO           0.342      0.131                 0.090               0.932
Table 2: Average comparability scores for the lexical mapping based metric.
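A plain-Python rendering of the correlation computation described above, using the DE-EN row of Table 2 as a worked check; the implementation is our own illustration.

# Pearson correlation between calibrated levels (3/2/1) and average scores (sketch).
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

levels = [3, 2, 1]                 # parallel, strongly-, weakly-comparable
scores = [0.545, 0.476, 0.182]     # DE-EN averages from Table 2
print(round(pearson(levels, scores), 3))  # 0.941, matching the table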

Language pair   parallel   strongly-comparable   weakly-comparable   correlation
DE-EN           0.526      0.486                 0.084               0.941
ET-EN           0.502      0.345                 0.184               0.990
LT-EN           0.485      0.420                 0.202               0.954
LV-EN           0.590      0.448                 0.124               0.975
SL-EN           0.551      0.505                 0.292               0.937
EL-RO           0.210      0.110                 0.031               0.997
Table 3: Average comparability scores for the keyword based metric.

Language pair   parallel   strongly-comparable   weakly-comparable   correlation
DE-EN           0.912      0.622                 0.326               0.999
ET-EN           0.765      0.547                 0.310               0.999
LT-EN           0.755      0.613                 0.308               0.984
LV-EN           0.770      0.627                 0.236               0.966
SL-EN           0.779      0.582                 0.373               0.988
EL-RO           0.863      0.446                 0.214               0.988
Table 4: Average comparability scores for the machine translation based metric.

Overall, from the average scores for each comparability level presented in Tables 2, 3, and 4, we can see that the scores obtained from the three comparability metrics reliably reflect the comparability levels across different language pairs, as the average scores for higher comparability levels are always significantly larger than those of lower comparability levels, namely SC(parallel) > SC(strongly-comparable) > SC(weakly-comparable). In addition, for all three metrics the Pearson correlation scores are very high (over 0.93) across different language pairs, which indicates a strong correlation between the comparability scores obtained from the metrics and the corresponding comparability levels.

Moreover, from the comparison of Tables 2, 3, and 4, we also have several other findings. Firstly, the performance of the keyword based metric (see Table 3) is comparable to the lexical mapping based metric (see Table 2), as their comparability scores for the corresponding comparability levels are similar. This means it is reasonable to determine the comparability level by comparing only a small number of keywords from the texts. Secondly, the scores obtained from the machine translation based metric (see Table 4) are significantly higher than those of both the lexical mapping based metric and the keyword based metric. Clearly, this is due to the advantages of using a state-of-the-art MT system. In comparison to the approach of using a dictionary for word-for-word mapping, it provides much better text translation, which allows detecting a larger proportion of lexical overlap and mining more useful features in the translated texts. Thirdly, in the lexical mapping based metric and the keyword based metric, we can also see that, although the average scores for EL-RO (both under-resourced languages) conform to the comparability levels, they are much lower than those of the other 5 language pairs. The reason is that the size of the parallel corpora in JRC-Acquis for these 5 language pairs is significantly larger (over 1 million parallel sentences) than that of EL-EN, RO-EN (remember that in our experiments English is used as the pivot language for non-English language pairs), and EL-RO; thus the resulting dictionaries of these 5 language pairs also contain many more dictionary entries.

5 Application

The experiments in Section 4 confirm the reliability of the proposed metrics. The comparability metrics are thus useful for collecting high-quality comparable corpora, as they can help filter out weakly comparable or non-comparable document pairs from the raw crawled corpora. But are they also useful for other NLP tasks, such as translation equivalent detection from comparable corpora? In this section, we further measure the impact of the metrics on parallel phrase extraction (PPE) from comparable corpora. Our intuition is that, if document pairs are assigned higher comparability scores by the metrics, they should be more comparable and thus more parallel phrases can be extracted from them.

The parallel phrase extraction algorithm, which develops the approach presented in Munteanu and Marcu (2006), uses lexical overlap and structural matching measures (Ion, 2012). Taking a list of bilingual comparable document pairs as input, the extraction algorithm involves the following steps (a schematic rendering of the loop follows the list).

1. Split the source and target language documents into phrases.

2. Compute the degree of parallelism for each candidate pair of phrases by using the bilingual dictionary generated from GIZA++ (base dictionary), and retain all the phrase pairs with a score larger than a predefined parallelism threshold.

3. Apply GIZA++ to the retained phrase pairs to detect new dictionary entries and add them to the base dictionary.

4. Repeat Steps 2 and 3 several times (empirically set at 5) using the augmented dictionary, and output the detected phrase pairs.
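A schematic rendering of the four-step loop above. The scoring and dictionary-augmentation callbacks stand in for the PEXACC/GIZA++ components and only illustrate the control flow; the 0.4 default threshold is borrowed from the evaluation below.

# Iterative parallel phrase extraction (control-flow sketch).
def extract_parallel_phrases(doc_pairs, base_dictionary, score_phrase_pair,
                             augment_dictionary, threshold=0.4, iterations=5):
    dictionary = dict(base_dictionary)
    retained = []
    for _ in range(iterations):                   # Step 4: repeat Steps 2-3
        retained = []
        for src_doc, tgt_doc in doc_pairs:        # Step 1 assumed done: documents
            for src in src_doc:                   # are already split into phrases
                for tgt in tgt_doc:
                    score = score_phrase_pair(src, tgt, dictionary)   # Step 2
                    if score >= threshold:
                        retained.append((src, tgt, score))
        dictionary = augment_dictionary(dictionary, retained)         # Step 3
    return retained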

Phrases which are extracted by this algorithm are frequently not exact translation equivalents. Below we give some English-German examples of extracted equivalents with their corresponding alignment scores:

1. But a successful mission — seiner überaus erfolgreichen Mission abgebremst — 0.815501989333333

2. Former President Jimmy Carter — Der ehemalige US-Präsident Jimmy Carter — 0.69708324976825

3. on the Korean Peninsula — auf der koreanischen Halbinsel — 0.8677432145

4. across the Muslim world — mit der muslimischen Welt ermöglichen — 0.893330864

5. to join the United Nations — der Weg in die Vereinten Nationen offensteht — 0.397418711927629

Even though some of the extracted phrases are not exact translation equivalents, they may still be useful resources both for SMT and RBMT if these phrases are passed through an extra preprocessing stage, or if the engines are modified specifically to work with semi-parallel translation equivalents extracted from comparable texts. We address this issue in the discussion section (see Section 6).

For evaluation, we measure how the metrics affect the performance of the parallel phrase extraction algorithm on 5 language pairs (DE-EN, ET-EN, LT-EN, LV-EN, and SL-EN). A large raw comparable corpus for each language pair was crawled from the Web, and the metrics were then applied to assign comparability scores to all the document pairs in each corpus. For each language pair, we set three different intervals based on the comparability score (SC) and randomly select 500 document pairs in each interval for evaluation. For the MT based metric, the three intervals are (1) 0.1<=SC<0.3, (2) 0.3<=SC<0.5, and (3) SC>=0.5. For the lexical mapping based metric and the keyword based metric, since their scores are lower than those of the MT based metric for each comparability level, we set three lower intervals: (1) 0.1<=SC<0.2, (2) 0.2<=SC<0.4, and (3) SC>=0.4. The experiment focuses on counting the number of extracted parallel phrases with parallelism score >= 0.4 (a manual evaluation of a small set of extracted data shows that parallel phrases with parallelism score >= 0.4 are more reliable), and computes the average number of extracted phrases per 100,000 words (the sum of words in the source and target language documents) for each interval; a sketch of this bookkeeping is given after Table 6. In addition, the Pearson correlation measure is applied to measure the correlation between the interval of comparability scores (for the purpose of the correlation measure, the three intervals are numerically calibrated as 1, 2, and 3, respectively) and the number of extracted parallel phrases. The results, which summarize the impact of the three metrics on the performance of parallel phrase extraction, are listed in Tables 5, 6, and 7, respectively.

Language pair   0.1<=SC<0.2   0.2<=SC<0.4   SC>=0.4   correlation
DE-EN           728           1434          2510      0.993
ET-EN           313           631           1166      0.989
LT-EN           258           419           894       0.962
LV-EN           470           859           1900      0.967
SL-EN           393           946           2220      0.975
Table 5: Impact of the lexical mapping based metric on parallel phrase extraction.

Language pair   0.1<=SC<0.2   0.2<=SC<0.4   SC>=0.4   correlation
DE-EN           1007          1340          2151      0.972
ET-EN           438           650           1050      0.984
LT-EN           306           442           765       0.973
LV-EN           600           966           1722      0.980
SL-EN           715           1026          1854      0.967
Table 6: Impact of the keyword based metric on parallel phrase extraction.
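A sketch of the bookkeeping behind Tables 5-7: phrases with parallelism score >= 0.4 are counted per comparability interval and normalised per 100,000 source-plus-target words. The exact normalisation is our reading of the text, and the data structures are illustrative.

# Extracted phrases per 100,000 words, per SC interval (sketch).
def phrases_per_100k(document_pairs, intervals):
    """document_pairs: list of (sc, n_good_phrases, n_words) tuples;
    intervals: list of (low, high) bounds on the comparability score SC."""
    rates = []
    for low, high in intervals:
        bucket = [(p, w) for sc, p, w in document_pairs if low <= sc < high]
        phrases = sum(p for p, _ in bucket)
        words = sum(w for _, w in bucket)
        rates.append(100000 * phrases / words if words else 0.0)
    return rates

# e.g. the intervals used for the MT based metric:
mt_intervals = [(0.1, 0.3), (0.3, 0.5), (0.5, float("inf"))]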

Language pair   0.1<=SC<0.3   0.3<=SC<0.5   SC>=0.5   correlation
DE-EN           861           1547          2552      0.996
ET-EN           448           883           1251      0.999
LT-EN           293           483           1070      0.959
LV-EN           589           1072          2037      0.982
SL-EN           560           1151          2421      0.979
Table 7: Impact of the machine translation based metric on parallel phrase extraction.

From Tables 5, 6, and 7, we can see that for all 5 language pairs, based on the average number of extracted aligned phrases, we clearly have interval (3) > (2) > (1). In other words, for any of the three metrics, a higher comparability level always leads to significantly more aligned phrases extracted from the comparable documents. Moreover, although the lexical mapping based metric and the keyword based metric produce lower comparability scores than the MT based metric (see Section 4), they have a similar impact on the task of parallel phrase extraction. This means that the comparability score itself does not matter much, as long as the metrics are reliable and proper thresholds are set for the different metrics.

For all three metrics, the Pearson correlation scores are very close to 1 for all the language pairs, which indicates that the intervals of comparability scores obtained from the metrics are in line with the performance of the equivalent extraction algorithm. Therefore, in order to extract more parallel phrases (or other translation equivalents) from comparable corpora, we can try to improve the corpus comparability by applying the comparability metrics beforehand to add highly comparable document pairs to the corpora.

6 Discussion

We have presented three different approaches to measure comparability at the document level. In this section, we analyze the advantages and limitations of the proposed metrics, and the feasibility of using semi-parallel equivalents in MT.

6.1 Pros and cons of the metrics

Using a bilingual dictionary for lexical mapping is simple and fast. However, as it adopts the word-for-word mapping strategy and out-of-vocabulary (OOV) words are omitted, the linguistic structure of the original texts is badly hurt after mapping. Thus, apart from lexical information, it is difficult to explore more useful features for the comparability metrics. The TFIDF based keyword extraction approach allows us to select more representative words and prune a large number of less informative words from the texts. The keywords are usually relevant to subject and domain terms, which is quite useful in judging the comparability of two documents. Both the lexical mapping based approach and the keyword based approach use a dictionary for lexical translation, and thus rely on the availability and completeness of the dictionary resources or of large-scale parallel corpora.

The machine translation based metric provides much better text translation than the dictionary-based approach, so that the comparability of two documents can be better revealed from the richer lexical information and other useful features, such as named entities. However, the text translation process is expensive, as it depends on the availability of powerful MT systems (alternatively, we can also train MT systems for text translation by using the available SMT toolkits, e.g., Moses, on large-scale parallel corpora) and takes much longer than the simple dictionary based translation.

In addition, we use a translation strategy of translating texts from under-resourced (or less-resourced) languages into a rich-resourced language. In case both languages are under-resourced, English is used as the pivot language for translation. This can compensate for the shortage of linguistic resources in the under-resourced languages and take advantage of the various resources in the rich-resourced languages.

6.2 Using semi-parallel equivalents in MT systems

We note that modern SMT and RBMT systems take maximal advantage of strictly parallel phrases, but they still do not use the full potential of the semi-parallel translation equivalents of the type illustrated in the application section (see Section 5). Such resources, even though they are not exact equivalents, contain useful information which is not used by the systems.

In particular, modern decoders do not work with under-specified phrases in phrase tables, and do not work with factored semantic features. For example, consider the phrase pair:

But a successful mission — seiner überaus erfolgreichen Mission abgebremst

The English side contains the word but, which pre-supposes contrast, and the German side contains the words überaus erfolgreichen ("generally successful") and abgebremst ("slowed down"), which taken together exemplify a contrast, since they have different semantic prosodies.

In this example the semantic feature of contrast can be extracted and reused in other contexts. However, this would require the development of a new generation of decoders or rule-based systems which can successfully identify and reuse such subtle semantic features.

7 Conclusion and Future work

The success of extracting good-quality translation equivalents from comparable corpora to improve machine translation performance highly depends on "how comparable" the used corpora are. In this paper, we propose three different comparability measures at the document level. The experiments show that all three approaches can effectively determine the comparability levels of comparable document pairs. We also investigate the impact of the metrics on the task of parallel phrase extraction from comparable corpora. It turns out that higher comparability scores always lead to significantly more parallel phrases being extracted from comparable documents. Since comparable corpora of better quality should have better applicability, our metrics can be applied to select highly comparable document pairs for the tasks of translation equivalent extraction.

In future work, we will conduct a more comprehensive evaluation of the metrics by measuring their impact on the performance of machine translation systems with extended phrase tables derived from comparable corpora.

Acknowledgments

We thank Radu Ion at RACAI for providing us with the parallel phrase extraction toolkit, and the three anonymous reviewers for valuable comments. This work is supported by the EU funded ACCURAT project (FP7-ICT-2009-4-248347) at the Centre for Translation Studies, University of Leeds.

References

Bogdan Babych, Serge Sharoff and Anthony Hartley. 2008. Generalising Lexical Translation Strategies for MT Using Comparable Corpora. Proceedings of LREC 2008, Marrakech, Morocco.

Yun-Chuang Chiao and Pierre Zweigenbaum. 2002. Looking for candidate translational equivalents in specialized, comparable corpora. Proceedings of COLING 2002, Taipei, Taiwan.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, USA.

Jenny Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of ACL 2005, University of Michigan, Ann Arbor, USA.

Eibe Frank, Gordon Paynter and Ian Witten. 1999. Domain-specific keyphrase extraction. Proceedings of IJCAI 1999, Stockholm, Sweden.

Pascale Fung and Percy Cheung. 2004a. Mining very non-parallel corpora: Parallel sentence and lexicon extraction via bootstrapping and EM. Proceedings of EMNLP 2004, Barcelona, Spain.

Pascale Fung and Percy Cheung. 2004b. Multi-level bootstrapping for extracting parallel sentences from a quasi-comparable corpus. Proceedings of COLING 2004, Geneva, Switzerland.

Anette Hulth. 2003. Improved Automatic Keyword Extraction Given More Linguistic Knowledge. Proceedings of EMNLP 2003, Sapporo, Japan.

Radu Ion. 2012. PEXACC: A Parallel Data Mining Algorithm from Comparable Corpora. Proceedings of LREC 2012, Istanbul, Turkey.

Adam Kilgarriff and Tony Rose. 1998. Measures for corpus similarity and homogeneity. Proceedings of EMNLP 1998, Granada, Spain.

Bo Li and Eric Gaussier. 2010. Improving corpus comparability for bilingual lexicon extraction from comparable corpora. Proceedings of COLING 2010, Beijing, China.

Feifan Liu, Deana Pennell, Fei Liu and Yang Liu. 2009. Unsupervised Approaches for Automatic Keyword Extraction Using Meeting Transcripts. Proceedings of NAACL 2009, Boulder, Colorado, USA.

Belinda Maia. 2003. What are comparable corpora? Proceedings of the Corpus Linguistics workshop on Multilingual Corpora: Linguistic requirements and technical perspectives, 2003, Lancaster, UK.

Anthony McEnery and Zhonghua Xiao. 2007. Parallel and comparable corpora? In Incorporating Corpora: Translation and the Linguist. Translating Europe. Multilingual Matters, Clevedon, UK.

Emmanuel Morin, Beatrice Daille, Koichi Takeuchi and Kyo Kageura. 2007. Bilingual terminology mining — using brain, not brawn comparable corpora. Proceedings of ACL 2007, Prague, Czech Republic.

Dragos Munteanu and Daniel Marcu. 2006. Extracting parallel sub-sentential fragments from non-parallel corpora. Proceedings of ACL 2006, Sydney, Australia.

Dragos Munteanu and Daniel Marcu. 2005. Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4): 477-504.

Dragos Munteanu, Alexander Fraser and Daniel Marcu. 2004. Improved machine translation performance via parallel sentence extraction from comparable corpora. Proceedings of HLT-NAACL 2004, Boston, USA.

Franz Och and Hermann Ney. 2000. Improved Statistical Alignment Models. Proceedings of ACL 2000, Hong Kong, China.

Emmanuel Prochasson and Pascale Fung. 2011. Rare Word Translation Extraction from Aligned Comparable Documents. Proceedings of ACL-HLT 2011, Portland, USA.

Reinhard Rapp. 1995. Identifying Word Translations in Non-Parallel Texts. Proceedings of ACL 1995, Cambridge, Massachusetts, USA.

Reinhard Rapp. 1999. Automatic identification of word translations from unrelated English and German corpora. Proceedings of ACL 1999, College Park, Maryland, USA.

Xabier Saralegi, Inaki Vicente and Antton Gurrutxaga. 2008. Automatic Extraction of Bilingual Terms from Comparable Corpora in a Popular Science Domain. Proceedings of the Workshop on Comparable Corpora, LREC 2008, Marrakech, Morocco.

Serge Sharoff. 2007. Classifying Web corpora into domain and genre using automatic feature identification. Proceedings of the 3rd Web as Corpus Workshop, Louvain-la-Neuve, Belgium.

Serge Sharoff, Bogdan Babych and Anthony Hartley. 2006. Using Comparable Corpora to Solve Problems Difficult for Human Translators. Proceedings of ACL 2006, Sydney, Australia.

Jason Smith, Chris Quirk and Kristina Toutanova. 2010. Extracting Parallel Sentences from Comparable Corpora using Document Level Alignment. Proceedings of NAACL 2010, Los Angeles, USA.

Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat and Dan Tufis. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. Proceedings of LREC 2006, Genoa, Italy.

Kun Yu and Junichi Tsujii. 2009. Extracting bilingual dictionary from comparable corpora with dependency heterogeneity. Proceedings of HLT-NAACL 2009, Boulder, Colorado, USA.

Full Machine Translation for Factoid Question Answering

Cristina España-Bonet and Pere R. Comas
TALP Research Center
Universitat Politècnica de Catalunya (UPC)
{cristinae,pcomas}@lsi.upc.edu

Abstract

In this paper we present an SMT-based approach to Question Answering (QA). QA is the task of extracting exact answers in response to natural language questions. In our approach, the answer is a translation of the question obtained with an SMT system. We use the n-best translations of a given question to find similar sentences in the document collection that contain the real answer. Although it is not the first time that SMT inspires a QA system, it is the first approach that uses a full Machine Translation system for generating answers. Our approach is validated with the datasets of the TREC QA evaluation.

1 Introduction

Question Answering (QA) is the task of extracting short, relevant textual answers from a given document collection in response to natural language questions. QA extends IR techniques because it outputs concrete answers to a question instead of references to full documents which are relevant to a query. QA has attracted the attention of researchers for some years, and several public evaluations have recently been carried out in the TREC, CLEF, and NTCIR conferences (Dang et al., 2007; Peñas et al., 2011; Sakai et al., 2008). All the example questions in this paper are extracted from the TREC evaluations.

QA systems are usually classified according to the kind of questions they can answer; factoid, definitional, how-to or why questions are treated in distinct ways. This work focuses on factoid questions, that is, those questions whose answers are semantic entities (e.g., organisation names, person names, numbers, dates, objects, etc.). For example, the question Q1545: What is a female rabbit called? is factoid and its answer, "doe", is a semantic entity (although not a named entity).

Factoid questions written in natural language contain implicit information about the relations between the concepts expressed and the expected outcomes of the search, and QA explicitly exploits this information. Using an IR engine to look up a boolean query would not consider these relations, therefore losing important information. Consider the question Q0677: What was the name of the television show, starring Karl Malden, that had San Francisco in the title? and a candidate answer A. In this question, two types of constraints are expressed over the candidate answers. One is that the expected type of A is a kind of "television show." The rest of the question indicates that "Karl Malden" is related to A as being "starred" by it, and that "San Francisco" is a substring of A. Many factoid questions explicitly express a hyponymy relation about the answer type, and also several other relations describing its context (i.e., spatial, temporal, etc.).

The QA problem can be approached from several points of view, ranging from simple surface pattern matching (Ravichandran and Hovy, 2002) to automated reasoning (Moldovan et al., 2007) or supercomputing (Ferrucci et al., 2010). In this work, we propose to use Statistical Machine Translation (SMT) for the task of factoid QA. Under this perspective, the answer is a translation of the question. It is not the first time that SMT is used for QA tasks; several works have used translation models to determine the answers (Berger et al., 2000; Cui et al., 2005; Surdeanu et al., 2011). But to our knowledge this is the first approach that uses a full Machine Translation system for generating answers.
The paper is organised as follows: Section 2 reviews previous usages of SMT in QA, Section 3 reports our theoretical approach to the task, Section 4 describes our QA system, Section 5 presents the experimental setting, Section 6 analyses the results, and Section 7 draws conclusions.

2 Translation Models in QA

The use of machine translation in IR is not new. Berger and Lafferty (1999) first proposed a probabilistic approach to IR based on methods from SMT. Under their perspective, the human user has an information need that is satisfied by an "ideal" theoretical document d from which the user draws important query words q. This process can be mirrored by a translation model: given the query q, they find the documents in the collection with words a most likely to translate to q. The key ingredient is the set of translation probabilities p(q|a) from IBM model 1 (Brown et al., 1993).

In a posterior work, Berger et al. also introduce the formulation of the QA problem in terms of SMT (Berger et al., 2000). They estimate the likelihood that a given answer containing a word a_i corresponds to a question containing word q_j. This estimation relies on an IBM model 1. The method is tested with a collection of closed-domain Usenet and call-center questions, where each question must be paired with one of the recorded answers. Soricut and Brill (2004) implement a similar strategy, but with a richer formulation and targeted at open-domain QA. Given a question Q, a web-search engine is used to retrieve 3-sentence-long answer texts from FAQ pages. These texts are later ranked by their likelihood of containing the answer to Q, and this likelihood is estimated via a noisy-channel architecture. The work of Murdock and Croft (2005) applies the same strategy to TREC data. They evaluate the TREC 2003 passage retrieval task, in which the system must output a single sentence containing the answer to a factoid question. Murdock and Croft tackle the length disparity in question-answer pairs and show that this MT-based approach outperforms traditional query likelihood techniques.

Riezler et al. (2007) define the problem of answer retrieval from FAQ and social Q/A websites as a query expansion problem. SMT is used to translate the original query terms to the language of the answers, thus obtaining an expanded list of terms usable in standard IR techniques. They also use SMT to perform question paraphrasing. In the same context, Lee et al. (2008) study methods for improving the translation quality by removing noise from the parallel corpus.

SMT can also be applied to sentence representations other than words. Cui et al. (2005) approach the task of passage retrieval for QA with translations of dependency parsing relations. They extract the sequences of relations that link each pair of words in the question and, using the IBM translation model 1, score their similarity to the relations extracted from the candidate passage. Thus, an approximate relation matching score is obtained. Surdeanu et al. (2011) extend the scope of this approach by combining the translation probabilities of words, dependency relations, and semantic roles in the context of answer searching in FAQ collections.

The works we have described so far use archives of question-answer pairs as information sources. They are really doing document retrieval and sentence retrieval rather than question answering, because every document/sentence is known to be the answer of a question written in the form of an answer, and no further information extraction is necessary; they just select the best answer from a given pool of answers. The difference with a standard IR task is that these systems are not searching for relevant documents but for answer documents. In contrast, Echihabi and Marcu (2003) introduce an SMT-based method for extracting the concrete answer in factoid QA. First, they use a standard IR engine to retrieve candidate sentences and process them with a constituent parser. Then, an elaborate process simplifies these parse trees, converting them into sequences of relevant words and/or syntactic tags. This process reduces the length disparity between questions and answers. For the answer extraction, a special tag marking the position of the answer is sequentially added to all suitable positions in the sentence, thus yielding several candidate answers for each sentence. Finally, each answer is rated according to its likelihood of being a translation of the question, according to an IBM model 4 trained on a corpus of TREC and web-based question-answer pairs.
With the exception of the query expansion approaches (Riezler et al., 2007), all the works discussed here use some form of noisy-channel model (translation model and target language model), but they do not perform the decoding part of the SMT process to generate translations, nor do they use the rich feature set of a full SMT system. In fact, the formulation of the noisy channel in these works has very few differences with pure language modelling approaches to QA such as the one of Heie et al. (2011), where two different models for retrieval and filtering are learnt from a corpus of question-answer pairs.

3 Question-to-Answer Translation

The core of our QA system is an SMT system for the Question-to-Answer language pair. In SMT, the best translation for a given source sentence is the most probable one, and the probability of each translation is given by the Bayes theorem. In our case, the source sentence corresponds to the question Q and the target or translation is the sentence containing the answer A. With this correspondence, the fundamental equation of SMT can be written as:

  \hat{A}(Q) = A = \arg\max_A P(A|Q) = \arg\max_A P(Q|A) P(A),   (1)

where P(Q|A) is the translation model and P(A) is the language model, and each of them can be understood as the sum of the probabilities of the segments or phrases that make up the sentence. The translation model quantifies the appropriateness of each segment of Q being answered by A; the language model is a measure of the fluency of the answer sentence and does not take into account which is the question. Since we are interested in identifying the concrete string that answers the question and not a full sentence, this probability is not as important as it is in the translation problem.

The log-linear model (Och and Ney, 2002), a generalisation of the original noisy-channel approach (Eq. 1), estimates the final probability as the logarithmic sum of several terms that depend on both the question Q and the answer sentence A. Using just two of the features, the model reproduces the noisy-channel approach, but written in this way one can include as many features as desired, at the cost of introducing the same number of free parameters. The model in its traditional form includes 8 terms:

  \hat{A}(Q) = A = \arg\max_A \log P(A|Q)
             = \lambda_{lm} \log P(A) + \lambda_{d} \log P_d(A,Q)
             + \lambda_{lg} \log lex(Q|A) + \lambda_{ld} \log lex(A|Q)
             + \lambda_{g} \log P_t(Q|A) + \lambda_{d} \log P_t(A|Q)
             + \lambda_{ph} \log ph(A) + \lambda_{w} \log w(A),   (2)

where P(A) is the language model probability, lex(Q|A) and lex(A|Q) are the generative and discriminative lexical translation probabilities respectively, P_t(Q|A) is the generative translation model, P_t(A|Q) the discriminative one, P_d(A,Q) the distortion model, and ph(A) and w(A) correspond to the phrase and word penalty models. We start by using this form for the answer probability and analyse the importance and validity of the terms in the experiments Section. The \lambda weights, which account for the relative importance of each feature in the log-linear probabilistic model, are commonly estimated by optimising the translation performance on a development set. For this optimisation one may use Minimum Error Rate Training (MERT) (Och, 2003), with BLEU (Papineni et al., 2002) as the reference evaluation metric.

Once the weights are determined and the probabilities estimated from a corpus of question-answer pairs (a parallel corpus in this task), a decoder uses Eq. 2 to score the possible outputs and to find the best answer sentence given a question or, in general, an n-best list of answers; a toy rendering of this scoring is sketched below.

This formulation, although possible from an abstract point of view, is not feasible in practice. The corpus from which the probabilities are estimated is finite, and therefore new questions may not be represented. There is no chance that SMT can generate ex nihilo the knowledge necessary to answer questions such as Q1201: What planet has the strongest magnetic field of all the planets?. So, rather than generating answers via translation, we use translations as indicators of the sentence context where an answer can be found. Context here refers not only to nearby words but also to context at a higher level of abstraction.
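The following sketch illustrates how a decoder applies the log-linear model of Eq. 2: the score of a candidate is a weighted sum of log feature values, and the argmax over candidates picks the answer sentence. The feature names mirror the equation; the numeric values are placeholders only.

# Log-linear scoring of one candidate answer sentence (toy sketch).
import math

def loglinear_score(features, weights):
    """Sum of lambda_i * log h_i(A, Q) over the model features."""
    return sum(weights[name] * math.log(value)
               for name, value in features.items())

weights = {"lm": 0.2, "distortion": 0.1, "lex_gen": 0.1, "lex_disc": 0.1,
           "tm_gen": 0.2, "tm_disc": 0.2, "phrase_pen": 0.05, "word_pen": 0.05}
features = {"lm": 1e-4, "distortion": 0.5, "lex_gen": 1e-3, "lex_disc": 1e-3,
            "tm_gen": 1e-2, "tm_disc": 1e-2, "phrase_pen": math.e, "word_pen": math.e}
print(loglinear_score(features, weights))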
To achieve this, we use two different representations of the question-answer pairs and two different SMT models in our QA system. We call the original strings of text of the question-answer pairs the Level1 representation. The Level2 representation, which aims at being more abstract, more general and more useful in SMT, is constructed by applying this sequence of transformations: 1) Quoted expressions in the question are identified, paired with their counterpart in the answer (in case any exists) and substituted by a special tag QUOTED. 2) Each named entity is substituted by its entity class (e.g., "Karl Malone" by PERSON). 3) Each noun and verb is substituted by its WordNet supersense (e.g., "nickname" by COMMUNICATION); WordNet noun synsets are organised into 26 semantic categories based on logical groupings, e.g., ARTIFACT, ANIMAL, BODY, COMMUNICATION, and the verbs are organised into 15 categories (Fellbaum, 1998). 4) Any remaining word, such as adjectives, adverbs and stop words, is left as is. Additionally, in the answer sentence string, the correct answer entity is substituted by a special tag ANSWER. An example of this annotation is given in Figure 1, and a toy rendering is sketched at the end of this section.

Level1 Q: What is Karl Malone 's nickname ?
Level1 A: Malone , whose overall consistency has earned him the nickname ANSWER , missed both of them with nine seconds remaining .
Level2 Q: What STATIVE B-PERSON 's COMMUNICATION ?
Level2 A: B-PERSON , whose overall ATTRIBUTE POSSESSION POSSESSION him the COMMUNICATION ANSWER , PERCEPTION both of them with B-NUM TIME CHANGE .
Figure 1: Example of the two annotation levels used.

An SMT system trained with Level1 examples will translate Q to answer sentences with vocabulary and structure similar to the learning examples. The Level2 system will translate to a mix of named entities, WordNet supersenses, bare words, and ANSWER markers that represent the abstract structure of the answer sentence. We call the Level2 translations patterns. The rationale of this process is that the SMT model can learn the context where answers appear depending on the structure of the question. The translations obtained from both levels can be searched in the document collection to find very similar sentences.

Note that in Level2 the vocabulary size of the question-answer pairs is dramatically reduced with respect to the original Level1 sentences, as seen in Table 2. Thus, sparseness is reduced and the translation model gains in coverage; patterns are also easier to find than Level1 sentences, and give flexibility and generality to the translation. And, most importantly, patterns capture the context of the answer, pinpointing it with accuracy.

These Level1 and Level2 translations are the core of our QA system, which is presented in the following Section.
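A toy version of the Level2 transformation: named entities become their class, nouns and verbs become their WordNet supersense, the answer string becomes ANSWER, and everything else is left as is. The lookup tables below stand in for a real NERC and a most-frequent-sense WordNet tagger.

# Toy Level2 annotation (sketch; lookup tables replace real taggers).
NE_CLASS = {"Karl Malone": "B-PERSON"}
SUPERSENSE = {"nickname": "COMMUNICATION", "is": "STATIVE"}

def to_level2(tokens, answer=None):
    out = []
    for tok in tokens:
        if answer is not None and tok == answer:
            out.append("ANSWER")
        elif tok in NE_CLASS:
            out.append(NE_CLASS[tok])
        elif tok.lower() in SUPERSENSE:
            out.append(SUPERSENSE[tok.lower()])
        else:
            out.append(tok)  # adjectives, adverbs, stop words stay lexicalised
    return out

print(to_level2(["What", "is", "Karl Malone", "'s", "nickname", "?"]))
# -> ['What', 'STATIVE', 'B-PERSON', "'s", 'COMMUNICATION', '?']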
4 The Question Answering System

Our QA system is a pipeline of three modules. In the first one, the question is analysed and annotated with several linguistic processors. This information is used by the rest of the modules. In the second one, relevant documents are obtained from the document collection with straightforward IR techniques and a list of candidate answers is generated. Finally, these candidate answers are filtered and ranked to obtain a final list of proposed answers. This pipeline is a common architecture for a simple QA system.

4.1 Question Analysis

Questions are processed with a tokeniser, a POS tagger, a chunker, and a NERC. Besides, each word is tagged with its most frequent sense in WordNet. Then, a maximum-entropy classifier determines the most probable expected answer types (EAT) for the question. This classifier is built following the approach of Li and Roth (2005); it can classify questions into 53 different answer types and belongs to our in-house QA system. Finally, a weighted list of relevant keywords is extracted from the question. Their saliences are heuristically determined: the most salient tokens are the quoted expressions, followed by named entities, then sequences of nouns and adjectives, then nouns, and finally verbs and any remaining non-stop word. This list is used in the candidate answer generation module.

4.2 Candidate Answer Generation

The candidate answer generation comprises two steps. First a set of passages is retrieved from the document collection, and then the candidate answers are extracted from the text.

For the retrieval, we have used the passage retrieval module of our in-house QA system. The passage retrieval algorithm initially creates a boolean query with all nouns and the more salient words, and sets a threshold t to 50. It uses the Lucene IR engine (http://lucene.apache.org) to fetch the documents matching the current query, and a subsequent passage construction module extracts passages as document segments where two consecutive keyword occurrences are separated by at most t words. If too few or too many passages are obtained this way, a relaxation procedure is applied (a schematic version is sketched at the end of this subsection). The process iteratively adjusts the salience level of the keywords used in the query, dropping low-salience words when too few passages are obtained or adding them when too many, and it also adjusts the proximity threshold until the quality of the recovered information is satisfactory (see ? for further details).

When the passages have been gathered, they are split into sentences and processed with POS tagging, chunking and a NERC. The candidate answer list is composed of all named entities and all phrases containing a noun. Each candidate is associated to the sentence it has been extracted from.
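A schematic of the relaxation loop just described. The search and passage construction calls are placeholders for the Lucene-based components, and the passage bounds, round limit and threshold step are our own assumptions.

# Query relaxation for passage retrieval (control-flow sketch).
def relax_query(keywords_by_salience, search, build_passages,
                min_passages=10, max_passages=1000, t=50, max_rounds=10):
    """keywords_by_salience: list of (salience, word) pairs."""
    cutoff = 0                                 # 0 = use all keywords in the query
    passages = []
    for _ in range(max_rounds):
        query = [w for s, w in keywords_by_salience if s >= cutoff]
        passages = build_passages(search(query), query, t)
        if len(passages) < min_passages:
            cutoff += 1        # too few: drop low-salience words to broaden the query
            t += 10            # and widen the proximity window (direction assumed)
        elif len(passages) > max_passages:
            cutoff = max(0, cutoff - 1)   # too many: add low-salience words back
        else:
            break
    return passages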
4.3 Answer Ranking

This module selects the best answers from the candidates previously generated. It employs three families of scores to rank them.

Context scores B and R: The n-best list of Level2 question translations is generated. In this step the retrieved sentences are also transformed into the Level2 representation. Then, each candidate answer is replaced by the special ANSWER tag in its associated sentence; thus, each sentence has a unique ANSWER tag, as in the training examples. Finally, each candidate is evaluated by assessing the similarity of the source sentence with the n-best translations.

For this assessment we use two different metrics. One of them is a lexical metric commonly used in machine translation, BLEU (Papineni et al., 2002); a smoothed version is used to evaluate the pairs at sentence level, yielding the score B. The other metric is ROUGE (Lin and Och, 2004), here named R. We use the skip-bigram overlapping measure with a maximum skip distance of 4 unigrams (ROUGE-S4). Contrary to BLEU, ROUGE-S does not require consecutive matches but is still sensitive to word order.

Both BLEU and ROUGE are well-known metrics that are useful for finding partial matchings in long strings of words. Therefore, this is an easy way of implementing an approximate pattern matching algorithm with off-the-shelf components.

Although these scores can determine whether a sentence is a candidate for asserting a certain property of a certain object, they do not have the power to discriminate whether these objects are actually the ones required by the question. The Level2 representation is very coarse and, for example, treats all named entities of the same category as the same word. Thus, it is prone to introduce noise in the form of totally irrelevant answers. For example, consider the questions Q1760: Where was C.S. Lewis born? and Q1519: Where was Hans Christian Anderson born?. Both questions have the same Level2 representation, Where STATIVE PERSON STATIVE?, and the same n-best list of translations. Any sentence stating the birthplace (or even deathplace) of any person is equally likely to be the correct answer of both questions, because the lexicalisation of Lewis and Anderson is lost.

On the other hand, B and R also show another limitation. Since they are based on n-gram matching, they cannot be discriminative enough when there is only one different token between options, and that happens when the same sentence contains different candidates for the answer. In this case the system would be able to distinguish among answer sentences, but all the variations with the answer in a different position would have very similar scores. In order to mitigate these drawbacks, we consider two other scores.

Language scores L_b, L_r, L_f: To alleviate the discriminative problem of the context matching metrics, we calculate the same B and R scores but with Level1 translations and the original lexicalised question. These are the L_b and L_r scores. Additionally, we introduce a new score L_f that does not take into account the n-gram structure of the sentences: after the n-best list of Level1 question translations is generated, the frequency of each word present in the translations is computed. Then, the words in the candidate answer sentence are scored according to their normalised frequency in the translations list and added up together. This score lies in the [0, 1] range; a sketch is given at the end of this section.

Expected answer type score E: This score checks whether the type of the answer we are evaluating matches the expected types determined in the question analysis. For this task, the expected answer types are mapped to named entities and/or supersenses (e.g., type ENTY:product is mapped to ARTIFACT). If the candidate answer is a named entity of the expected type, or contains a noun of the expected supersense, then this candidate receives a score E equal to the confidence of the question classification (the scores of the ME classifier have been previously normalised to probabilities).

These three families of scores can be combined in several ways in order to produce a ranked list of answers. The combination methods are discussed in Section 6.
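A sketch of the frequency score L_f: words in the n-best Level1 translations are counted, and a candidate sentence is scored by the summed normalised frequencies of its words. The exact normalisation that keeps the score in [0, 1] is our own assumption.

# Frequency-based language score L_f (sketch).
from collections import Counter

def lf_score(candidate_tokens, nbest_translations):
    counts = Counter(tok for sent in nbest_translations for tok in sent)
    total = sum(counts.values())
    if not candidate_tokens or total == 0:
        return 0.0
    score = sum(counts[tok] / total for tok in candidate_tokens)
    return min(score / len(candidate_tokens), 1.0)  # keep the score in [0, 1]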
5 Experiments

5.1 Training and Test Corpora

We have used the datasets from the Question Answering Track of the TREC evaluation campaigns (http://trec.nist.gov/data/qamain.html), ranging from TREC-9 to TREC-16, in our experiments. These datasets provide both a robust testbed for evaluation and a source of question-answer pairs to use as a parallel corpus for training our SMT system. Each TREC evaluation provides a collection of documents composed of newspaper texts (three different collections have been used over the years), a set of new questions, and an answer key providing both the answer string and the source document. Descriptions of these collections can be found in the TREC overviews (Voorhees, 2002; Dang et al., 2007).

We use the TREC-11 questions for test purposes; the remaining sets are used for training, except for some parts of TREC-9, TREC-10 and TREC-12 that are kept for fitting the weights of our SMT system. To gather the SMT corpus, we select all the factoid questions whose answer can be found in the documents and extract the full sentence that contains the answer. With this methodology, a parallel corpus with 12,335 question-answer pairs is obtained. We have divided it into two subsets: the pairs with only a single answer found in the documents are used for the development set, and the remaining pairs (i.e., those having multiple occurrences of the correct answer) are used for training. The test set consists of the 500 TREC-11 questions; 452 of them have a correct answer in the documents. The numbers are summarised in Table 1.

        Q      A      TRECs
Train   2264   12116  9,10,12,13,14,15,16
Dev     219    219    9,10,12
Test    500    2551   11
Table 1: Number of questions and answers in our data sets. The TREC evaluations from which they are obtained are indicated.

In order to obtain the Level2 representation of these corpora, the documents and the test sets must be annotated. For the annotation pipeline we use the TnT POS tagger (Brants, 2000), WordNet (Fellbaum, 1998), the YamCha chunker (Kudo and Matsumoto, 2003), the Stanford NERC (Finkel et al., 2005), and an in-house temporal expressions recogniser.

Table 2 shows some statistics for the parallel corpus and the two different levels of annotation. From the SMT point of view, the corpus is too small to estimate the translation probabilities reliably but, as stated before, the Level2 representation diminishes the vocabulary considerably and alleviates the problem.

          Tokens            Vocabulary
          Q       A         Q      A
TrainL1   97028   393978    3232   32013
TrainL2   91567   373008    540    9130
Table 2: Statistics for the 12,116 Q-A pairs in the training corpus according to the annotation level.

5.2 SMT system

The statistical system is a state-of-the-art phrase-based SMT system trained on the previously introduced corpus. Its development has been done using standard, freely available software. The language model is estimated using interpolated Kneser-Ney discounting with SRILM (Stolcke, 2002). Word alignment is done with GIZA++ (Och and Ney, 2003), and both phrase extraction and decoding are done with the Moses package (Koehn et al., 2007). The model weights are optimised with Moses' MERT script against the BLEU evaluation metric.

For the full model, we consider the language model, direct and inverse phrase probabilities, direct and inverse lexical probabilities, phrase and word penalties, and a non-lexicalised reordering.
the system and is estimated as the mean of the in- We begin this analysis studying the perfor- verse ranking of the first correct answer for every mance of the SMT-based parts alone. Table 4 question: MRR= Q−1 P rank−1. i i shows the results when using an SMT decoder 6 Results Analysis that generates a 100-best list, uses a 5-gram lan- guage model and all the features of the transla- Given the set of answers retrieved by the candi- tion model. An example of the generated patterns date answer generation module, a na¨ıve baseline in Level2 representation can be found in Figure 2 system is estimated by selecting randomly 50 an- for the question of Figure 1, Q1565: What is Karl swers for each of the questions. Table 3 shows Malone’s nickname?. the mean of the three measures after applying this Candidate answer sentences are ranked accord- random process 1000 times. The upper bound ing to the similarity with the patterns generated by of this task is the oracle that selects always the translation as measured by BLEU (B), ROUGE- correct answer/sentence if it is present in the re- S4 (R) or combinations of them. To calcu- trieved passages. An answer is considered correct late these metrics the n-best list with patterns is if it perfectly matches the official TREC’s answer considered to be a list of reference translations key and a sentence is correct if it contains a cor- (Fig. 2) to every candidate (Fig. 1). In general, rect answer. The random baseline has a precision a combination of both metrics is more powerful of 0.6%. than any of them alone and the product outper- We also evaluate a second task, sentence re- forms the sum given that in most cases BLEU is trieval for QA (SR). In this task, the system has larger than ROUGE and smooths its effect. The to provide a sentence that contains the answer, but inclusion of the SMT patterns improves the base- not to extract it. Within our SMT approach, both line but it does not imply a quantum leap. T1 is tasks are done simultaneously, because the answer at least three times better than the baseline’s one is extracted according to its context sentence. A but still the system answers less than a 3% of the random baseline for this second task, where only questions. In the first 50 positions the answer is 26 SMT Features T1 T50 MRR QA SR Lex, LM5, 100-best 0.027 0.294 0.057 Metric T1 T50 MRR T1 T50 MRR noLex, LM5, 100-best 0.015 0.281 0.045 Lf 0.016 0.286 0.046 0.137 0.605 0.236 Lex, LM3, 100-best 0.015 0.257 0.041 Lb 0.022 0.304 0.054 0.100 0.581 0.192 Lex, LM7, 100-best 0.033 0.288 0.050 Lr 0.018 0.326 0.060 0.131 0.627 0.225 L 0.038 0.330 0.079 0.147 0.622 0.238 Lex, LM5, 10-best 0.024 0.310 0.056 brf E 0.044 0.373 0.096 0.058 0.579 0.142 Lex, LM5, 1000-best 0.027 0.301 0.061 EL 0.018 0.293 0.048 0.118 0.623 0.214 Lex, LM5, 10000-best 0.011 0.290 0.045 brf BLbrf 0.051 0.337 0.091 0.184 0.616 0.271 Table 5: System performance with different combina- RLbrf 0.033 0.346 0.069 0.191 0.618 0.279 tions of the SMT features used in decoding. BR is the BRLbrf 0.042 0.350 0.082 0.182 0.616 0.273 metric used to score the answers. (B+R)Lbrf 0.044 0.346 0.085 0.187 0.618 0.273 BE 0.035 0.384 0.084 0.086 0.579 0.179 RE 0.035 0.377 0.086 0.131 0.630 0.228 30% found a of the times. In the sentence re- BRE 0.049 0.377 0.098 0.135 0.608 0.220 trieval task, results grow up to 14% and 59% re- (B+R)E 0.040 0.386 0.091 0.102 0.596 0.196 spectively. 
Its difference between tasks shows one BEL 0.093 0.379 0.137 0.200 0.621 0.283 of the limitations of these metrics commented be- brf REL 0.071 0.377 0.123 0.208 0.619 0.294 fore, they are not discriminative enough when the brf BRELbrf 0.091 0.379 0.132 0.200 0.622 0.287 only difference among options is the position of (B+R)ELbrf 0.100 0.377 0.141 0.204 0.621 0.286 the ANSWER tag inside the sentence. This is the empirical indication of the need for a score like Table 6: System performance according to three dif- E. On the other hand, each question has a mean ferent ranking strategies: context score (B and R), the of 5,732 candidate answers, and although T50 is language scores (Lx) and EAT type checking (E). not a significant measure, its good results indicate that the context scores metrics are doing their job. ments in a n-best list usually differ very little, and The highest T50, 0.608, is reached by R and it is this is even more important for a system with a very close to the upper bound 0.667. reduced vocabulary, increasing the size of the list Taking BR as a reference measure, we investi- does not enrich in a substantial way the variety of gate the impact of three features of the SMT in the generated answers and results show no signif- Table 5. Regarding the length of the language icant variances. Given these observations, we fix model used in the statistical translation, there is an SMT system with a 5-gram language model, a trend to improve the accuracy with longer lan- the full set of translation model features and the guage models (T1 is 0.015 for a LM3, 0.027 for generation of a 100-best list for obtaining B and LM5 and 0.033 for LM7 with the product of met- R scores. rics) but recall is not very much affected and the Each score approaches different problems of best values are obtained for LM5. the task and therefore, complement each other Second, the number of features in the trans- rather than overlapping. Table 6 introduces the lation model indicates that the best scores are results of a selected group of score combinations, reached when one reproduces the same number where Lbrf =LbLrLf . of features as a standard translation system. That The scores Lbrf and E alone are not very useful is, all of the measures when the lexical trans- because Lbrf gives the same score to all candi- lation probabilities are ignored are significantly dates in the same sentence and E gives the same lower than when the eight features are used. In score to all candidates of the same type. Exper- a counterintuitive way, the token to token transla- imental results confirm that, as expected, Lbrf is tion probability helps to improve the final system more appropriate for the SR task and E for the although word alignments here can be meaning- QA task, although the figures are very low. When less or nonexistent given the difference in length joining E and the Ls together, no improvement is and structure between question and answer. obtained, and the results for the QA task are worse Finally, the length of the n-best list is not a de- than Lbrf alone, thus demonstrating that Level1 cisive factor to take into account. Since the ele- translations are not good enough for the QA task. 27 A better system combines all the metrics together. Croft (2005), however, since test sets are different The best results are achieved when adding B they are not directly comparable. This is notable and R scores to the combination. All of these because we tackle QA, and sentence retrieval is combinations (i.e. 
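To make the combination concrete, the sketch below shows one way to score a candidate against the n-best pattern list with the product combination discussed above. It is an illustrative reimplementation under stated assumptions: NLTK's sentence-level BLEU stands in for the smoothed BLEU used here, while rouge_s4, l_brf and e_score are hypothetical helper callables for the ROUGE-S4, language and EAT-checking scores.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def combined_score(candidate, patterns, rouge_s4, l_brf, e_score):
    """Score one tokenised candidate against the n-best pattern list,
    used as multiple references ((B+R)ELbrf-style combination)."""
    smooth = SmoothingFunction().method1
    b = sentence_bleu(patterns, candidate, smoothing_function=smooth)
    r = rouge_s4(candidate, patterns)
    return (b + r) * e_score(candidate) * l_brf(candidate)

def rank(candidates, patterns, rouge_s4, l_brf, e_score):
    key = lambda c: combined_score(c, patterns, rouge_s4, l_brf, e_score)
    return sorted(candidates, key=key, reverse=True)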
7 Discussion and Conclusions

The results presented here are our approach to considering question answering as a translation problem. Questions in an abstract representation (Level2) are translated into an abstract representation of the answer, and these generated answers are matched against all the candidates obtained with the retrieval module. The candidates are then ranked according to their similarity with the n-best list of translations, as measured by three families of metrics that include R, B, E and L.

The best combination of metrics is able to answer a 10.0% of the questions in first place (T1). This result is in the lowest part of the table reported by the official TREC-11 overview (Voorhees, 2002). The approach of Echihabi and Marcu (2003), which uses translation probabilities to rank the answers, achieves higher results on the same data set (an MRR of 0.325 versus our 0.141). Although both works use SMT techniques, the approach is quite different. In fact, our system is more similar in spirit to that of Ravichandran and Hovy (2002), which learns regular expressions to find answer contexts and shows significant improvements for out-of-domain test sets, that is, web data. Besides the fact that Echihabi and Marcu use translation models instead of a full translation system, they explicitly treat the problem of the difference in length between the question and the answer. In our work, this is not considered further than by the word and phrase penalty features of the translation model. Future work will address this difficulty.

The results of sentence ranking of our system are similar to those obtained by Murdock and Croft (2005); however, since the test sets are different, they are not directly comparable. This is notable because we tackle QA, and sentence retrieval is obtained as collateral information.

Possible lines of future research include the study of abstraction levels different from Level2. The linguistic processors provide us with intermediate information, such as POS, that is not currently used as WordNet and named entities are. Several other levels combining this information can also be tested in order to find the most appropriate degree of abstraction for each kind of word.

The development part of the SMT system is a delicate issue. MERT is currently optimising towards BLEU, but the final score for ranking the answers is a combination of a smoothed BLEU, ROUGE, L and E. It has been shown that optimising towards the same metric used to evaluate the system is beneficial for translation, but also that BLEU is one of the most robust metrics to be used (Cer et al., 2010), so the issue has to be investigated for the QA problem. Also, refining BLEU and ROUGE for this specific problem can be useful. A first approximation could be an adaptation of the n-gram counting of BLEU and ROUGE so that it is weighted by its distance to the answer; this way, sentences that differ only because of the candidate answer string would be better differentiated.

Related to this, the generation of the candidate answer strings is exhaustive; the suppression of the less frequent candidates could help to eliminate noise in the form of irrelevant answer sentences. Besides, the system correlates these answer strings with the expected answer type of the question (coincidence measured with E). This step should be replaced by an SMT-based mechanism to build a full system based only on SMT. Furthermore, we plan to include the Level1 translations into the candidate answer generation module in order to do query expansion in the style of Riezler et al. (2007).

Acknowledgements

This work has been partially funded by the European Community's Seventh Framework Programme (MOLTO project, FP7-ICT-2009-4-247914) and the Spanish Ministry of Science and Innovation projects (OpenMT-2, TIN2009-14675-C03-01 and KNOW-2, TIN2009-14715-C04-04).

References

A. Berger and J. Lafferty. 1999. Information retrieval as statistical translation. In Proceedings of the ACM SIGIR Conference.

A. Berger, R. Caruana, D. Cohn, D. Freitag, and V. Mittal. 2000. Bridging the lexical chasm: statistical approaches to answer-finding. In Proceedings of the ACM SIGIR Conference.

T. Brants. 2000. TnT – a statistical part-of-speech tagger. In Proceedings of the ANLP Conference.

P. F. Brown, V. J. Della Pietra, S. A. Della Pietra, and R. L. Mercer. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2).

D. Cer, C. D. Manning, and D. Jurafsky. 2010. The best lexical metric for phrase-based statistical MT system optimization. In Proceedings of the HLT Conference.

H. Cui, R. Sun, K. Li, M.Y. Kan, and T.S. Chua. 2005. Question answering passage retrieval using dependency relations. In Proceedings of the ACM SIGIR Conference.

H.T. Dang, D. Kelly, and J. Lin. 2007. Overview of the TREC 2007 question answering track. In Proceedings of the Text REtrieval Conference, TREC.

A. Echihabi and D. Marcu. 2003. A noisy-channel approach to question answering. In Proceedings of the ACL Conference.

C. Fellbaum. 1998. WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.

D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. Kalyanpur, A. Lally, J. Murdock, E. Nyberg, J. Prager, N. Schlaefer, and C. Welty. 2010. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79.

J. R. Finkel, T. Grenager, and C. D. Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the ACL Conference.

M.H. Heie, E.W.D. Whittaker, and S. Furui. 2011. Question answering using statistical language modelling. Computer Speech & Language.

P. Koehn, H. Hoang, A. Mayne, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Annual Meeting of the ACL, Demonstration Session.

T. Kudo and Y. Matsumoto. 2003. Fast methods for kernel-based text analysis. In Proceedings of the ACL Conference.

J.T. Lee, S.B. Kim, Y.I. Song, and H.C. Rim. 2008. Bridging lexical gaps between queries and questions on large online Q&A collections with compact translation models. In Proceedings of the EMNLP Conference.

X. Li and D. Roth. 2005. Learning question classifiers: The role of semantic information. Journal of Natural Language Engineering.

C-Y. Lin and F. Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the ACL Conference.

D. Moldovan, C. Clark, S. Harabagiu, and D. Hodges. 2007. Cogex: A semantically and contextually enriched logic prover for question answering. Journal of Applied Logic, 5(1).

V. Murdock and W.B. Croft. 2005. Simple translation models for sentence retrieval in factoid question answering. In Proceedings of the ACM SIGIR Conference.

F. Och and H. Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the ACL Conference.

F. Och and H. Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1).

F. Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the ACL Conference.

K. Papineni, S. Roukos, T. Ward, and W-J. Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the ACL Conference.

A. Peñas, E. Hovy, P. Forner, Á. Rodrigo, R. Sutcliffe, C. Forascu, and C. Sporleder. 2011. Overview of QA4MRE at CLEF 2011: Question answering for machine reading evaluation. Working Notes of CLEF.

D. Ravichandran and E. Hovy. 2002. Learning surface text patterns for a question answering system. In Proceedings of the ACL Conference.

S. Riezler, A. Vasserman, I. Tsochantaridis, V. Mittal, and Y. Liu. 2007. Statistical machine translation for query expansion in answer retrieval. In Proceedings of the ACL Conference.

T. Sakai, N. Kando, C.J. Lin, T. Mitamura, H. Shima, D. Ji, K.H. Chen, and E. Nyberg. 2008. Overview of the NTCIR-7 ACLIA IR4QA task. In Proceedings of the NTCIR Conference.

R. Soricut and E. Brill. 2004. Automatic question answering: Beyond the factoid. In Proceedings of the HLT-NAACL Conference.

A. Stolcke. 2002. SRILM – an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing.

M. Surdeanu, M. Ciaramita, and H. Zaragoza. 2011. Learning to rank answers to non-factoid questions from web collections. Computational Linguistics, 37(2).

E.M. Voorhees. 2002. Overview of the TREC 2002 question answering track. In Proceedings of the Text REtrieval Conference, TREC.

An Empirical Evaluation of Stop Word Removal in Statistical Machine Translation

Chong Tze Yuang
School of Computer Engineering, Nanyang Technological University, 639798 Singapore
[email protected]

Rafael E. Banchs
Institute for Infocomm Research, A*STAR, 138632, Singapore
[email protected]

Chng Eng Siong
School of Computer Engineering, Nanyang Technological University, 639798 Singapore
[email protected]

Abstract

In this paper we evaluate the possibility of improving the performance of a statistical machine translation system by relaxing the complexity of the translation task, removing the most frequent and predictable terms from the target language vocabulary. Afterwards, the removed terms are inserted back in the relaxed output by using an n-gram based word predictor. Empirically, we have found that when these words are omitted from the text, the perplexity of the text decreases, which may imply a reduction of confusion in the text. We conducted some machine translation experiments to see if this perplexity reduction produced a better translation output. While the word prediction results exhibit 77% accuracy in predicting 40% of the most frequent words in the text, the perplexity reduction did not help to produce better translations.

1 Introduction

It is a characteristic of natural language that a large proportion of running words in a corpus corresponds to a very small fraction of the vocabulary. An analysis of the Brown Corpus has shown that the hundred most frequent words account for 42% of the corpus, while only 0.1% of the vocabulary. On the other hand, words occurring only once account for merely 5.7% of the corpus but 58% of the vocabulary (Bell et al. 1990). This phenomenon can be explained in terms of Zipf's Law, which states that the product of word ranks and their frequencies approximates a constant, i.e. the word-frequency plot is close to a hyperbolic function, and hence the few top ranked words account for a great portion of the corpus. Also, it appears that the top ranked words are mainly function words. For instance, the eight most frequent words in the Brown Corpus are the, of, and, to, a, in, that and is (Bell et al. 1990).

It is a common practice in Information Retrieval (IR) to filter the most frequent words out from processed documents (these are referred to as stop words), as these function words are semantically non-informative and constitute weak indexing terms. By removing this great amount of stop words, not only can space and time complexities be reduced, but document content can be better discriminated by the remaining content words (Fox, 1989; Rijsbergen, 1979; Zou et al., 2006; Dolamic & Savoy 2009).

Inspired by the concept of stop word removal in Information Retrieval, in this work we study the feasibility of stop word removal in Statistical Machine Translation (SMT). Different from Information Retrieval, which ranks or classifies documents, SMT hypothesizes sentences in the target language. Therefore, without explicitly removing frequent words from the documents, we propose to ignore such words in the target language vocabulary, i.e. by replacing those words with a null token. We term this process "relaxation" and the omitted words "relaxed words".

Relaxed SMT here refers to a translation task in which target vocabulary words are intentionally omitted from the training dataset for reducing translation complexity. Since the most frequent words are targeted to be relaxed, as a result, there will be a vast amount of null tokens in the output text, which later shall be recovered in a post-processing stage. The idea of relaxation in SMT is motivated by one of our experimental findings, in which the perplexity measured over a test set decreases when the most frequent words are relaxed. For instance, a 15% perplexity reduction is observed when the twenty most frequent words are relaxed in the English EPPS dataset. The reduction of perplexity allows us to conjecture about the decrease of confusion in the text, from which an SMT system might benefit.

After applying relaxed SMT, the resulting null tokens in the translated sentences have to be replaced by the corresponding words from the set of relaxed words. As relaxed words are chosen from the top ranked words, which possess high occurrences in the corpus, their n-gram probabilities can be reliably trained to serve for word prediction. Also, these words are mainly function words and, from the human perspective, function words are usually much easier to predict from their neighbour context than content words. Consider for instance the sentence the house of the president is very nice. Function words like the, of, and is are certainly easier to predict than content words such as house, president, and nice.

The rest of the paper is organized into four sections. In section 2, we discuss the relaxation strategy implemented for an SMT system, which generates translation outputs that contain null tokens. In section 3, we present the word prediction mechanism used to recover the null tokens occurring in the relaxed translation outputs. In section 4, we present and discuss the experimental results. Finally, in section 5 we present the most relevant conclusions of this work.

2 Relaxation for Machine Translation

In this paper, relaxation refers to the replacement of the most frequent words in a text by a null token. In practice, a set of frequent words is defined and the cardinality of such set is referred to as the relaxation order. For example, let the relaxation order be two and the two words on the top rank be the and is. By relaxing the sample sentence previously presented in the introduction, the following relaxed sentence will be obtained: NULL house of NULL president NULL very nice.

From some of our preliminary experimental results with the EPPS dataset, we did observe that a relaxation order of twenty led to a perplexity reduction of about a 15%. To see whether this contributes to improving the translation performance, we trained a translation system by relaxing the top ranked words in the vocabulary of the target language. In this way, there will be a large number of words in the source language that will be translated to a null token. For example, la (the in Spanish) and es (is in Spanish) will both be translated to a null token in English.

This relaxation of terms is only applied to the target language vocabulary, and it is conducted after the word alignment process but before the extraction of translation units and the computation of model probabilities. The main objective of this relaxation procedure is twofold: on the one hand, it attempts to reduce the complexity of the translation task by reducing the size of the target vocabulary while affecting a large proportion of the running words in the text; on the other hand, it should also help to reduce model sparseness and improve model probability estimates.

Of course, different from the Information Retrieval case, in which stop words are not used at all along the search process, in the considered machine translation scenario the removed words need to be recovered after decoding in order to produce an acceptable translation. The relaxed word replacement procedure, which is based on an n-gram predictor, is implemented as a post-processing step and applied to the relaxed machine translation output in order to produce the final translation result.
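As a minimal illustration of this procedure (a sketch assuming a tokenised corpus, not the exact preprocessing used in the experiments), the relaxation step reduces to:

from collections import Counter

def relax(corpus, k, null="NULL"):
    """Replace the k most frequent tokens (relaxation order k)
    in a list of tokenised sentences with a null token."""
    counts = Counter(tok for sent in corpus for tok in sent)
    relaxed_set = {w for w, _ in counts.most_common(k)}
    relaxed = [[null if tok in relaxed_set else tok for tok in sent]
               for sent in corpus]
    return relaxed, relaxed_set

# With k = 2 and "the"/"is" on top, "the house of the president is
# very nice" becomes "NULL house of NULL president NULL very nice".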
Our bet here is that the gain in the translation step, which is derived from the relaxation strategy, should be enough to compensate the error rate of the word prediction step, producing, overall, a significant improvement in translation quality with respect to the non-relaxed baseline procedure.

The next section describes the implemented word prediction model in detail. It constitutes a fundamental element of our proposed relaxed SMT approach.

3 Frequent Word Prediction

Word prediction has been widely studied and used in several different tasks such as, for example, augmentative and alternative communication (Wandmacher and Antoine, 2007) and spelling correction (Thiele et al., 2000). In addition to the commonly used word n-gram, various language modeling techniques have been applied, such as the semantic model (Luís and Rosa, 2002; Wandmacher and Antoine, 2007) and the class-based model (Thiele et al., 2000; Zohar and Roth, 2000; Ruch et al., 2001).

The role of such a word predictor in our considered problem is to recover the null tokens in the translation output by replacing them with the words that best fit the sentence. This task is essentially a classification problem, in which the most suitable relaxed word for recovering a given null token must be selected. In other words, $w_i = \arg\max_{v_i \in R} P_{\text{sentence}}(\ldots w_{i-1} v_i w_{i+1} \ldots)$, where $P_{\text{sentence}}(\cdot)$ is the probabilistic model, e.g. an n-gram, that estimates the likelihood of a sentence when a null token is recovered with word $v_i$, drawn from the set of relaxed words $R$. The cardinality $|R|$ is referred to as the relaxation order, e.g. $|R| = 5$ implies that the five most frequent words have been relaxed and are candidates to be recovered.

Notice that the word prediction problem in this task is quite different from other works in the literature. This is basically because the relaxed words to be predicted in this case are mainly function words. Firstly, it may not be effective to predict a function word semantically. For example, we are more certain in predicting equity than for, given the occurrence of share in the sentence. Secondly, although class-based modeling is commonly used for prediction, its original intention is to tackle the sparseness problem, whereas our task focuses only on the most frequent words.

In this preliminary work, our prediction mechanism is based on an n-gram model. It predicts the word that yields the maximum a posteriori probability, conditioned on its predecessors. For the case of the trigram model, it can be expressed as follows:

$w_i = \arg\max_{v_i \in R} P(v_i \mid w_{i-2} w_{i-1})$   (1)

Often, there are cases in which more than one null token occurs consecutively. In such cases, predicting a null token is conditioned on the previously recovered null tokens. To prevent a prediction error from being propagated, one possibility is to consider the marginal probability (summed over the relaxed word set) over the words that were previously null tokens. For example, if $v_{i-1}$ is a relaxed word, which has been recovered from earlier predictions, then the prediction of $v_i$ should no longer be conditioned by $v_{i-1}$. This can be computed as follows:

$w_i = \arg\max_{v_i \in R} \sum_{v_{i-1} \in R} P(v_i \mid w_{i-2} v_{i-1})$   (2)

The traditional n-gram model, as discussed previously, can be termed the forward n-gram model as it predicts the word ahead. Additionally, we also tested the backward n-gram to predict the word behind (i.e. on the left hand side of the target word), which can be formulated as:

$w_i = \arg\max_{v_i \in R} P(v_i \mid w_{i+1} w_{i+2})$   (3)

and the bidirectional n-gram to predict the word in the middle, which can be formulated as follows:

$w_i = \arg\max_{v_i \in R} P(v_i \mid w_{i-1}, w_{i+1})$   (4)

Notice that the backward n-gram model can be estimated from the word counts as:

$P(w_i \mid w_{i+1} w_{i+2}) = \frac{C(w_i w_{i+1} w_{i+2})}{C(w_{i+1} w_{i+2})}$   (5)

or, it can also be approximated from the forward n-gram model, as follows:

$P(w_i \mid w_{i+1} w_{i+2}) = \frac{P(w_{i+2} \mid w_i w_{i+1})\, P(w_{i+1} \mid w_i)\, P(w_i)}{P(w_{i+2} \mid w_{i+1})\, P(w_{i+1})}$   (6)

Similarly, the bidirectional n-gram model can be estimated from the word counts:

$P(w_i \mid w_{i-1} w_{i+1}) = \frac{C(w_{i-1} w_i w_{i+1})}{\sum_{u \in V} C(w_{i-1} u w_{i+1})}$   (7)

or approximated from the forward model:

$P(w_i \mid w_{i-1} w_{i+1}) = \frac{P(w_{i+1} \mid w_{i-1} w_i)\, P(w_i \mid w_{i-1})}{\sum_{u \in V} P(w_{i+1} \mid w_{i-1} u)\, P(u \mid w_{i-1})}$   (8)

The word prediction results of using the forward, backward, and bidirectional n-gram models will be presented and discussed in the experimental section.

The three n-gram models discussed so far predict words based on the local word ordering. There are two main drawbacks to this: first, only the neighboring words can be used for prediction, as building higher order n-gram models is costly; and, second, prediction may easily fail when consecutive null tokens occur, especially when all words conditioning the prediction probability are recovered null tokens. Hence, instead of predicting words by maximizing the local probability, predicting words by maximizing a global score (i.e. a sentence probability in this case) may be a better alternative.

At the sentence level, the word predictor considers all possible relaxed word permutations and searches for the one that yields the maximum a posteriori sentence probability. For the trigram model, a relaxed word that maximizes the sentence probability can be predicted as follows:

$w_i = \arg\max_{v_i \in R} \prod_{i=1}^{N} P(v_i \mid w_{i-2} w_{i-1})$   (9)

where $N$ is the number of words in the sentence.
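For concreteness, a minimal sketch of the local recovery of Equation (1) is given below. Here lm is assumed to be a thin wrapper exposing a trigram log-probability (e.g. around an SRILM-trained model); the marginalisation of Equation (2) is left out, and sentences are assumed to be padded with two start symbols.

def predict_local(tokens, i, relaxed_set, lm):
    """Pick the relaxed word maximising P(v | w_{i-2} w_{i-1})."""
    context = (tokens[i - 2], tokens[i - 1])
    return max(relaxed_set, key=lambda v: lm.logprob(v, context))

def recover(tokens, relaxed_set, lm, null="NULL"):
    # tokens is assumed to start with two <s> padding symbols.
    out = list(tokens)
    for i, tok in enumerate(out):
        if tok == null:
            out[i] = predict_local(out, i, relaxed_set, lm)
    return out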

Although the forward, backward, and interpolated models have been shown to be applicable for local word prediction, they make no difference at sentence level predictions as they produce identical sentence probabilities. It is not hard to prove the following identity:

$\prod_{i=1}^{N} P(w_i \mid w_{i-2} w_{i-1}) = \prod_{i=1}^{N} P(w_i \mid w_{i+1} w_{i+2})$   (10)

4 Experimental Results and Discussion

In this section, we first highlight the Zipfian distribution in the corpus and the reduction of perplexity after removing the top ranked words. The estimated n-gram probabilities were then used for word prediction, and we report the resulting prediction accuracy at different relaxation orders. The performance of the SMT system with a relaxed vocabulary is presented and discussed in the last subsection of this section.

4.1 Corpus Analysis

The data used in our experiments is taken from the EPPS (Europarl) corpus. We used the version available through the shared task of the 2007 ACL Workshop on Statistical Machine Translation (Burch et al., 2007). The training set comprises 1.2M sentences with 35M words, while the development and test sets contain 2K sentences each.

From the training set, we computed the twenty most frequent words and ranked them accordingly. We found them to be mainly function words. Their counts follow closely a Zipfian distribution (Figure 1) and account for a vast proportion of the text (Figure 2). Indeed, 40% of the running words are made up by these twenty words.

Figure 1. The twenty top ranked words and their relative frequencies; they include i, a, it, is, in, of, to, as, be, on, for, we, are, the, and, that, have, and the period and comma tokens.

Figure 2. Cumulative relative frequencies of the top ranked words (up to order 20).

We found that when the most frequent words were relaxed from the vocabulary, which means being replaced by a null token, the perplexity (measured with a trigram model) decreased up to 15% for the case of a relaxation order of twenty (Figure 3). This implies that the relaxation makes the text less confusing, which might benefit natural language processing tasks such as machine translation.

Figure 3. Perplexity decreases with the relaxation of the most frequent words.
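These statistics reduce to simple counting. The following sketch (assuming a tokenised training set, not the exact scripts used here) prints the relative and cumulative frequencies behind Figures 1 and 2:

from collections import Counter

def top_word_coverage(sentences, k=20):
    counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(counts.values())
    cumulative = 0.0
    for rank, (word, c) in enumerate(counts.most_common(k), start=1):
        cumulative += c / total
        print(f"{rank:2d}  {word:12s} rel={c / total:6.2%}  cum={cumulative:6.2%}")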

4.2 Word Prediction Accuracy

In order to evaluate the quality of the different prediction strategies, we carried out some experiments for replacing null tokens with relaxed words. For this, frequent words were dropped manually from the text (i.e. replaced with null tokens) and were recovered later by using the word predictor. As discussed earlier, a word can be predicted locally, to yield the maximum n-gram probability, or globally, to yield the maximum sentence probability.

In a real application, a text may comprise up to 40% of null tokens that must be recovered from the twenty top ranked words.

For n-gram level (local) prediction, we evaluated word accuracy at different orders of relaxation. More specifically, we tested the forward trigram model, the backward trigram model, the bidirectional model, and the linear interpolation between the forward and backward trigram models (with weight 0.5). The accuracy was computed as the percentage of null tokens recovered successfully with respect to the original words. These results are shown in Figure 4. Notice that the accuracy at relaxation order one is 100%, so it has not been depicted in the plots.

Figure 4. Accuracy of word prediction at n-gram level. Models incorporating left and right context yield about a 20% improvement over one-sided models.

Notice from the figure how the forward and backward models performed very alike throughout the different relaxation orders. This can be explained in terms of their similar perplexities (both models exhibit a perplexity of 58). Better accuracy was obtained by the interpolated model, which demonstrates the advantage of incorporating the left and right contexts in the prediction. Different from the interpolated model, which simply adds probabilities from the two models, the bidirectional model estimates the probability from the left and right contexts simultaneously during the training phase; thus it produces a better result. However, due to its heavy computational cost (Equation 8), it is infeasible to apply it at orders higher than five.

For predicting words by maximizing the sentence probability, two methods have been tried: first, a brute force method that attempts all possible relaxed word permutations for the occurring null tokens within a sentence and finds the one producing the largest probability estimate; and, second, applying Viterbi decoding over a word lattice, which is built from the sentence by replacing the arcs of null tokens with a parallel network of all relaxed words.

All the arcs in the word lattice have to be weighted with n-gram probabilities in order to search for the best route. In the case of the trigram model, we expand the lattice with the aid of the SRILM toolkit (Stolcke 2002). Both methods yield almost identical prediction accuracy. However, we discarded the brute force approach for the later experiments because of its heavy computational burden.

Better accuracy has been obtained for sentence-level prediction by using a bigram model and a trigram model. These results are shown in Figure 5. From the cumulative frequency shown in Figure 2, we could see that 40% of the words in the text could now be predicted with an accuracy of about 77%.

Figure 5. Accuracy of word prediction at the sentence level.

Figures 4 and 5 have been plotted with the same scale on the vertical axis for easing their visual comparison. The global prediction strategy, which optimizes the overall sentence perplexity, is much more useful for prediction as compared to the local predictions. Furthermore, as seen from Figure 5, the global prediction has better resistance against higher order relaxations. We also observed that the local bidirectional model performed closely to the global bigram model, up to relaxation order five, for which the computation of the bidirectional model is still feasible.
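For illustration, the brute-force variant of the global search can be sketched as follows. Here sent_logprob is a hypothetical callable returning the total trigram log-probability of a token sequence; the lattice-based Viterbi search used in our experiments produces the same output far more efficiently.

from itertools import product

def recover_global(tokens, relaxed_words, sent_logprob, null="NULL"):
    """Try every relaxed-word assignment for the null tokens and keep
    the one maximising the sentence probability (Equation 9)."""
    slots = [i for i, t in enumerate(tokens) if t == null]
    best, best_lp = None, float("-inf")
    for combo in product(relaxed_words, repeat=len(slots)):
        cand = list(tokens)
        for i, w in zip(slots, combo):
            cand[i] = w
        lp = sent_logprob(cand)  # e.g. summed trigram log-probabilities
        if lp > best_lp:
            best, best_lp = cand, lp
    return best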

In Figure 6 we present a scaled version of the plots to focus on the lower orders for comparison purposes. Although the global bigram prediction makes use of all words in the sentence in order to yield the prediction, locally, a given word is only covered by two consecutive bigrams. Thus, the prediction of a word does not depend on the second word before or the second word after. In other words, we could see the bidirectional model as a global model that is applied to a "short segment" (in this case, a three word segment). The only difference here is that the local bidirectional model estimates the probabilities from the corpus and keeps all seen "short segment" probabilities in the language model (in our case, it is derived from the forward bigram), while the global bigram model optimizes the probabilities by searching for the best two consecutive bigrams. The global prediction might only show its advantage when predicting two or more consecutive null tokens.

Figure 6. Comparison among local bidirectional, local interpolation, and global bigram models.

Hence, we believe that, if the bidirectional bigram model is computed from counts and stored, it could perform as well as the global bigram model at much faster speed (as it involves only querying the language model). Similarly, a local bidirectional trigram model (actually a 5-gram) may be comparable to a global trigram model.

Deriving bidirectional n-gram probabilities from a forward model is computationally expensive. In the worst case scenario, where both companion words are relaxed words, the computational complexity is in the order of $O(|V||R|^3)$, where $|V|$ is the vocabulary size and $|R|$ is the number of relaxed words in $V$. Building a bidirectional bigram/trigram model from scratch is therefore worth considering. As all known language model toolkits do not offer this function (even the backward n-gram model is built by first reversing the sentences manually), the discounting/smoothing of the trigram has to be derived. The methods of Good-Turing, Kneser-Ney, absolute discounting, etc. (Chen and Goodman, 1998) can be imitated.
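Under the counts-based reading of Equation (7), such a stored bidirectional model could be queried as in the sketch below (smoothing omitted, although it would have to be derived as discussed above; trigrams is assumed to be a Counter over (previous, word, next) tuples from the training text):

def p_bidirectional(w, w_prev, w_next, trigrams, vocab):
    """P(w | w_prev, w_next) from raw trigram counts (Equation 7)."""
    denom = sum(trigrams[(w_prev, v, w_next)] for v in vocab)
    return trigrams[(w_prev, w, w_next)] / denom if denom else 0.0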

4.3 Translation Performance

As frequent words have been ignored in the target language vocabulary of the machine translation system, the translation outputs will contain a great number of null tokens. The amount of null tokens should approximate the cumulative frequencies shown in Figure 2.

In this experiment, a word predictor was used to recover all null tokens in the translation outputs, and the recovered translations were evaluated with the BLEU metric. All BLEU scores have been computed up to trigram precision. The word predictor used was the global trigram model, which was the best performing system in the word prediction experiments previously described. In order to apply the prediction mechanism as a post-processing step, a word lattice was built from each translation output, for which the null word arcs were expanded with the words of the relaxed set. Finally, the lattice was decoded to produce the best word list as the complete translation output.

To evaluate whether a SMT system benefits from the relaxation strategy, we set up a baseline system in which the relaxed words in the translation output were replaced manually with null tokens. After that, we used the same word predictor as in the relaxed SMT case (global trigram predictor) for recovering the null tokens and regenerating the complete sentences. We then compared the translation output of the relaxed SMT system to the baseline system.
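As a concrete illustration of this evaluation set-up (not the exact scoring script used here), trigram-level BLEU corresponds to uniform weights over 1- to 3-grams, e.g. with NLTK:

from nltk.translate.bleu_score import corpus_bleu

def bleu_trigram(references, hypotheses):
    """references: one list of tokenised reference translations per
    hypothesis; hypotheses: tokenised system outputs."""
    return corpus_bleu(references, hypotheses, weights=(1/3, 1/3, 1/3))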

The results for the baseline (BL) and the relaxed (RX) systems are shown in Figure 7. We evaluated the translation performance for relaxation orders of five and twenty. From these results, it becomes evident that the translation task is not gaining any advantage from the relaxation strategy, which did not outperform the baseline translator, neither at low nor at high orders of relaxation. Notice how the BLEU scores of the baseline systems are better than those of the relaxed systems.

Figure 7. BLEU scores for baseline (BL) and relaxed (RX) translation systems at relaxation orders of five and twenty.

We further analyzed these results by computing BLEU scores for the translation outputs before and after the word prediction step. These results are shown in Figure 8. Notice from Figure 8 that the relaxed translators did not produce any better BLEU score than the corresponding baseline systems, even before word recovery. Although the text after relaxation is less confusing (perplexity decreases about 15% after the twenty most frequent words are relaxed), the resulting perplexity drop was not translated into a BLEU score improvement.

Figure 8. The BLEU scores before and after word prediction.

In terms of the word predictions shown in Figure 8, we can see that this post-processing step performed as consistently for the relaxed SMT systems as for the baseline systems (for which the null tokens were inserted manually into the sentences). Since the word prediction is based on an n-gram model, we may deduce that the relaxed SMT system preserves the syntactic structure of the sentence, as the null tokens in the translation output could be recovered as accurately as in the case of the baseline system.

5 Conclusion and Future Work

We have looked into the problem of predicting the most frequently occurring words in a text. The best of the studied word predictors, which is based on an n-gram language model and a global search at the sentence level, has achieved 77% accuracy when predicting 40% of the words in the text.

We also proposed the idea of relaxed SMT, which consists of replacing top ranked words in the target language vocabulary with a null token. This strategy was originally inspired by the concept of stop word removal in Information Retrieval, and later motivated by the finding that text becomes less confusing after relaxation. However, when relaxation is applied to the machine translation system, our results indicate that the relaxed translation task performs worse than the conventional non-relaxed system. In other words, the SMT system does not seem to be benefiting from the word relaxation strategy, at least in the case of the specific implementation studied here.

As future work, we will attempt to re-tailor the set of relaxed words by, for instance, imposing some constraints to also include some less frequent function words, which may not be informative to the translation system, or, alternatively, excluding some frequent semantically important words from the relaxed set. This remark is based on the observation of the fifty most frequent words in the EPPS dataset, such as president, union, commission, European, and parliament, which could be harmful when ignored by the translation system but are also easy to predict. Hence there is a need to study the effects of different sets of relaxed words on translation performance, as has already been done for the search problem by researchers in the area of Information Retrieval (Fox, 1990; Ibrahim, 2006).

Acknowledgements

The authors would like to thank their corresponding institutions, the Nanyang Technological University and the Institute for Infocomm Research, for their support regarding the development and publishing of this work.

References

Andreas Stolcke. 2002. SRILM – an extensible language modeling toolkit. In Proceedings of ICSLP, 901–904.

Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz and Josh Schroeder. 2007. (Meta-)evaluation of machine translation. In Proceedings of the SMT Workshop, 136–158.

Christopher Fox. 1990. A stop list for general text. SIGIR Forum, 24:19–35.

Cornelis Joost van Rijsbergen. 1979. Information Retrieval. Butterworth-Heinemann.

Feng Zou, Fu Lee Wang, Xiaotie Deng and Song Han. 2006. Automatic identification of Chinese stop words. Research on Computing Science, 18:151–162.

Frank Thiele, Bernhard Rueber and Dietrich Klakow. 2000. Long range language models for free spelling recognition. In Proceedings of the 2000 IEEE ICASSP, 3:1715–1718.

Ibrahim Abu El-Khair. 2006. Effects of stop words elimination for information retrieval: a comparative study. International Journal of Computing & Information Sciences, 4(3):119–133.

João Luís and Garcia Rosa. 2002. Next word prediction in a connectionist distributed representation system. In Proceedings of the 2002 IEEE International Conference on Systems, Man and Cybernetics, 6–11.

Keith Trnka, John McCaw, Debra Yarrington and Kathleen F. McCoy. User interaction with word prediction: the effects of prediction quality. ACM Transactions on Accessible Computing, 1(17):1–34.

Ljiljana Dolamic and Jacques Savoy. 2009. When stopword lists make the difference. Journal of the American Society for Information Science and Technology, 61(1):1–4.

Patrick Ruch, Robert Baud and Antoine Geissbuhler. 2001. Toward filling the gap between interactive and fully-automatic spelling correction using the linguistic context. In Proceedings of the 2001 IEEE International Conference on Systems, Man and Cybernetics, 199–204.

Stanley F. Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, 310–318.

Timothy C. Bell, John G. Cleary and Ian H. Witten. 1990. Text Compression. Prentice Hall.

Tonio Wandmacher and Jean-Yves Antoine. 2007. Methods to integrate a language model with semantic information for a word prediction component. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 506–513.

Yair Even-Zohar and Dan Roth. 2000. A classification approach to word prediction. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics, 124–131.

Natural Language Descriptions of Visual Scenes: Corpus Generation and Analysis

Muhammad Usman Ghani Khan, Rao Muhammad Adeel Nawab, Yoshihiko Gotoh
University of Sheffield, United Kingdom
{ughani, r.nawab, y.gotoh}@dcs.shef.ac.uk

Abstract

As video contents continue to expand, it is increasingly important to properly annotate videos for effective search, mining and retrieval purposes. While the idea of annotating images with keywords is relatively well explored, work is still needed for annotating videos with natural language to improve the quality of video search. The focus of this work is to present a video dataset with natural language descriptions, which is a step ahead of keyword-based tagging. We describe our initial experiences with a corpus consisting of descriptions for video segments crafted from TREC video data. Analysis of the descriptions created by 13 annotators presents insights into humans' interests and thoughts on videos. Such a resource can also be used to evaluate automatic natural language generation systems for video.

1 Introduction

This paper presents our experiences in manually constructing a corpus, consisting of natural language descriptions of video segments crafted from a small subset of TREC video1 data. In a broad sense the task can be considered one form of machine translation as it translates video streams into textual descriptions. To date the number of studies in this field is relatively small, partially because of the lack of an appropriate dataset for such a task. Another obstacle may be the inherently larger variation of descriptions that can be produced for videos than in a conventional translation from one language to another. Indeed humans are very subjective while annotating video streams, e.g., two humans may produce quite different descriptions for the same video. Based on these descriptions we are interested to identify the most important and frequent high level features (HLFs); they may be 'keywords', such as a particular object and its position/moves, used for a semantic indexing task in video retrieval. Mostly HLFs are related to humans, objects, their moves and properties (e.g., gender, emotion and action) (Smeaton et al., 2009).

In this paper we present these HLFs in the form of ontologies, providing two hierarchical structures of important concepts — one most relevant for humans and their actions, and another for non-human objects. The similarity of video descriptions is quantified using a bag of words model. The notion of the sequence of events in a video is quantified using an order preserving sequence alignment algorithm (longest common subsequence). This corpus may also be used for evaluation of automatic natural language description systems.

1.1 Background

The TREC video evaluation consists of an on-going series of annual workshops focusing on a list of information retrieval (IR) tasks. The TREC video promotes research activities by providing a large test collection, uniform scoring procedures, and a forum for research teams interested in presenting their results. The high level feature extraction task aims to identify the presence or absence of high level semantic features in a given video sequence (Smeaton et al., 2009). Approaches to video summarisation have been explored using rushes video2 (Over et al., 2007).

1 www-nlpir.nist.gov/projects/trecvid/
2 Rushes are the unedited video footage, sometimes referred to as a pre-production video.
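As an illustration of the alignment step mentioned above, a standard dynamic-programming longest common subsequence over two tokenised descriptions (a sketch, not the exact implementation used in this work) is:

def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a, b."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            table[i][j] = (table[i - 1][j - 1] + 1 if x == y
                           else max(table[i - 1][j], table[i][j - 1]))
    return table[len(a)][len(b)]

# Example: two annotators' event sequences for the same video.
s1 = "man enters room sits talks".split()
s2 = "man sits and talks".split()
print(lcs_length(s1, s2))  # -> 3 ("man", "sits", "talks")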

TREC video also provides a variety of metadata annotations for video datasets. For the HLF task, transcripts, a list of master shot references, and shot IDs having HLFs in them are provided. Annotations are created for shots (i.e., one camera take) for the summarisation task. Multiple humans performing multiple actions in different backgrounds can be shown in one shot. Annotations typically consist of a few phrases with several words per phrase. Human related features (e.g., their presence, gender, age, action) are often described. Additionally, camera motion and camera angle, ethnicity information and humans' dressing are often stated. On the other hand, details relating to events and objects are usually missing. Human emotion is another piece of information missing in many of such annotations.

Most of the previous datasets are related to specific tasks; PETS (Young and Ferryman, 2005), CAVIAR (Fisher et al., 2005) and Terrascope (Jaynes et al., 2005) are for surveillance videos. KTH (Schuldt et al., 2004) and the Hollywood action dataset (Marszalek et al., 2009) are for human action recognition. The MIT car dataset is for identification of cars (Papageorgiou and Poggio, 1999). Caltech 101 and Caltech 256 are image datasets with 101 and 256 object categories respectively (Griffin et al., 2007), but there is no information about human actions or emotions. There are some datasets specially generated for scene settings, such as the MIT outdoor scene dataset (Oliva and Torralba, 2009). Quattoni and Torralba (2009) created an indoor dataset with 67 different scene categories. For most of these datasets annotations are available in the form of keywords (e.g., actions such as sit, stand, walk). They were developed for keyword search, object recognition or event identification tasks. Rashtchian et al. (2010) provided an interesting dataset of 1000 images which contain natural language descriptions of those images.

In this study we select video clips from the TREC video benchmark for creating annotations. They include categories such as news, meeting, crowd, grouping, indoor/outdoor scene settings, traffic, costume, documentary, identity, music, sports and animal videos. The most important and probably the most frequent content in these videos appears to be a human (or humans), showing their activities, emotions and interactions with other objects. We do not intend to derive a dataset with a full scope of video categories, which is beyond our work. Instead, to keep the task manageable, we aim to create a compact dataset that can be used for developing approaches to translating video contents to natural language description.

2 Corpus Creation

We are exploring approaches to natural language descriptions of video data. Step one of the study is to create a dataset that can be used for development and evaluation. Textual annotations are manually generated in three different flavours, i.e., selection of HLFs (keywords), title assignment (a single phrase) and full description (multiple phrases). Keywords are useful for identification of objects and actions in videos. A title, in a sense, is a summary in the most compact form; it captures the most important content, or the theme, of the video in a short phrase. On the other hand, a full description is lengthy, comprising several sentences with details of objects, activities and their interactions. The combination of keywords, a title, and a full description will create a valuable resource for text based video retrieval and summarisation tasks. Finally, analysis of this dataset provides an insight into how humans generate natural language descriptions for video.

Annotations were manually created for a small subset of data prepared from the rushes video summarisation task and the HLF extraction task for the 2007 and 2008 TREC video evaluations. It consisted of 140 segments of videos — 20 segments for each of the following seven categories:

Action videos: Human posture is visible and a human can be seen performing some action such as 'sitting', 'standing', 'walking' and 'running'.

Close-up: Human face is visible. Facial expressions and emotions usually define the mood of the video (e.g., happy, sad).

News: Presence of an anchor or reporters. Characterised by scene settings such as weather boards at the background.

Meeting: Multiple humans are sitting and communicating. Presence of objects such as chairs and a table.

Grouping: Multiple humans interaction scenes that do not belong to a meeting scenario. A table or chairs may not be present.

Traffic: Presence of vehicles such as cars, buses and trucks. Traffic signals.

Indoor/Outdoor: Scene settings are more obvious than human activities. Examples may be park scenes and office scenes (where computers and files are visible).

Each segment contained a single camera shot, spanning between 10 and 30 seconds in length. Two categories, 'Close-up' and 'Action', are mainly related to humans' activities, expressions and emotions. 'Grouping' and 'Meeting' depict relation and interaction between multiple humans. 'News' videos explain human activities in a constrained environment such as a broadcast studio. The last two categories, 'Indoor/Outdoor' and 'Traffic', are often observed in surveillance videos. They often show humans' interaction with other objects in indoor and outdoor settings. TREC video annotated most video segments with a brief description, comprising multiple phrases and sentences. Further, 13 human subjects prepared additional annotation for these video segments, consisting of keywords, a title and a full description with multiple sentences. These are referred to as hand annotations in the rest of this paper.

Figure 1: Video Description Tool (VDT). An annotator watches one video at a time, selects all HLFs present in the video, describes a theme of the video as a title and creates a full description for important contents in the video.

2.1 Annotation Tool

There exist several freely available video annotation tools. One of the popular video annotation tools is the Simple Video Annotation tool3. It allows to place a simple tag or annotation on a specified part of the screen at a particular time. The approach is similar to the one used by YouTube4. Another well-known video annotation tool is the Video Annotation Tool5. A video can be scrolled for a certain time period and annotations placed for that part of the video. In addition, an annotator can view a video clip, mark a time segment, attach a note to the time segment on a video timeline, or play back the segment. The 'Elan' annotation tool allows to create annotations for both audio and visual data using temporal information (Wittenburg et al., 2006). During that annotation process, a user selects a section of video using the timeline capability and writes annotation for the specific time.

We have developed our own annotation tool for a few reasons. None of the existing annotation tools provided the functionality of generating a description and/or a title for a video segment. Some tools allow selection of keywords in a free format, which is not suitable for our purpose of creating a list of HLFs. Figure 1 shows a screen shot of the video annotation tool developed, which is referred to as the Video Description Tool (VDT). VDT is simple to operate and assists annotators in creating quality annotations. There are three main items to be annotated. An annotator is shown one video segment at a time. Firstly, a restricted list of HLFs is provided for each segment and the annotator is required to select all HLFs occurring in the segment. Second, a title should be typed in. A title may be a theme of the video, typically a phrase or a sentence with several words. Lastly, a full description of the video contents is created, consisting of several phrases and sentences. During the annotation, it is possible to stop, forward, reverse or play again the same video if required. Links are provided for navigation to the next and the previous videos. An annotator can delete or update earlier annotations if required.

3 videoannotation.codeplex.com/
4 www.youtube.com/t/annotations about
5 dewey.at.northwestern.edu/ppad2/documents/help/video.html

2.2 Annotation Process

A total of 13 annotators were recruited to create texts for the video corpus. They were undergraduate or postgraduate students and fluent in English. It was expected that they could produce descriptions of good quality without detailed instructions or further training. A simple instruction set was given, leaving a wide room for individual interpretation about what might be included in the description. For quality reasons each annotator was given one week to complete the full set of videos.

Each annotator was presented with the complete set of 140 video segments on the annotation tool VDT. For each video, annotators were instructed to provide

• a title of one sentence long, indicating the main theme of the video;

• a description of four to six sentences, related to what is shown in the video;

• a selection of high level features (e.g., male, female, walk, smile, table).

Hand annotation 1
(title) interview in the studio;
(description) three people are sitting on a red table; a tv presenter is interviewing his guests; he is talking to the guests; he is reading from papers in front of him; they are wearing a formal suit;

Hand annotation 2
(title) tv presenter and guests
(description) there are three persons; the one is host; others are guests; they are all men;

Hand annotation 3
(title) three men are talking
(description) three people are sitting around the table and talking each other;

Figure 2: A montage showing a meeting scene in a news video and three sets of hand annotations. In this video segment, three persons are shown sitting on chairs around a table — extracted from TREC video '20041116 150100 CCTV4 DAILY NEWS CHN33050028'.

The annotations are made with an open vocabulary — that is, annotators can use any English words as long as they contain only standard (ASCII) characters. They should avoid using any symbols or computer codes. Annotators were further guided not to use proper nouns (e.g., not to state a person's name) or information obtained from audio. They were also instructed to select all HLFs appearing in the video.

3 Corpus Analysis

The 13 annotators created descriptions for 140 videos (seven categories with 20 videos per category), resulting in 1820 documents in the corpus. The total number of words is 30954, hence the average length of one document is 17 words. We counted 1823 unique words and 1643 keywords (nouns and verbs).

Figure 2 shows a video segment for a meeting scene, sampled at 1 fps (frame per second), and three examples of hand annotations. They typically contain two to five phrases or sentences. Most sentences are short, ranging between two to six words. Descriptions of human, gender, emotion and action are commonly observed. Occasionally minor details of objects and events are also stated. Descriptions of the background are often associated with objects rather than humans. It is interesting to observe the subjectivity of the task; a variety of words were selected by individual annotators to express the same video contents. Figure 3 shows another example of a video segment for a human activity and its hand annotations.

3.1 Human Related Features

After removing function words, the frequency of each word was counted in the hand annotations. Two classes were manually defined; one class is related directly to humans: their body structure, identity, action and interaction with other humans. (Another class represents artificial and natural objects and scene settings, i.e., all the words not directly related to humans, although they are important for semantic understanding of the visual scene — described further in the next section.) Note that some related words (e.g., 'woman' and 'lady') were replaced with a single concept ('female'); concepts were then built up into a hierarchical structure for each class.
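The concept normalisation and counting just described amounts to a dictionary lookup followed by frequency counting; a sketch with a purely illustrative synonym map (the actual mapping was hand-built over the full vocabulary) is:

from collections import Counter

# Illustrative synonym-to-concept map; not the full mapping used here.
CONCEPT_MAP = {"woman": "female", "lady": "female", "girl": "female",
               "man": "male", "guy": "male"}

def concept_counts(descriptions):
    """descriptions: tokenised, function-word-free hand annotations."""
    counts = Counter()
    for tokens in descriptions:
        counts.update(CONCEPT_MAP.get(tok, tok) for tok in tokens)
    return counts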

3 Corpus Analysis After removing function words, the frequency for each word was counted in hand annotations. Two 13 annotators created descriptions for 140 videos classes are manually defined; one class is related (seven categories with 20 videos per category), directly to humans, their body structure, identity, resulting in 1820 documents in the corpus. The action and interaction with other humans. (An- total number of words is 30954, hence the av- other class represents artificial and natural objects erage length of one document is 17 words. We and scene settings, i.e., all the words not directly counted 1823 unique words and 1643 keywords related to humans, although they are important (nouns and verbs). for semantic understanding of the visual scene — Figure 2 shows a video segment for a meet- described further in the next section.) Note that ing scene, sampled at 1 fps (frame per second), some related words (e.g., ‘woman’ and ‘lady’) and three examples for hand annotations. They were replaced with a single concept (‘female’); typically contain two to five phrases or sentences. concepts were then built up into a hierarchical Most sentences are short, ranging between two to structure for each class. six words. Descriptions for human, gender, emo- Figure 4 presents human related information tion and action are commonly observed. Occa- observed in hand annotations. Annotators paid sionally minor details for objects and events are full attention to human gender information as the also stated. Descriptions for the background are number of occurrences for ‘female’ and ‘male’ is

Figure 4: Human related information found in the 13 hand annotations. Information is divided into structures (gender, age, identity, emotion, dressing, grouping and body parts) and activities (facial, hand and body). Each box contains a high level concept (e.g., ‘woman’ and ‘lady’ are both merged into ‘female’) and the number of its occurrences.

This highlights the conclusion that the most interesting and important HLF is humans, whenever they appear in a video. On the other hand, age information (e.g., ‘old’, ‘young’, ‘child’) was not identified very often. Names of human body parts have mixed occurrences, ranging from high (‘hand’) to low (‘moustache’). The six basic emotions — anger, disgust, fear, happiness, sadness and surprise, as discussed by Paul Ekman (en.wikipedia.org/wiki/Paul_Ekman) — covered most of the facial expressions.

Dressing became an interesting feature when a human was in a distinctive dress such as a formal suit, a coloured jacket, or an army or police uniform. Videos with multiple humans were common, and thus human grouping information was frequently recognised. Human body parts were involved in the identification of human activities, which included actions such as standing, sitting, walking, moving, holding and carrying. Actions related to the human body and posture were frequently identified. It was rare that unique human identities, such as police, president or prime minister, were described. This may indicate that a viewer wants to know the specific type of an object to describe a particular situation, instead of generalised concepts.

Hand annotation 1
(title) outdoor talking scene;
(description) young woman is sitting on chair in park and talking to man who is standing next to her;

Hand annotation 2
(title) A couple is talking;
(description) two person are talking; a lady is sitting and a man is standing; a man is wearing a black formal suit; a red bus is moving in the street; people are walking in the street; a yellow taxi is moving in the street;

Hand annotation 3
(title) talk of two persons;
(description) a man is wearing dark clothes; he is standing there; a woman is sitting in front of him; they are saying to each other;

Figure 3: A montage of video showing a human activity in an outdoor scene and three sets of hand annotations. In this video segment, a man is standing while a woman is sitting outdoors — from TREC video ‘20041101 160000 CCTV4 DAILY NEWS CHN 41504210’.

3.2 Objects and Scene Settings

Figure 5 shows the hierarchy created for the HLFs that did not appear in Figure 4. Most of the words are related to artificial objects. Humans interact with these objects to complete an activity — e.g., ‘man is sitting on a chair’, ‘she is talking on the phone’, ‘he is wearing a hat’.


in (404); with (120); on (329); near (68); around (63); at (55); on the left (35); in front of (24); down (24); together (24); along (16); beside (16); on the right (16); into (14); far (11); between (10); in the middle (10); outside (8); off (8); over (8); pass-by (8); across (7); inside (7); middle (7); under (7); away (6); after (7)

Figure 6: List of frequent spatial relations with their counts found in hand annotations.

Natural objects were usually in the background, providing additional context for a visual scene — e.g., ‘human is standing in the jungle’, ‘sky is clear today’. Place and location information (e.g., room, office, hospital, cafeteria) was important, as it shows the position of humans or other objects in the scene — e.g., ‘there is a car on the road’, ‘people are walking in the park’.

Colour information often plays an important part in identifying separate HLFs — e.g., ‘a man in black shirt is walking with a woman with green jacket’, ‘she is wearing a white uniform’. The large number of occurrences of colours indicates humans’ interest in observing not only objects but also their colour scheme in a visual scene. Some hand descriptions reflected an annotator’s interest in the scene settings shown in the foreground or in the background, and indoor/outdoor scene settings also interested some annotators. These observations demonstrate that a viewer is interested in the high level details of a video and in the relationships between different prominent objects in a visual scene.

Figure 5: Artificial and natural objects and scene settings were summarised into six groups.

3.3 Spatial Relations

Figure 6 presents a list of the most frequent words and phrases related to spatial relations found in the hand annotations. Spatial relations between HLFs are important when explaining the semantics of visual scenes, and their effective use leads to smooth descriptions. Spatial relations can be categorised into:

static: relations between stationary objects;
dynamic: direction and path of moving objects;
inter-static and dynamic: relations between moving and non-moving objects.

Static relations can establish the scene settings (e.g., ‘chairs around a table’ may imply an indoor scene). Dynamic relations are used for finding activities present in the video (e.g., ‘a man is running with a dog’). Inter-static and dynamic relations are a mixture of stationary and non-stationary objects; they explain the semantics of the complete scene (e.g., ‘persons are sitting on the chairs around the table’ indicates a meeting scene).

3.4 Temporal Relations

Video is a class of time series data formed of highly complex multi-dimensional contents. Let video X be a uniformly sampled frame sequence of length n, denoted by X = {x_1, ..., x_n}, where each frame x_i gives a chronological position in the sequence (Figure 7). To generate a full description of the video contents, annotators use temporal information to join the descriptions of individual frames. For example:

A man is walking. After sometime he enters the room. Later on he is sitting on the chair.

Based on the analysis of the corpus, we describe temporal information in two flavours:

1. temporal information extracted from the activities of a single human;
2. interactions between multiple humans.

The most common relations in video sequences are ‘before’, ‘after’, ‘start’ and ‘finish’ for single humans, and ‘overlap’, ‘during’ and ‘meeting’ for multiple humans. Figure 8 presents a list of the most frequent words in the corpus related to temporal relations. It can be observed that annotators put much more focus on keywords related to activities of multiple humans than on single human cases.

Figure 7: Illustration of a video as a uniformly sampled sequence of length n. A video frame is denoted by x_i, whose spatial context can be represented in the d-dimensional feature space.

single human:
then (25); end (24); before (22); after (16); next (12); later on (12); start (11); previous (11); throughout (10); finish (8); afterwards (6); prior to (4); since (4)

multiple humans:
meet (114); while (37); during (27); at the same time (19); overlap (12); meanwhile (12); throughout (7); equals (4)

Figure 8: List of frequent temporal relations with their counts found in hand annotations.

The keyword ‘meet’ has the highest frequency, as annotators usually consider most scenes involving multiple humans to be meeting scenes. The keyword ‘while’ is mostly used for showing separate activities of multiple humans, as in ‘a man is walking while a woman is sitting’.

3.5 Similarity between Descriptions

A well-established approach to calculating inter-annotator agreement is the kappa statistic (Eugenio and Glass, 2004). However, in the current task it is not possible to compute inter-annotator agreement using this approach because no category was defined for the video descriptions; further, the description length for one video can vary among annotators. Alternatively, the similarity between natural language descriptions can be calculated; an effective and commonly used measure of the similarity between a pair of documents is the overlap similarity coefficient (Manning and Schütze, 1999):

\[ \mathrm{Sim}_{\mathrm{overlap}}(X,Y) = \frac{|S(X,n) \cap S(Y,n)|}{\min(|S(X,n)|,\, |S(Y,n)|)} \]

where S(X,n) and S(Y,n) are the sets of distinct n-grams in documents X and Y respectively. It is a similarity measure related to the Jaccard index (Tan et al., 2006). Note that when the set X is a subset of Y, or the converse, the overlap coefficient is equal to one. Values of the overlap coefficient range between 0 and 1, where 0 represents the situation in which two documents are completely different and 1 the case in which they are exactly the same.

Table 1 shows the average overlap similarity scores for the seven scene categories within the 13 hand annotations. The average was calculated from the scores for each individual description compared with the rest of the descriptions in the same category. The outcome demonstrates that humans have different observations and interests while watching videos. The calculations were repeated under two conditions: one with stop words removed and the Porter stemmer (Porter, 1993) applied but synonyms NOT replaced, and the other with stop words NOT removed but the Porter stemmer applied and synonyms replaced. The latter combination of preprocessing techniques resulted in better scores. Not surprisingly, synonym replacement led to increased scores, indicating that humans do express the same concept using different terms.

The average overlap similarity score was higher for ‘Traffic’ videos than for the other categories: vehicles, rather than humans and their actions, were the major entity in ‘Traffic’ videos, leading annotators to create more uniform descriptions. Scores for some other categories were lower, which probably means that there are more aspects to pay attention to when watching videos in, e.g., the ‘Grouping’ category, resulting in a wider range of natural language expressions.
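To make the measure concrete, the following is a minimal Python sketch of the overlap coefficient over distinct word n-grams. Whitespace tokenisation and the example sentences are our own assumptions; any preprocessing (stemming, stop word removal, synonym replacement) is presumed to happen before the call.

```python
def ngrams(tokens, n):
    """Set of distinct n-grams S(X, n) of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_similarity(x, y, n=1):
    """Sim_overlap(X, Y) = |S(X,n) & S(Y,n)| / min(|S(X,n)|, |S(Y,n)|)."""
    sx, sy = ngrams(x.split(), n), ngrams(y.split(), n)
    if not sx or not sy:
        return 0.0
    return len(sx & sy) / min(len(sx), len(sy))

# Two annotators wording the same scene differently:
print(overlap_similarity("three people are sitting around the table",
                         "three men are talking around the table"))
```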
3.6 Sequence of Events Matching

Video is a class of time series data which can be partitioned into time-aligned frames (images). These frames are tied together sequentially and temporally; it is therefore useful to know how a person captures the temporal information present in a video. As the order is preserved in a sequence of events, a suitable measure to quantify the sequential and temporal information of a description is the longest common subsequence (LCS). This approach computes the similarity between a pair of token (i.e., word) sequences by counting the number of edit operations (insertions and deletions) required to transform one sequence into the other; the output is a sequence of common elements such that no longer such sequence is available.
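A small sketch of this measure under the same assumptions (our own whitespace tokenisation; normalisation by the length of the shorter sequence, as used in the experiments below):

```python
def lcs_length(a, b):
    """Dynamic-programming length of the longest common subsequence."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, wa in enumerate(a, 1):
        for j, wb in enumerate(b, 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if wa == wb
                        else max(dp[i - 1][j], dp[i][j - 1]))
    return dp[len(a)][len(b)]

def lcs_similarity(x, y):
    a, b = x.split(), y.split()
    return lcs_length(a, b) / min(len(a), len(b)) if a and b else 0.0

print(lcs_similarity("a man is walking in the park",
                     "a man is sitting in the park"))  # 6/7, about 0.857
```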

            Action  Close-up  Indoor  Grouping  Meeting  News    Traffic
unigram (A) 0.3827  0.3913    0.4217  0.3809    0.3968   0.4378  0.4687
        (B) 0.4135  0.4269    0.4544  0.4067    0.4271   0.4635  0.5174
bigram  (A) 0.1483  0.1572    0.1870  0.1605    0.1649   0.1872  0.1765
        (B) 0.2490  0.2616    0.2877  0.2619    0.2651   0.2890  0.2825
trigram (A) 0.0136  0.0153    0.0301  0.0227    0.0219   0.0279  0.0261
        (B) 0.1138  0.1163    0.1302  0.1229    0.1214   0.1279  0.1298

Table 1: Average overlapping similarity scores within 13 hand annotations. For each of unigram, bigram and trigram, scores are calculated for seven categories in two conditions: (A) stop words removed and Porter stemmer applied, but synonyms NOT replaced; (B) stop words NOT removed, but Porter stemmer applied and synonyms replaced.

In the experiments, the LCS score between word sequences is normalised by the length of the shorter sequence.

Table 2 presents the results of identifying sequences of events in the hand descriptions using the LCS similarity score. Individual descriptions were compared with the rest of the descriptions in the same category and the average score was calculated. The relatively low scores in the table indicate great variation in annotators’ attention to the sequence of events, or temporal information, in a video: events described by one annotator may not have been listed by another annotator. The ‘News’ category resulted in the highest similarity score, confirming that videos in this category are highly structured.

          raw     synonym  keyword
Action    0.3782  0.3934   0.3955
Close-up  0.4181  0.4332   0.4257
Indoor    0.4248  0.4386   0.4338
Grouping  0.3941  0.4104   0.3832
Meeting   0.3939  0.4107   0.4124
News      0.4382  0.4587   0.4531
Traffic   0.4036  0.4222   0.4093

Table 2: Similarity scores based on the longest common subsequence (LCS) in three conditions: scores without any preprocessing (raw), scores after synonym replacement (synonym), and scores by keyword comparison (keyword). For keyword comparison, verbs and nouns were taken as keywords after stemming and removing stop words.

3.7 Video Classification

To demonstrate an application of this corpus of natural language descriptions, a supervised document classification task is outlined. The tf-idf score can express textual document features (Dumais et al., 1998). Traditional tf-idf represents the relation between a term t and a document d, providing a measure of the importance of the term within a particular document, calculated as

\[ \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t) \qquad (1) \]

where the term frequency tf(t, d) is given by

\[ \mathrm{tf}(t, d) = \frac{N_{t,d}}{\sum_{k} N_{k,d}} \qquad (2) \]

Here N_{t,d} is the number of occurrences of term t in document d, and the denominator is the sum of the numbers of occurrences of all terms in document d, that is, the size of the document |d|. Further, the inverse document frequency idf(t) is

\[ \mathrm{idf}(t) = \log \frac{N}{W(t)} \qquad (3) \]

where N is the total number of documents in the corpus and W(t) is the total number of documents containing term t.

A term-document matrix is then given by the T × D matrix of tfidf(t, d) values. In the experiment, the Naive Bayes probabilistic supervised learning algorithm was applied for classification, using the Weka machine learning library (Hall et al., 2009) with ten-fold cross validation. Performance was measured using precision, recall and F1-measure (Table 3). The F1-measure was low for ‘Grouping’ and ‘Action’ videos, indicating the difficulty of classifying these types of natural language descriptions. The best classification results were achieved for ‘Traffic’ and ‘Indoor/Outdoor’ scenes; the absence of humans and their actions might have contributed to the high classification scores there. Human actions and activities were present in most videos across the various categories, hence the ‘Action’ category produced the lowest results. The ‘Grouping’ category also showed a weaker result, probably because the processing of interactions between multiple people, with their overlapping actions, had not been fully developed. Overall, the classification results are encouraging, demonstrating that this dataset is a good resource for evaluating natural language description systems for short videos.
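A compact sketch of this pipeline on assumed toy data is shown below. The paper used Weka’s Naive Bayes with ten-fold cross validation; scikit-learn is substituted here purely for illustration (its tf-idf differs from equations (1)–(3) in smoothing and normalisation details), and the documents and labels are invented stand-ins for the 1820-description corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = [  # stand-ins for the 1820 hand descriptions
    "three people are sitting around the table and talking",
    "a red bus is moving in the street",
    "a man is walking in the park",
    "a yellow taxi is moving in the road",
]
labels = ["Meeting", "Traffic", "Action", "Traffic"]

vec = TfidfVectorizer()        # tf-idf features in the spirit of eqs. (1)-(3)
X = vec.fit_transform(docs)    # term-document matrix
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform(["two people are talking at a table"])))
# On the full corpus, sklearn.model_selection.cross_val_score(clf, X, labels,
# cv=10) would mirror the paper's ten-fold cross validation.
```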

          precision  recall  F1-measure
Action    0.701      0.417   0.523
Close-up  0.861      0.703   0.774
Grouping  0.453      0.696   0.549
Indoor    0.846      0.915   0.879
Meeting   0.723      0.732   0.727
News      0.679      0.823   0.744
Traffic   0.866      0.869   0.868
average   0.753      0.739   0.736

Table 3: Results for supervised classification using the tf-idf features.

Figure 10: Video classification by titles and by descriptions.

3.8 Analysis of Title and Description

A title may be considered a very short form of summary. We carried out further experiments to calculate the similarity between the title and the description manually created for a video. The length of a title varied between two and five words. Figure 9 shows the average overlap similarity scores between titles and descriptions. It can be observed that, in general, scores for titles were lower than those for descriptions, apart from the ‘News’ and ‘Meeting’ videos. This was probably caused by the short length of titles; by inspection we found phrases such as ‘news video’ and ‘meeting scene’ for these categories.

Figure 9: The average overlap similarity scores for titles and for descriptions. ‘uni’, ‘bi’, and ‘tri’ indicate the unigram, bigram, and trigram based similarity scores, respectively. They were calculated without any preprocessing such as stop word removal or synonym replacement.

Another experiment was performed for the classification of videos based on title information only. Figure 10 shows a comparison of classification performance with titles and with descriptions. We were able to make correct classifications in many videos with titles alone, although the performance was slightly lower for titles only than for descriptions.

4 Conclusion and Future Work

This paper presented our experiments using a corpus created for the natural language description of videos. For a small subset of TREC video data in seven categories, annotators produced titles and descriptions and selected high level features. This paper aimed to characterise the corpus based on an analysis of the hand annotations and a series of experiments on description similarity and video classification. In the future we plan to develop automatic machine annotations for video sequences and compare them against human-authored annotations. Further, we aim to annotate this corpus in multiple languages such as Arabic to generate a multilingual resource for the video processing community.

Acknowledgements

M U G Khan thanks University of Engineering & Technology, Lahore, Pakistan and R M A Nawab thanks COMSATS Institute of Information Technology, Lahore, Pakistan for funding their work under the Faculty Development Program.

References

S. Dumais, J. Platt, D. Heckerman, and M. Sahami. 1998. Inductive learning algorithms and representations for text categorization. In Proceedings of the Seventh International Conference on Information and Knowledge Management, pages 148–155. ACM.

B.D. Eugenio and M. Glass. 2004. The kappa statistic: A second look. Computational Linguistics, 30(1):95–101.

R. Fisher, J. Santos-Victor, and J. Crowley. 2005. CAVIAR: Context aware vision using image-based active recognition.

G. Griffin, A. Holub, and P. Perona. 2007. Caltech-256 object category dataset.

M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I.H. Witten. 2009. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18.

C. Jaynes, A. Kale, N. Sanders, and E. Grossmann. 2005. The Terrascope dataset: A scripted multi-camera indoor video surveillance dataset with ground-truth. In Proceedings of the IEEE Workshop on VS-PETS, volume 4. Citeseer.

Christopher Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.

M. Marszalek, I. Laptev, and C. Schmid. 2009. Actions in context.

A. Oliva and A. Torralba. 2009. MIT outdoor scene dataset.

P. Over, A.F. Smeaton, and P. Kelly. 2007. The TRECVID 2007 BBC rushes summarization evaluation pilot. In Proceedings of the International Workshop on TRECVID Video Summarization, pages 1–15. ACM.

C. Papageorgiou and T. Poggio. 1999. A trainable object detection system: Car detection in static images. Technical Report 1673 (CBCL Memo 180), October.

M.F. Porter. 1993. An algorithm for suffix stripping. Program: Electronic Library and Information Systems, 14(3):130–137.

A. Quattoni and A. Torralba. 2009. Recognizing indoor scenes.

C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier. 2010. Collecting image annotations using Amazon's Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 139–147. Association for Computational Linguistics.

C. Schuldt, I. Laptev, and B. Caputo. 2004. Recognizing human actions: A local SVM approach. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), volume 3, pages 32–36. IEEE.

A.F. Smeaton, P. Over, and W. Kraaij. 2009. High-level feature detection from video in TRECVid: a 5-year retrospective of achievements. Multimedia Content Analysis, pages 1–24.

P.N. Tan, M. Steinbach, V. Kumar, et al. 2006. Introduction to Data Mining. Pearson Addison Wesley, Boston.

P. Wittenburg, H. Brugman, A. Russel, A. Klassmann, and H. Sloetjes. 2006. ELAN: a professional framework for multimodality research. In Proceedings of LREC, volume 2006. Citeseer.

D.P. Young and J.M. Ferryman. 2005. PETS metrics: On-line performance evaluation service. In Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS), pages 317–324.

Combining EBMT, SMT, TM and IR Technologies for Quality and Scale

Sandipan Dandapat (1), Sara Morrissey (1), Andy Way (2), Joseph van Genabith (1)
(1) CNGL, School of Computing, Dublin City University, Glasnevin, Dublin 9, Ireland
{sdandapat,smorri,josef}@computing.dcu.ie
(2) Applied Language Solutions, Delph, UK
[email protected]

Abstract

In this paper we present a hybrid statistical machine translation (SMT)–example-based MT (EBMT) system that shows significant improvement over both SMT and EBMT baseline systems. First we present a runtime EBMT system using a subsentential translation memory (TM). The EBMT system is then combined with an SMT system for effective hybridization of the pair of systems. The hybrid system shows significant improvement in translation quality (0.82 and 2.75 absolute BLEU points) over the baseline SMT system for two different language pairs, English–Turkish (En–Tr) and English–French (En–Fr). However, the EBMT approach suffers from significant time complexity issues for a runtime approach. We explore two methods to make the system scalable at runtime. First, we use a heuristic-based approach. Secondly, we use an IR-based indexing technique to speed up the time-consuming matching procedure of the EBMT system. The index-based matching procedure substantially improves runtime speed without affecting translation quality.

1 Introduction

State-of-the-art phrase-based SMT (Koehn, 2010a) is the most successful MT approach in many large scale evaluations, such as WMT (http://www.statmt.org/wmt11/) and IWSLT (http://www.iwslt2011.org/). At the same time, work continues in the area of EBMT. Some recent EBMT systems include Cunei (Phillips, 2011), CMU-EBMT (Brown, 2011) and OpenMaTrEx (Dandapat et al., 2010). The success of an SMT system often depends on the amount of parallel training corpora available for the particular language pair, and low translation accuracy has been observed for language pairs with limited training resources (Islam et al., 2010; Khalilov et al., 2010). SMT systems effectively discard the actual training data once the models (translation model and language model) have been estimated. This can lead to an inability to guarantee good quality translation even for sentences closely matching those in the training corpora. By contrast, EBMT systems usually maintain a linked relationship between the full sentence pairs in the source and target texts, and because of this EBMT systems can often capture long range dependencies and rich morphology at runtime. In contrast to SMT, however, most EBMT models lack a well-formed probability model, which restricts the use of statistical information in the translation process.

Keeping this in mind, our objective is to develop a good quality MT system that chooses the best approach for each input, in the form of a hybrid SMT-EBMT approach. It is often the case that an EBMT system produces a good translation where SMT systems fail, and vice versa (Dandapat et al., 2011).

An EBMT system relies on past translations to derive the target output for a given input. Runtime EBMT approaches generally do not include any training stage, which has the advantage of not depending on time-consuming preprocessing. On the other hand, their runtime complexity can be considerable. This is due to the time-consuming matching stage at runtime that finds the example (or set of examples) which most closely matches the source-language sentence to be translated.

This matching step often uses some variation of string edit-distance measures (Levenshtein, 1965), which have quadratic time complexity; Ukkonen (1983) gave an algorithm for computing edit distance with worst case complexity O(md), where m is the length of the string and d is the edit distance, which is effective when m ≫ d (we use word-based edit distance, so m is short). This is quite time-consuming even when a moderate number of training examples is used in the matching procedure.

We adopt two alternative approaches to tackle the above problem. First, we use heuristics which are often useful for avoiding some of the computation: for an input sentence, the matching process may not need to compute the string edit distance with all sentences in the example base, and to prune some of the computation we rely on the fact that the input sentence and its closest matching sentence from the example-base are likely to have similar sentence lengths. Second, search engine indexing is an effective way of storing data for fast and accurate retrieval of information; during retrieval, a set of documents is extracted based on its similarity to the input query. In our second approach, we use this concept to efficiently retrieve a potential set of suitable candidate sentences from the example-base when finding the closest match. We index the entire example-base, treating each source-side sentence as a document for the indexer. We show that improvements can be made with our approach in terms of time complexity without affecting translation quality.

The remainder of this paper is organized as follows. The next section presents work related to our EBMT approach. Section 3 describes the MT systems used in our experiments. Section 4 focuses on the two techniques used to make the system scalable. Section 5 presents the experiments in detail. Section 6 presents and discusses the results and provides an error analysis. We conclude in Section 7.

2 Related Work

The EBMT framework was first introduced by Nagao (1984) as the “MT by analogy principle”. The two main approaches to EBMT are distinguished by the inclusion or exclusion of a preprocessing/training stage. Approaches that incorporate a training stage are commonly called “compiled approaches” (Cicekli and Güvenir, 2001). Approaches that do not include a training stage are often referred to as “pure” or “runtime” EBMT approaches, e.g. (Lepage and Denoual, 2005). These approaches have the advantage that they do not depend on any time-consuming preprocessing stages; on the other hand, their runtime complexity can be considerable.

EBMT is often linked with the related concept of a translation memory (TM). A TM essentially stores source- and target-language translation pairs for the effective reuse of previous translations originally created by human translators. TMs are often used to store examples for EBMT systems. After retrieving a set of examples with associated translations, EBMT systems automatically extract translations of suitable fragments and combine them to produce a grammatical target output.

Phrase-based SMT systems (Koehn, 2010a) produce a source–target aligned subsentential phrase table which can be adapted as an additional TM to be used in a CAT environment (Simard, 2003; Biçici and Dymetman, 2008; Bourdaillet et al., 2009; Simard and Isabelle, 2009). Koehn and Senellart (2010b) use SMT to produce the translation of the non-matched fragments after obtaining the TM-based match. EBMT phrases have also been used to populate the knowledge database of an SMT system (Groves et al., 2006). However, to the best of our knowledge, the use of SMT phrase tables within an EBMT system as an additional sub-sentential TM has not been attempted so far. Some work has been carried out on integrating MT in a CAT environment to translate a whole segment using the MT system when no sufficiently well matching translation unit (TU) is found in the TM. The TransType system (Langlais et al., 2002) integrates an SMT system within a text editor to suggest possible continuations of the translations being typed by the translator. By contrast, our approach attempts to integrate the subsentential TM obtained using SMT techniques within an EBMT system.

3 MT Systems

The SMT system used in our hybrid SMT-EBMT approach is the vanilla Moses decoder (http://www.statmt.org/moses/).

Moses (Koehn et al., 2007) is a set of SMT tools that includes routines to automatically train a translation model for any language pair and an efficient decoder to find the most probable translation. Due to lack of space and the wide usage of Moses, here we focus on the novel EBMT system we have developed for our hybrid SMT-EBMT approach. The EBMT system described in this section is based on previous work (Dandapat et al., 2010) and some of the material is reproduced here to make the paper complete.

Like all other EBMT systems, our particular approach comprises three stages: matching, alignment and recombination. Our EBMT system also uses a subsentential TM in addition to the sentence-aligned example-base. Using the original TM as a training set, additional subsentential TUs (words and phrases) are extracted from it based on the word alignments and phrase pairs produced by Moses. These subsentential TUs are used in the alignment and recombination stages of our EBMT system.

3.1 Building a Subsentential TM for EBMT

A TM for EBMT usually contains TUs linked at the sentence, phrase and word level. TUs can be derived manually or automatically (e.g. using the marker hypothesis (Groves et al., 2006)), and are usually linguistically motivated translation units. In this paper, however, we explore a different route, as manual construction of high-quality TMs is time consuming and expensive; furthermore, considering only linguistically motivated TUs may limit the matching potential of a TM. Because of this, we used SMT technology to automatically create the subsentential part of our TM at the phrase (i.e. no longer necessarily linguistically motivated) and word level. Based on the Moses word alignment (using GIZA++ (Och and Ney, 2003)) and phrase table construction, we construct the additional TM for further use within an EBMT approach.

Firstly, we add entries to the TM based on the aligned phrase pairs from the Moses phrase table, using the following two scores:

1. Direct phrase translation probabilities: φ(t|s)
2. Direct lexical weight: lex(t|s)

Table 1 shows an example of phrase pairs with the associated probabilities learned by Moses.

Table 1: Moses phrase equivalence probabilities.
English (s)  Turkish (t)  p(t|s)    lex(t|s)
a hotel      bir otel     0.826087  0.12843
a hotel      bir otelde   0.086957  0.07313
a hotel      otel mi      0.043478  0.00662
a hotel      otel         0.043478  0.22360

We keep all target equivalents in sorted order based on the above probabilities. This helps in the matching procedure, but during recombination we only consider the most probable target equivalent. The following shows the resulting TUs in the TM for the English source phrase a hotel:

a hotel ⇔ {bir otel, bir otelde, otel, otel mi}

Secondly, we add entries to the TM based on the source-to-target word-aligned file, again keeping the multiple target equivalents of a source word in sorted order. This essentially adds source- and target-language equivalent word pairs to the TM. Note that the entries in the TM may contain incorrect source-target equivalents due to unreliable word/phrase alignments produced by Moses.

3.2 EBMT Engine

An overview of the three stages of the EBMT engine is given below.

Matching: In this stage, we find the sentence pair ⟨s_c, t_c⟩ from the example-base that most closely matches the input sentence s. We use a fuzzy-match score (FMS) based on a word-level edit distance metric (Wagner and Fischer, 1974) to find the closest matching source-side sentence from the example-base {s_i}, i = 1, ..., N, based on Equation (i):

\[ \mathrm{score}(s, s_i) = 1 - \frac{\mathrm{ED}(s, s_i)}{\max(|s|, |s_i|)} \qquad \mathrm{(i)} \]

where |x| denotes the length (in words) of a sentence and ED(x, y) refers to the word-level edit distance between x and y. The EBMT system takes the associated translation t_c of the closest matching source sentence s_c to build a skeleton for the translation of the input sentence s.
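A small sketch of the FMS of Equation (i), assuming plain whitespace tokenisation and a standard word-level Levenshtein distance (the example sentences are invented):

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between token lists a and b."""
    dp = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (wa != wb))  # substitution
    return dp[len(b)]

def fms(s, s_i):
    """score(s, s_i) = 1 - ED(s, s_i) / max(|s|, |s_i|)."""
    a, b = s.split(), s_i.split()
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

print(fms("i 'd like a single room", "i 'd like a quiet room"))  # 1 - 1/6
```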

Alignment: After retrieving the closest fuzzy-matched sentence pair ⟨s_c, t_c⟩, we identify the non-matching fragments of the skeleton translation t_c in two steps.

Firstly, we find the matched and non-matched segments between s and s_c using the edit distance trace. Given the two sentences s and s_c, the algorithm finds the minimum possible number of operations (substitutions, additions and deletions) required to change the closest match s_c into the input sentence s. For example, consider the input sentence s = w1 w2 w3 w4 w5 w6 w7 w8 and s_c = w1′ w3 w4 w5 w7 w8 w9′. Figure 1 shows the matched and non-matched sequences between s and s_c obtained from the edit-distance trace.

Algorithm 1 recombination(X, TM)
In: source segment X, subsentential translation memory TM
Out: translation of source segment X
1: mark all words of X as untranslated (untranslatedPortions(X) ← {X})
2: repeat
3:   U = untranslatedPortions(X)
4:   x = longest subsegment in untranslatedPortions(X) such that (x, t_x) ∈ TM
5:   substitute(X, x → t_x)  {substitute x with its target equivalent t_x in X}
6:   remove x from untranslatedPortions(X)
7: until (untranslatedPortions(X) = U)
8: return X
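As a rough Python transcription of Algorithm 1 (our own data structures: the TM is assumed to be a dict from source phrases to their single most probable target equivalent, word reordering is ignored, and the sample TM entries are invented for illustration):

```python
def recombine(segment, tm):
    """Greedily substitute the longest TM-covered subsegments (Algorithm 1)."""
    parts = [("src", segment.split())]   # untranslated source portions
    changed = True
    while changed:                       # repeat ... until no portion changes
        changed = False
        for idx, (kind, toks) in enumerate(parts):
            if kind != "src":
                continue
            for n in range(len(toks), 0, -1):   # longest subsegment first
                hit = next((i for i in range(len(toks) - n + 1)
                            if " ".join(toks[i:i + n]) in tm), None)
                if hit is not None:
                    target = tm[" ".join(toks[hit:hit + n])]
                    parts[idx:idx + 1] = [p for p in (
                        ("src", toks[:hit]),
                        ("tgt", target.split()),
                        ("src", toks[hit + n:])) if p[1]]
                    changed = True
                    break
            if changed:
                break
    return " ".join(w for _, toks in parts for w in toks)

tm = {"my mother": "annem", "present": "hediye", "for": "için"}  # invented
print(recombine("a present for my mother", tm))
```

Any source tokens with no TM entry are passed through untranslated, mirroring the behaviour of the runtime engine when a subsegment cannot be matched.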

s   = w1   w2   w3  w4  w5  w6   w7  w8  −
       |         |   |   |        |   |
s_c = w1′  −    w3  w4  w5  −    w7  w8  w9′
                      ⇓
s   = w1   w2   w3  w4  w5  w6   w7  w8  null
       |    ↓    |   |   |   ↓    |   |    ↓
s_c = w1′  null w3  w4  w5  null w7  w8  w9′

Figure 1: Extraction of matched (connected by vertical bars) and non-matched segments between s and s_c.

Secondly, we align each non-matched segment in s_c with its associated translation using the TM and the GIZA++ alignment. Based on the source-target aligned pairs in the TM, we mark the mismatched segments in t_c. We find the longest possible segment of a non-matched segment in s_c that has a matching target equivalent in t_c according to the source-target equivalents in the TM, and continue this process recursively until no further part of the non-matched segment in s_c can be matched with t_c using the TM. Remaining non-matching segments in s_c are then aligned with segments in t_c using the GIZA++ word alignment information.

Recombination: In the recombination stage, we add or substitute segments from the input sentence s into the skeleton translation t_c; we also delete segments of t_c that have no correspondence in s. After obtaining the source segments (to be added or substituted in t_c) from the input s, we use our subsentential TM to translate these segments. Details of the recombination process are given in Algorithm 1.

3.3 An Illustrative Example

As a running example, for the input sentence in (1a), the corresponding closest fuzzy-matched sentence pair ⟨s_c, t_c⟩ is shown in (1b) and (1c). The portions marked with angled brackets in (1c) are aligned with the mismatched portions in (1b); the character and the following number in the angled brackets indicate the edit operation (‘s’ indicates substitution) and the index of the mismatched segment from the alignment process, respectively.

(1) (a) s: i 'd like a present for my mother .
    (b) s_c: i 'd like a shampoo for greasy hair .
    (c) t_c: ⟨s:1 yağlı saçlar⟩ için bir ⟨s:0 şampuan⟩ istiyorum .

During recombination, we need to replace the two segments in (1c), yağlı saçlar (greasy hair) and şampuan (shampoo), with the two corresponding source segments in (1a), my mother and present, as the intermediate stage (2) on the way to producing a target equivalent:

(2) ⟨1: my mother⟩ için bir ⟨0: present⟩ istiyorum .

Replacing the untranslated segments in (2) with the translations obtained using the TM, we derive in (3) the output translation of the original input sentence in (1a):

(3) ⟨TM translation of my mother⟩ için bir ⟨TM translation of present⟩ istiyorum .

4 Scalability

The main motivation for scalability is to improve the speed of the EBMT system when using a large example-base. The matching procedure in an EBMT system finds the example (or set of examples) which most closely matches the source-language string to be translated.

All matching processes necessarily involve a distance or similarity measure. The most widely used distance measure in EBMT matching is the Levenshtein distance (Levenshtein, 1965; Wagner and Fischer, 1974), which has quadratic time complexity. In our EBMT system, we find the closest sentence at runtime from the whole example-base for a given input sentence using the edit distance matching score. Thus, the matching step of the EBMT system is a time-consuming process with a runtime complexity of O(nm²), where n denotes the size of the example-base and m the average length (in words) of a sentence. Due to this significant runtime complexity, the EBMT system can only handle a moderately sized example-base in the matching stage. However, it is important to handle a large example-base to improve the quality of an MT system. In order to make the system scalable to a larger example-base, we adopt two approaches for finding the closest matching sentences efficiently.

4.1 Grouping

Our first attempt is heuristic-based: we divide the example-base into bins based on sentence length. It is anticipated that the sentence from the example-base that most closely matches an input sentence will fall into a group of comparable length to the input sentence. First, we divide the example-base E into bins by word-level length, E = ∪_{i=1}^{l} E_i with E_i ∩ E_j = ∅ for all i ≠ j, 0 ≤ i, j ≤ l, where E_i denotes the set of sentences of length i and l is the maximum sentence length in E. To find the closest match for a test sentence s of length k, we only consider the examples E_G = ∪_{m=0}^{x} E_{k±m}, where x indicates the window size. In our experiments, we consider values of x from 0 to 2. We find the closest match s_c in E_G for a given test sentence s; E_G has fewer sentences than E, which effectively reduces the time of the matching procedure.
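A sketch of this binning under assumed data structures (the example-base as a list of source-target sentence pairs, whitespace tokenisation):

```python
from collections import defaultdict

def build_bins(example_base):
    """Partition E into bins E_i keyed by source sentence length i."""
    bins = defaultdict(list)
    for src, tgt in example_base:
        bins[len(src.split())].append((src, tgt))
    return bins

def candidates(bins, sentence, window=2):
    """Union of bins E_{k±m} for m = 0..window (x = 0..2 in the paper)."""
    k = len(sentence.split())
    return [pair
            for m in range(window + 1)
            for length in ({k} if m == 0 else {k - m, k + m})
            for pair in bins.get(length, [])]
```

The closest match s_c is then found by applying the FMS of Equation (i) only to the returned candidates instead of to the whole example-base.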
4.2 Indexing

Our second approach to addressing the time complexity is to use indexing. We index the complete example-base using the open-source IR engine SMART (from Cornell University; ftp://ftp.cs.cornell.edu/pub/smart/) and retrieve a potential set of candidate sentences (likely to contain the closest matching sentence) from the example-base. Unigrams extracted from the sentences of the example-base are indexed using the language model (LM), and complete sentences are treated as the retrievable units. In LM-based retrieval we assume that a given query is generated from a unigram document language model; applying the LM retrieval model in our case returns a list of sentences from the example-base, sorted by the estimated probability of generating the given input sentence.

To improve the runtime performance, we integrate the SMART retrieval engine within the matching procedure of our EBMT system. The retrieval engine estimates a potential set of close-matching candidate sentences from the example-base E for a test sentence s. We assume that the closest source-side match s_c of the input sentence s can take its value from the set E_IR(s), the potential set of close-matching sentences computed by the LM-based retrieval engine; we use the top 50 candidate sentences from E_IR(s). Since the IR engine retrieves documents (sentences from E) for a given query (input) sentence, it is likely to retrieve the closest matching sentence s_c within E_IR(s). Due to the much reduced set of possibilities, this approach improves the runtime performance of the EBMT system without hampering system accuracy. Finding this potential set of candidate sentences is much faster than traditional edit-distance-based retrieval over the full example-base, as the worst case run time of the retriever is O(Σ_{w_i} s_i), where w_i is a word in the input sentence and s_i is the number of sentences in the example-base that contain w_i. Finding the sets of candidate sentences took only 0.3 seconds and 116 seconds, respectively, for the 414 and 10,000 input sentences, given the 20k and 250k sentence example-bases of our En–Tr and En–Fr experiments, on a 3GHz Core 2 Duo machine with 4GB RAM.
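The sketch below imitates this candidate retrieval with a tiny inverted index and crude unigram-LM scoring. The real system uses SMART, so the smoothing and scoring details here are only illustrative assumptions; the key property it shares with the description above is that only sentences sharing at least one word with the query are scored.

```python
import math
from collections import Counter, defaultdict

def build_index(sentences):
    """Inverted index plus a unigram model (counts, size) per sentence."""
    index, models = defaultdict(set), []
    for doc_id, sent in enumerate(sentences):
        counts = Counter(sent.split())
        models.append((counts, sum(counts.values())))
        for word in counts:
            index[word].add(doc_id)
    return index, models

def retrieve(index, models, query, top_k=50, alpha=0.1):
    """Rank sentences sharing a word with the query by P(query | sentence),
    with additive smoothing; the paper keeps the top 50 candidates."""
    words = query.split()
    hits = set().union(*(index.get(w, set()) for w in words))
    vocab = len(index)
    def score(d):
        counts, size = models[d]
        return sum(math.log((counts[w] + alpha) / (size + alpha * vocab))
                   for w in words)
    return sorted(hits, key=score, reverse=True)[:top_k]
```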

5 Experiments

We conducted several experiments to measure the accuracy of our EBMT systems on the En–Tr and En–Fr translation tasks. To compare the performance of our approaches we use two baseline systems. We use the Moses SMT system as one baseline. Furthermore, based on the matching step (Section 3.2) of the EBMT approach, we obtain the closest target-side equivalent (the skeleton sentence) and consider this the baseline output for the input to be translated; this is referred to as TM in the experiments below, and we consider it the baseline accuracy for our EBMT-with-TM approach.

In addition, we conducted two experiments with our EBMT system. After obtaining the skeleton translation through the matching and alignment steps, in the recombination step we use the TM to translate any unmatched segments based on Algorithm 1. We call this system EBMT_TM.
We found that there are cases where the EBMT_TM system produces the correct translation but SMT fails, and vice versa (Dandapat et al., 2011). To further improve translation quality, we therefore use a combination of EBMT and SMT, with features that decide whether to rely on the output produced by the EBMT_TM system. These features include the fuzzy match score FMS (as in Equation (i)) and EqUS, which evaluates to 1 if s, s_c and t_c agree in their number of mismatched segments and to 0 otherwise. Let T_EBMT(s) and T_SMT(s) be the translations of an input sentence s produced by the EBMT_TM and SMT systems, respectively. If the value of FMS is greater than some threshold and EqUS holds between s and s_c, we rely on the output T_EBMT(s); otherwise we take the output T_SMT(s). We refer to this system as EBMT_TM + SMT.
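This decision rule is simple enough to state directly in code (a sketch; the threshold values correspond to the conditions explored in Tables 3 and 4):

```python
def choose_output(fms, equs, t_ebmt, t_smt, threshold=0.85, use_equs=True):
    """Return the EBMT output only when the fuzzy match score clears the
    threshold and, optionally, EqUS holds; otherwise fall back to SMT."""
    if fms > threshold and (equs or not use_equs):
        return t_ebmt   # T_EBMT(s)
    return t_smt        # T_SMT(s)
```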

To test the scalability of the system, we conducted two more experiments based on the approaches described in Section 4. First, we ran an experiment using the sentence-length-based grouping heuristic (Section 4.1); we refer to this system as EBMT_TM + SMT + group_i, where i indicates the window size used when comparing the length of the input sentence with the bins. Second, we ran an experiment using the LM-based indexing technique (Section 4.2) to retrieve a potential set of candidate sentences from the indexed example-base; we call this system EBMT_TM + SMT + index. Note that the EBMT_TM + SMT system is used as the baseline accuracy in the scalability experiments.

5.1 Data Used for Experiments

We used two data sets representing two language pairs of different size and type. In the first data set, we used the En–Tr corpus from IWSLT09 (http://mastarpj.nict.go.jp/IWSLT2009/). The training data consists of 19,972 parallel sentences. We used the IWSLT09 development set, which consists of 414 sentences, as our test set. The IWSLT09 data set comprises short sentences (an average of 9.5 words per sentence) from a particular domain (the C-STAR project's Basic Travel Expression Corpus).

Our second data set is an En–Fr corpus from the European Medicines Agency (EMEA; http://opus.lingfil.uu.se/EMEA.php) (Tiedemann and Nygaard, 2009). The training data consists of 250,806 unique parallel sentences; a large number of duplicate sentences exist in the original corpus of approximately 1M sentences, so we removed duplicates and considered only sentences with unique translation equivalents. As a test set we use 10,000 randomly drawn sentences disjoint from the training corpus. This data also represents a particular domain (medicine), but with longer sentences (an average of 18.8 words per sentence) than the IWSLT09 data.

6 Results and Observations

We used BLEU (Papineni et al., 2002) for the automatic evaluation of our EBMT systems. Table 2 shows the accuracy obtained for both En–Tr and En–Fr by the EBMT_TM system described in Section 3, together with the two baseline systems (SMT and TM) described in Section 5.

Table 2: Baseline BLEU scores of the two baseline systems and of the EBMT_TM system.
System    En–Tr  En–Fr
SMT       23.59  55.04
TM        15.60  40.23
EBMT_TM   20.08  48.31

Table 2 shows that EBMT_TM has lower system accuracy than SMT for both language pairs, but better scores than TM alone. Tables 3 and 4 show that combining the EBMT and SMT systems yields improvements of 0.82 and 2.75 absolute BLEU points over the SMT baseline (Table 2) for the En–Tr and En–Fr data sets respectively. In each case, the improvement of EBMT_TM + SMT over the baseline SMT is statistically significant (reliability of 98%) using bootstrap resampling (Koehn, 2004).

Table 3: En–Tr MT system accuracies of the combined system (EBMT_TM + SMT) with different combining factors. The second column gives the number (and percentage) of sentences translated by the EBMT_TM system during combination.
Condition                       times EBMT_TM used  BLEU
FMS>0.85                        35 (8.5%)           24.22
FMS>0.80                        114 (27.5%)         23.99
FMS>0.70                        197 (47.6%)         22.74
FMS>0.80 OR (FMS>0.70 & EqUS)   165 (40.0%)         23.87
FMS>0.85 & EqUS                 24 (5.8%)           24.41
FMS>0.80 & EqUS                 76 (18.4%)          24.19
FMS>0.70 & EqUS                 127 (30.7%)         24.08

Table 4: En–Fr MT system accuracies of the combined system (EBMT_TM + SMT) with different combining factors.
Condition                       times EBMT_TM used  BLEU
FMS>0.85                        3323 (33.2%)        57.79
FMS>0.80                        4300 (43.0%)        57.55
FMS>0.70                        5283 (52.8%)        57.05
FMS>0.60                        6148 (61.5%)        56.25
FMS>0.80 OR (FMS>0.70 & EqUS)   4707 (47.1%)        57.46
FMS>0.85 & EqUS                 2358 (23.6%)        57.24
FMS>0.80 & EqUS                 2953 (29.5%)        57.16
FMS>0.70 & EqUS                 3360 (33.6%)        57.08

A particular objective of our work is to scale the runtime EBMT system to a larger number of training examples. We experimented with the two approaches described in Section 4 to improve the run time of the system. Table 5 compares the run times of the three systems (EBMT_TM, EBMT_TM + group_i and EBMT_TM + index) for both En–Tr and En–Fr translation. Note that the SMT decoder takes 140 seconds and 310 minutes respectively for the En–Tr and En–Fr test sets.

Table 5: Running times of the different systems.
System             En–Tr (seconds)  En–Fr (minutes)
SMT                140.0            310.0
EBMT_TM            295.9            2267.0
EBMT_TM + group_0  34.0             63.4
EBMT_TM + group_1  96.2             183.5
EBMT_TM + group_2  148.5            301.4
EBMT_TM + index    2.7              2.6

Both the grouping and the indexing methodologies proved successful for system scalability, with a maximum speedup of almost two orders of magnitude. We also need to estimate the accuracy when combining the grouping and indexing techniques with the baseline system (EBMT_TM + SMT) to understand their relative performance. Table 6 gives the system accuracy using the grouping and indexing techniques for both language pairs. We report translation quality under three conditions; similar trends were observed for the other conditions.

6.1 Observations and Discussion

We find that the EBMT_TM system on its own has lower accuracy than the baseline SMT for both language pairs (Table 2). Nevertheless, there are sentences which are translated better by the EBMT_TM approach than by SMT, although the overall document translation score is higher with SMT. We therefore combined the two systems based on different features and found that the combined system performs better. The highest relative improvements in BLEU score are 3.47% and 1.05% for En–Tr and En–Fr translation respectively. We found that if an input has a high fuzzy match score (FMS) against the example-base, the EBMT_TM system does better than SMT; with our current experimental setup, an FMS over 0.8 showed an improvement for En–Tr, and an FMS over 0.6 for En–Fr, over the SMT system. Figure 2 shows the effect on translation quality when different FMS thresholds are used to combine the two systems.

Table 6: BLEU scores of the three different systems for En–Tr and En–Fr under different conditions. i denotes the number of bins considered during grouping.

Condition                        EBMT_TM+SMT  +group, i=0  +group, i=±1  +group, i=±2  +index
En–Tr
FMS>0.85                         24.22        24.18        24.18         24.23         24.24
FMS>0.80 OR (FMS>0.70 & EqUS)    23.87        23.34        23.90         24.40         24.37
FMS>0.85 & EqUS                  24.41        24.17        24.38         24.34         24.39
En–Fr
FMS>0.85                         57.79        56.47        57.48         57.76         57.92
FMS>0.80 OR (FMS>0.70 & EqUS)    57.46        55.69        57.07         57.33         57.56
FMS>0.85 & EqUS                  57.24        56.48        57.23         57.29         57.32

However, FMS may not be the only useful factor for triggering the EBMT_TM system. We considered EqUS as another factor; it showed an improvement for En–Tr but a negative effect for En–Fr. Although an FMS over 0.7 shows no improvement in overall system accuracy for En–Tr, including the EqUS feature along with FMS does: the EBMT_TM system is sometimes more effective when the number of unmatched segments agrees across s, s_c and t_c.

These observations show the effectiveness of our EBMT approach in terms of translation quality. However, the EBMT_TM system has a very considerable runtime complexity: translating the 414 test sentences from English into Turkish takes the basic EBMT system 295.9 seconds, and the situation becomes worse with the large example-base for En–Fr translation, where the system takes around 38 hours to translate the 10k source English sentences into French. This is a significant time complexity by any standard for a runtime approach. However, both grouping and indexing reduce the time complexity considerably. The time reduction with grouping depends on the number of bins considered when finding the closest sentence during the matching stage: systems with fewer bins take less time but cause a larger drop in translation quality. The effect is more prominent for the En–Fr system, which uses the larger example-base. We found a drop of 1.32 absolute BLEU points when considering a single bucket whose length equals the length of the test sentence; this configuration takes 63 minutes to translate the 10k English sentences into French. There is only a drop of 0.03 BLEU points when considering the five nearest bins (±2) for a given test sentence, but this increases the run time to 5 hours for the translation of the 10k sentences. Thus, the group-based method does not balance system accuracy and run time well enough.

Incorporating the indexing technique into the matching stage of EBMT shows the highest efficiency gains in run time: translating the 10k sentences from English into French takes only 158 seconds. It is also interesting that with indexing the BLEU score remained the same or even increased. This is because, compared with FMS-based matching, a different closest-matching sentence s_c is selected for some of the input sentences when indexing is used, resulting in a different system output. Figure 3 compares the number of times the EBMT_TM + SMT + index system is used in the hybrid system with the number of identical closest-matching sentences selected by the EBMT_TM + SMT + index system under different conditions for En–Tr. The use of index-based candidate selection for EBMT matching thus shows effective improvement in translation time, with BLEU scores remaining the same or increasing. Due to the selection of a different closest-matching sentence s_c, the system sometimes produces a better quality translation, which increases the system-level BLEU score. Table 7 shows one such En–Fr example where the index-based technique produced a better translation than the baseline (EBMT_TM + SMT) system.

Figure 2: Effect of the FMS threshold in the combined EBMT_TM + SMT system. Panels (a) En–Fr and (b) En–Tr plot BLEU (%) against the fuzzy match score for the EBMT_TM + SMT, SMT and EBMT_TM systems.

Figure 3: Number of times EBMT_TM + SMT + index is used in the hybrid system, and number of times the same closest-matching sentences are selected by the EBMT_TM + SMT + index system, under the conditions a = FMS>0.85, b = FMS>0.85 & EqUS and c = FMS>0.80 OR (FMS>0.70 & EqUS).

Table 7: The effect of indexing on the selection of s_c and on the final translation.
Input: zeffix belongs to a group of medicines called antivirals.
Ref: zeffix appartient à une classe de médicaments appelés antiviraux.

Baseline EBMT_TM system:
s_c: simulect belongs to a group of medicines called immunosuppressants.
s_t: simulect fait parti d'une classe de médicaments appelés immunosuppresseurs.
Output: zeffix fait parti d'une classe de médicaments appelés antiviraux.

With index-based matching:
s_c: diacomit belongs to a group of medicines called antiepileptics.
s_t: diacomit appartient à un groupe de médicaments appelés antiépileptiques.
Output: zeffix appartient à un groupe de médicaments appelés antiviraux.

7 Conclusion

Our experiments show that EBMT approaches work better than the SMT-based system for certain sentences, namely when a high fuzzy match score is obtained for the input sentence against the example-base. Thus a feature-based combination of EBMT- and SMT-based systems produces better translation quality than either of the individual systems. The integration of an SMT-technology-based sub-sentential TM into the EBMT framework (EBMT_TM) improved translation quality in our experiments.

Our baseline EBMT_TM system is a runtime approach with high time complexity when using a large example-base. We found that integrating IR-based indexing substantially improves run time without affecting the BLEU score. So far our systems have been tested on moderately sized example-bases from closed-domain corpora; in future work, we plan to use a much larger example-base and wider-domain corpora.

Acknowledgments

This research is supported by Science Foundation Ireland (Grants 07/CE/I1142, Centre for Next Generation Localisation).

References

S. Armstrong, C. Caffrey, M. Flanagan, D. Kenny, M. O'Hagan and A. Way. 2006. Improving the Quality of Automated DVD Subtitles via Example-Based Machine Translation. In Translating and the Computer 28. London: Aslib, UK.

E. Biçici and M. Dymetman. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to Improve Translation Memory. In A. Gelbukh, editor, Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing), volume 4919 of Lecture Notes in Computer Science, pp. 3-57. Springer Verlag.

J. Bourdaillet, S. Huet, F. Gotti, G. Lapalme and P. Langlais. 2009. Enhancing the bilingual concordancer TransSearch with word-level alignment. In Proceedings of the 22nd Canadian Conference on Artificial Intelligence (Canadian AI 2009), volume 5549 of Lecture Notes in Artificial Intelligence, pp. 27-38. Springer-Verlag.

R. D. Brown. 2011. The CMU-EBMT machine translation system. Machine Translation, 25(2):179–195.

I. Cicekli and H. A. Güvenir. 2001. Learning translation templates from bilingual translation examples. Applied Intelligence, 15(1):57–76.

S. Dandapat, S. Morrissey, A. Way and M.L. Forcada. 2011. Using Example-Based MT to Support Statistical MT when Translating Homogeneous Data in Resource-Poor Settings. In Proceedings of the 15th Annual Meeting of the European Association for Machine Translation (EAMT 2011), pp. 201-208. Leuven, Belgium.

S. Dandapat, M.L. Forcada, D. Groves, S. Penkale, J. Tinsley and A. Way. 2010. OpenMaTrEx: a free/open-source marker-driven example-based machine translation system. In Proceedings of the 7th International Conference on Natural Language Processing (IceTAL 2010), pp. 121-126. Reykjavík, Iceland.

D. Groves and A. Way. 2006. Hybridity in MT: Experiments on the Europarl Corpus. In Proceedings of the 11th Conference of the European Association for Machine Translation (EAMT 2006), pp. 115-124. Oslo, Norway.

M. Islam, J. Tiedemann and A. Eisele. 2010. English–Bengali Phrase-based Machine Translation. In Proceedings of the 14th Annual Conference of the European Association for Machine Translation (EAMT 2010). Saint-Raphaël, France.

M. Khalilov, J.A.R. Fonollosa, I. Skadina, E. Bralitis and L. Pretkalnina. 2010. English–Latvian SMT: the Challenge of Translating into a Free Word Order Language. In Proceedings of the 2nd International Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU 2010). Saint-Raphaël, France.

P. Koehn. 2010. Statistical Machine Translation. Cambridge University Press, Cambridge, UK.

P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin and E. Herbst. 2007. Moses: open source toolkit for statistical machine translation. In Proceedings of the Demonstration and Poster Sessions at the 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007), pp. 177-180. Prague, Czech Republic.

P. Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), pp. 388-395. Barcelona, Spain.

P. Koehn and J. Senellart. 2004. Convergence of Translation Memory and Statistical Machine Translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pp. 21-23. Denver, CO.

P. Langlais, G. Lapalme and M. Loranger. 2002. Development-evaluation cycles to boost translator's productivity. Machine Translation, 15(4):77–98.

Y. Lepage and E. Denoual. 2005. Purest ever example-based machine translation: Detailed presentation and assessment. Machine Translation, 19(3-4):251–282.

V. I. Levenshtein. 1965. Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Doklady Akademii Nauk SSSR, 163(4):845-848. English translation in Soviet Physics Doklady, 10(8):707-710.

C. D. Manning, P. Raghavan and H. Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK.

M. Nagao. 1984. A Framework of a Machine Translation between Japanese and English by Analogy Principle. In A. Elithorn and R. Banerji, editors, Artificial and Human Intelligence, pp. 173–180. North-Holland, Amsterdam.

F. Och and H. Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–51.

K. Papineni, S. Roukos, T. Ward and W. J. Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), pp. 311–318. Philadelphia, PA.
Schutze.¨ 2008. In- ven, Belgium. troduction to Information Retrieval. Cambridge Uni- S. Dandapat, M.L. Forcada, D. Groves, S. Penkale, versity Press, Cambridge, UK. J. Tinsley and A. Way. 2010. OpenMaTrEx: M. Nagao. 1984. A Framework of a Machine Translation a free/open-source marker-driven example-based ma- between Japanese and English by Analogy Principle. chine translation system. In Proceedings of the 7th In- In Elithorn, A. and Banerji, R., editors, Artificial Hu- ternational Conference on Natural Language Process- man Intelligence, pp. 173–180, North-Holland, Ams- ing (IceTAL 2010), pp. 121-126. Reykjav´ık, Iceland. terdam. D. Groves and A. Way. 2006. Hybridity in MT: Exper- F. Och and H. Ney. 2003. A Systematic Comparison of iments on the Europarl Corpus. In Proceedings of the Various Statistical Alignment Models. Computational 11th Conference of the European Association for Ma- Linguistics, 29(1):19–51. chine Translation (EAMT 2006), pp. 115-124. Oslo, K. Papineni, S. Roukos, T. Ward and W. J. Zhu. 2002. Norway. BLEU: a Method for Automatic Evaluation of Ma- M. Islam, J. Tiedemann and A. Eisele. 2010. English– chine Translation. In Proceedings of the 40th Annual Bengali Phrase-based Machine Translation. In Pro- Meeting of the Association for Computational Linguis- ceedings of the 14th Annual Conference of the Eu- tics, (ACL 2002), pp. 311–318, Philadelphia, PA.

57 A. B. Phillips. 2011. Cunei: open-source machine trans- lation with relevance-based models of each translation instance. Machine Translation, 25(2):161–177. M. Simard and P. Isabelle. 2009. Phrase-based Machine Translation in a Computer-assisted Translation Mem- ory. In Proceedings of the 12th Machine Translation Summit, (MT Summit XII), pp. 120–127, Ottawa, Canada. M. Simard. 2003. Translation spotting for translation memories. In Proceedings of the HLT-NAACL 2003, Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, pp. 65–72, Edmonton, Canada. H. Somers. 2003. An Overview of EBMT. In M. Carl and A. Way , editors, Recent Advances in Example- based Machine Translation, pp. 3-57, Kluwer Aca- demic Publishers, Dordrecht, The Netherlands. J. Tiedemann and L. Nygaard. 2009. News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces, in N. Nicolov, K. Bontcheva, G. Angelova, R. Mitkov. (eds.), Recent Advances in Natural Language Processing, V:237–248, John Ben- jamins, Amsterdam, The Netherlands. E. Ukkonen. 1983. On Approximate String Matching. In Proceedings of International Conference on Founda- tions of Computing Theory, (FCT 1983), pp. 487–496, Borgholm, Sweden. R. Wagner and M. Fischer. 1974. The String-to-String Correction Problem. Journal of the Association for Computing Machinery, 21:168–173.

Two approaches for integrating translation and retrieval in real applications

Cristina Vertan
University of Hamburg
Research Group "Computerphilology"
Von-Melle-Park 6, 20146 Hamburg, Germany
[email protected]

Abstract

In this paper we present two approaches for integrating translation into cross-lingual search engines: the first approach relies on term translation via a language ontology, the other one is based on machine translation of specific information.

1 Introduction

The explosion of multilingual information available online during recent years has raised the necessity of building applications able to manage this type of content. People are more and more used to searching for information not only in English, but also in their mother tongue and often in some other languages they understand. Moreover, there are dedicated web platforms where the information is a priori multilingual, like eLearning systems and content management systems. eLearning systems are used more and more as real alternatives to face-to-face courses and often include materials in the mother languages as well as in English (either because a lot of literature is available in English or because the content should be made available to exchange students). Content management systems used by multinational corporations share materials in several languages as well.

On such platforms the search facility is an essential one: usually the implemented methods are based on term indexes, which are created per language. This prohibits, or at least makes very difficult, the access to multilingual material: the user is forced to repeat the query in several languages, which is time-consuming and error-prone. Cross-lingual retrieval methods are only slowly being introduced in real applications like the ones quoted above.

In this paper we will describe two applications and two different ways of combining term translation and information retrieval. In the first one, an eLearning system, we implement a language ontology on which we map the multilingual lexical entries. The search engine then makes use of the mapping between the lexical material and the ontology. The second application is a content management system, in which we use machine translation as the backbone of the search engine.

The rest of the paper is organised as follows: in Section 2 we describe the eLearning environment in which we embedded the search engine, and present the engine itself. In Section 3 we describe the content management system and the symbiosis between the machine translation and the search engines. In Section 4 we conclude with some observations on these two approaches and introduce possible directions for further work.

2 Crosslingual search based on language independent ontology and lexical information

The system we describe in this section was developed within the EU project LT4eL – Language Technology for eLearning (http://www.lt4el.eu). The main goal of the project was to enhance an eLearning system with language technology tools. The system dealt with nine languages (Bulgarian, Czech, Dutch, English, German, Maltese, Polish, Portuguese, Romanian). The eLearning documents were processed through language-specific pipelines, and keywords and definitions were automatically extracted. The kernel of the system is however the crosslingual semantic search engine, which makes use of a language-independent ontology and a mapping of language-specific lexicons.

As a prototype we implemented a domain-specific ontology of about 1000 concepts from the field of "Computer Science for non computer science specialists". The concepts were not collected from English texts, but from analyzed keywords from all involved languages. In this way we avoided a bias towards English-specific concepts. For the keywords in each language each partner provided an English translation (one word, one expression or even several sentences). The analysis of these translations led to the construction of the ontology. The concepts were represented in OWL-DL. The domain-specific ontology was mapped onto the DOLCE upper ontology as well as onto WordNet to ensure consistency.

The two main components that define the ontology-to-text relation necessary to support the crosslingual retrieval are: a (terminological) lexicon and a concept annotation grammar (Lemnitzer et al., 2007).

The lexicon plays a twofold role in the architecture. First, it interrelates the concepts in the ontology to the lexical knowledge used by the grammar in order to recognize the role of the concepts in the text. Second, the lexicon represents the main interface between the user and the ontology. This interface allows the ontology to be navigated or represented in a natural way for the user. For example, the concepts and relations might be named with terms used by the users in their everyday activities and in their own natural language (e.g. Bulgarian). This could be considered a first step towards a contextualized usage of the ontology, in the sense that the ontology could be viewed through different terms depending on the context. For example, the colour names will vary from very specific terms within the domain of carpet production to more common names used when the same carpet is part of an interior design.

Thus, the lexical items contain the following information: a term, contextual information determining the context of the term usage, and grammatical features determining the syntactic realization within the text. In the current implementation of the lexicons the contextual information is simplified to a list of a few types of users (producer, retailer, etc).

With respect to the relations between the terms in the lexicon and the concepts in the ontology, there are two main problems: (1) there is no lexicalized term for some of the concepts in the ontology, and (2) there are lexical terms in the language of the domain which lack corresponding concepts in the ontology representing the meaning of the terms. The first problem is overcome by also writing down in the lexicon non-lexicalized (fully compositional) phrases. Moreover, we encourage the lexicon builders to add more terms and phrases to the lexicons for a given concept, in order to represent as many ways of expressing the concept in the language as possible.

These different phrases or terms for a given concept are used as a basis for the construction of the annotation grammar. Having them, we might capture different wordings of the same meaning in the text. The concepts are language independent, and they might be represented within a natural language as form(s) of a lexicalized term, or as a free phrase. In general, a concept might have a few terms connected to it and a (potentially) unlimited number of free phrases expressing this concept in the language.

Some of the free phrases receive their meaning compositionally regardless of their usage in the text; other free phrases denote the corresponding concept only in a particular context. In our lexicons we decided to register as many free phrases as possible in order to have better recall on the semantic annotation task. In the case of a concept that is not lexicalized in a given language, we require at least one free phrase to be provided for this concept.

We could summarize the connection between the ontology and the lexicons in the following way: the ontology represents the semantic knowledge in the form of concepts and relations with appropriate axioms, and the lexicons represent the ways in which these concepts can be realized in texts in the corresponding languages.

Of course, the ways in which a concept could be represented in the text are potentially infinite in number; thus, we can hope to represent in our lexicons only the most frequent and important terms and phrases. Here is an example of an entry from the Dutch lexicon:

[The markup of the original entry could not be recovered; the entry pairs the concept lt4el:BarWithButtons with the English explanation "A horizontal or vertical bar as a part of a window, that contains buttons, icons." and the Dutch terms werkbalk, balk, balk met knoppen and menubalk.]

Each entry of the lexicons contains three types of information: (1) information about the concept from the ontology which represents the meaning of the terms in the entry; (2) an explanation of the concept meaning in English; and (3) a set of terms which have the meaning expressed by the concept. The concept part of the entry provides minimum information for a formal definition of the concept. The English explanation of the concept meaning facilitates human understanding. The set of terms stands for different wordings of the concept in the corresponding language. One of the terms is the representative for the term set. Note that this is a somewhat arbitrary decision, which might depend on frequency of term usage or on a specialist's intuition. This representative term will be used where just one of the terms from the set needs to be used, for example as an item of a menu.

In the example above we present the set of Dutch terms for the concept lt4el:BarWithButtons. One of the terms is non-lexicalized, marked by the attribute type with value nonlex. The first term is the representative for the term set and is marked up with the attribute shead with value 1. In this way we determine which term is to be used for ontology browsing if there is no contextual information for the type of user.

The second component of the ontology-to-text relation, the concept annotation grammar, is ideally considered to cover the whole annotation task. Minimally, the concept annotation grammar consists of chunk grammar rules. The chunk grammar contains, for each term in the lexicon, at least one grammar rule for the recognition of the term. As a preprocessing step we consider annotation with grammatical features and lemmatization of the text. The disambiguation rules exploit the local context in terms of grammatical features, and the global context such as the topic of the text, discourse segmentation, etc. Currently we have implemented chunk grammars for several languages; the disambiguation rules are under development.

For the implementation of the annotation grammar we rely on the grammar facilities of the CLaRK System (Simov et al., 2001). The structure of each grammar rule in CLaRK is defined by a DTD fragment [not recoverable from the source; the elements it declares are described below]. Each rule is represented as a line element. The rule consists of a regular expression (RE) and a category (RM = return markup). The regular expression is evaluated over the content of a given XML element and can recognize tokens and/or annotated data. The return markup is represented as an XML fragment which is substituted for the recognized part of the content of the element.

Additionally, the user can use regular expressions to restrict the context in which the regular expression is evaluated successfully: the LC element contains a regular expression for the left context and the RC for the right one. The element Comment is for human use. The application of the grammar is governed by XPath expressions, which provide an additional mechanism for accurate annotation of a given XML document.
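The RE/RM mechanism just described can be pictured with a schematic sketch (ours, in Java, not CLaRK's actual XML rule format): a regular expression recognises forms of a lexicon term, and the return markup wraps the recognised span with its concept. The term and concept are taken from the Dutch example above; the element and class names are our own.

```java
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

public class ConceptRule {
    final Pattern re;      // RE: recognises (inflected) forms of the term
    final String concept;  // RM: concept reference substituted for the match

    ConceptRule(String termRegex, String concept) {
        this.re = Pattern.compile(termRegex, Pattern.CASE_INSENSITIVE);
        this.concept = concept;
    }

    // Substitutes the recognised part of the content with an annotated fragment.
    String annotate(String text) {
        return re.matcher(text).replaceAll((MatchResult m) ->
            "<term concept=\"" + concept + "\">" + m.group() + "</term>");
    }

    public static void main(String[] args) {
        ConceptRule rule = new ConceptRule("werkbalk|menubalk|balk( met knoppen)?",
                                           "lt4el:BarWithButtons");
        System.out.println(rule.annotate("Klik op de werkbalk."));
        // Klik op de <term concept="lt4el:BarWithButtons">werkbalk</term>.
    }
}
```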

Thus, the CLaRK grammar is a good choice for the implementation of the initial annotation grammar. The creation of the actual annotation grammars started with the terms in the lexicons for the corresponding languages. Each term was lemmatized, and the lemmatized form of the term was converted into the regular expression of a grammar rule. Each concept related to the term is stored in the return markup of the corresponding rule. Thus, if a term is ambiguous, the corresponding rule in the grammar contains a reference to all concepts related to the term.

The relations between the different elements of the models are as follows. A lexical item could have more than one grammar rule associated with it, depending on the word order and the grammatical realization of the lexical item. Two lexical items could share a grammar rule if they have the same wording but are connected to different concepts in the ontology. Each grammar rule could recognize zero or several text chunks.

The ontology-to-text relation implemented in this way is the basis for the crosslingual search engine, which works in the following way: words in any of the covered languages can be entered and are looked up in the lexicon; the concepts that are linked to the matching lexicon entries are used for ontology-based search in an automatic fashion.

Before lexicon lookup, the words are orthographically normalised, and combinations for multi-word terms are created (e.g. if the words "text" and "editor" are entered, the combinations "texteditor", "text editor" and "text-editor" are created and looked up, in addition to the individual words). For each of the found concepts, the set of all its (direct or indirect) subconcepts is determined and used to retrieve Learning Objects (LOs).

The use of these language-independent concepts as an intermediate step makes it possible to retrieve LOs in any of the covered languages, thus realising the crosslingual aspect of the retrieval. When the found LOs are displayed, the relevant parts of the ontology are at the same time presented in the language that the user prefers. In a second step, the user can select (by marking a checkbox) the concept(s) he wants to look for and repeat the search. If an entered word was ambiguous, the intended meaning can now be explicated by selecting the appropriate concept. Furthermore, by clicking on a concept, related concepts are displayed; navigation through the ontology is possible in this way. A list of retrieval languages (only LOs written in one of those languages will be found) is specified as an input parameter. The retrieved LOs are sorted by language. The next ordering criterion is a ranking based on the number of different search concepts and the number of occurrences of those concepts in the LO. For each found LO, its title, language, and matching concepts are shown.

3 Crosslingual search based on machine translation

The second case study is the embedding of a crosslingual search engine into a web-based content management system. The system is currently implemented within the EU-PSP project ATLAS (http://www.atlasproject.eu) and aims to be domain independent. Thus, a model as presented in Section 2 cannot be realised here, as the automatic construction of a domain ontology is too unreliable and its manual construction too costly. General lexicon coverage is also practically impossible.

Therefore, in this project we adopted a different solution (Karagiozov et al., 2011): we ensure the translation of keywords and short abstracts, and all these translations become part of the RDF-generated index. The ATLAS system ensures the linguistic processing of uploaded documents and the extraction of the most important keywords. A separate module generates short abstracts. These two elements can then be submitted for translation.

For the MT engine of the ATLAS system, a hybrid architecture was chosen, combining example-based (EBMT) and statistical (SMT) machine translation on surface forms (no syntactic trees will be used). For the SMT component, PoS- and domain-factored models as in (Niehues and Waibel, 2010) are used, in order to ensure domain adaptability. An original feature of our system is the interaction of the MT engine with other modules of the system:
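The multi-word normalisation step can be sketched as follows (a minimal illustration; the method names are ours, not the LT4eL code):

```java
import java.util.List;

public class QueryExpansion {
    // For two query words, generate the compound variants that are looked up
    // in the lexicon in addition to the individual words.
    static List<String> combinations(String w1, String w2) {
        return List.of(w1, w2, w1 + w2, w1 + " " + w2, w1 + "-" + w2);
    }

    public static void main(String[] args) {
        System.out.println(combinations("text", "editor"));
        // [text, editor, texteditor, text editor, text-editor]
    }
}
```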

The document categorization module assigns to each document one or more domains. For each domain the system administrator has the possibility to store information regarding the availability of a corresponding specific training corpus. If no specifically trained model for the respective domain exists, the user is provided with a warning telling them that the translation may be inadequate with respect to lexical coverage.

The output of the summarization module is processed in such a way that ellipses and anaphora are omitted, and the lexical material is adapted to the training corpus.

The information extraction module provides information about the metadata of the document, including its publication age. For documents from before 1900 we will not provide translation, explaining to the user that in the absence of a training corpus the translation may be misleading. The domain and dating restrictions can be changed at any time by the system administrator when an adequate training model is provided.

The translation results are then embedded in a document model which is used further for crosslingual search. Each document is thus converted to a fixed format. [The XML example of this format given in the original is not recoverable; it includes a title ("Internet Ethics"), a default English summary, a German summary ("Deutsche Zusammenfassung"), and the author's name and e-mail ([email protected]).] This is the basis for the creation of the RDF index. The crosslingual search engine is in this case a classic Lucene search engine, which however operates not with word indexes but with these RDF indexes, which automatically include multilingual information. This engine is currently under construction.

4 Conclusions

In this paper we presented two approaches for embedding multilingual information into search engines. One is based on the construction of a language-independent ontology and corresponding lexical material, the other one on machine translation.

The first approach relies on a manually constructed ontology; therefore it is highly domain dependent and requires the involvement of domain specialists. The second approach relies on machine translation quality, and also lacks a deep semantic analysis of the query. However, the second mechanism can be implemented completely automatically and is domain independent (assuming that machine translation is available for the languages involved). Therefore it is difficult to assess which approach performs better. Further work concerns [...].

5 Acknowledgements

The work described here was realized within two European projects: LT4eL and ATLAS (http://www.atlasproject.eu). The author is indebted especially to the following persons, who contributed essentially to this work: [...] and Petya Osenova (Bulgarian Academy of Sciences), Eelco Mosel (formerly at the University of Hamburg), and Ronald Winnemöller (University of Hamburg).
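As a supplementary illustration of the document model described in Section 3, the following sketch uses field names that are our assumptions (the markup of the original example could not be recovered), with sample values taken from the surviving fragments:

```java
import java.util.Map;

public class DocumentModel {
    // Hypothetical shape of a document entry feeding the RDF index: per-language
    // summaries (and, analogously, translated keywords) are what make the
    // Lucene-based search engine multilingual.
    record Document(String title,
                    Map<String, String> summaries,   // language code -> summary
                    String authorName, String authorEmail) {}

    public static void main(String[] args) {
        Document d = new Document(
            "Internet Ethics",
            Map.of("en", "Default english summary",
                   "de", "Deutsche Zusammenfassung"),
            "Name", "[email protected]");
        System.out.println(d.summaries().get("de")); // Deutsche Zusammenfassung
    }
}
```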

References

Karagiozov, D., Koeva, S., Ogrodniczuk, M. and Vertan, C. 2011. ATLAS – A Robust Multilingual Platform for the Web. In Proceedings of the German Society for Computational Linguistics and Language Technology Conference (GSCL 2011), Hamburg, Germany.

Lemnitzer, L., Vertan, C., Simov, K., Monachesi, P., Kiling, A., Cristea, D. and Evans, D. 2007. Improving the search for learning objects with keywords and ontologies. In Proceedings of the Technology Enhanced Learning Conference 2007, pp. 202-216.

Niehues, J. and Waibel, A. 2010. Domain Adaptation in Statistical Machine Translation using Factored Translation Models. In Proceedings of EAMT 2010, Saint-Raphaël, France.

Simov, K., Peev, Z., Kouylekov, M., Simov, A., Dimitrov, M. and Kiryakov, A. 2001. CLaRK - an XML-based System for Corpora Development. In Proceedings of the Corpus Linguistics 2001 Conference, Lancaster, UK.

PRESEMT: Pattern Recognition-based Statistically Enhanced MT

George Tambouratzis, Marina Vassiliou, Sokratis Sofianopoulos
Institute for Language and Speech Processing, Athena R.C.
6 Artemidos & Epidavrou Str., Paradissos Amaroussiou, 151 25, Athens, Greece
{giorg_t; mvas; s_sofian}@ilsp.gr

Abstract

This document contains a brief presentation of the PRESEMT project, which aims at the development of a novel language-independent methodology for the creation of a flexible and adaptable MT system.

1. Introduction

The PRESEMT project constitutes a novel approach to the machine translation task. This approach is characterised by (a) introducing cross-disciplinary techniques, mainly borrowed from the machine learning and computational intelligence domains, into the MT paradigm and (b) using relatively inexpensive language resources. The aim is to develop a language-independent methodology for the creation of a flexible and adaptable MT system, the features of which ensure easy portability to new language pairs or adaptability to particular user requirements and to specialised domains with minimal effort. PRESEMT falls within the Corpus-based MT (CBMT) paradigm, using a small bilingual parallel corpus and a large TL monolingual corpus. Both these resources are collected as far as possible over the web, to simplify the development of resources for new language pairs.

The main aim of PRESEMT has been to alleviate the reliance on specialised resources. In comparison, Statistical MT requires large parallel corpora for the source and target languages. PRESEMT relaxes this requirement by using a small parallel corpus, augmented by a large TL monolingual corpus.

2. PRESEMT system structure

The PRESEMT system is divided into three stages, as shown in Figure 1:

1. Pre-processing stage: This is the stage where the essential resources for the MT system are compiled. It consists of four discrete modules: (a) the Corpus creation & annotation module, responsible for the compilation of monolingual and bilingual corpora over the web and their annotation; (b) the Phrase aligner module, which processes a bilingual corpus to perform phrasal-level alignment within a language pair; (c) the Phrasing model generator, which elicits an SL phrasing model on the basis of the aforementioned alignment and employs it as a parsing tool during the translation process; (d) the Corpus modelling module, which creates semantics-based TL models used for disambiguation purposes during the translation process.

2. Main translation engine: The translation in PRESEMT is a top-down two-phase process, divided into the Structure selection module, where the constituent phrases of an SL sentence are reordered according to the TL, and the Translation equivalent selection module, where translation disambiguation is resolved and word order within phrases is established. Closely integrated with the translation engine, but not part of the main translation process, is the Optimisation module, which is responsible for automatically improving the performance of the two translation phases by fine-tuning the values of the various system parameters.

3. Post-processing stage: The third stage is user-oriented and comprises (i) the Post-processing and (ii) the User Adaptation modules. The first module allows the user to modify the system-generated translations towards their requirements. The second module enables PRESEMT to adapt to this input so that it learns to generate translations closer to the users' requirements. The post-processing stage represents work in progress to be reported in future publications, the present article focussing on the actual strategy for generating the translation.

3. Processing of the bilingual corpus

The bilingual corpus contains literal translations, to allow the extrapolation of mapping information from SL to TL, though this may affect the translation quality. The Phrase aligner module (PAM) performs offline SL-TL word and phrase alignment within this corpus. PAM serves as a language-independent method for mapping corresponding terms within a language pair, circumventing the problem of achieving compatibility between the outputs of two different parsers, one for the SL and one for the TL. PAM relies on a single parser for the one language and generates an appropriate phrasing model for the other language in an automated manner. The phrases are assumed to be flat and linguistically valid. As a parser, any available tool may be used (the TreeTagger (Schmid, 1994) is used in the present implementation for English). PAM processes a bilingual corpus of SL-TL sentence pairs, taking into account the parsing information in one language (in the current implementation the TL side) and making use of a bilingual lexicon and information on potential phrase heads, the output being the bilingual corpus aligned at word, phrase and clause level. Thus, at a phrasal level, the PAM output indicates how an SL structure is transformed into the TL. For instance, based on a sentence pair from the parallel corpus, the SL sentence with structure A-B-C-D is transformed into A'-C'-D'-B', where X is a phrase in SL and X' is a phrase in TL. Further PAM details are reported in Tambouratzis et al. (2011).

The PAM output in terms of SL phrases is then handed over to the Phrasing model generator (PMG), which is trained to determine the phrasal structure of an input sentence. PMG reads the SL phrasing as defined by PAM and generates an SL phrasing model using a probabilistic methodology. This phrasing model is then applied in segmenting any arbitrary SL text being input to the PRESEMT system for translation. PMG is based on the Conditional Random Fields model (Lafferty et al., 2001), which has been found to provide the highest accuracy. The SL text segmented into phrases by PMG is then input to the 1st translation phase. For a new language pair, the PAM-PMG chain is implemented without any manual correction of outputs.

4. Organising the monolingual corpus

The language models created by the Corpus modelling module can only serve translation disambiguation purposes; thus another form of interfacing with the monolingual corpus is essential for the word reordering task within each phrase. The size of the data accessed is very large. Typically, a monolingual corpus contains 3 billion words, 10^8 sentences and approximately 10^9 phrases. Since the models for the TL phrases need to be accessed in real time to allow word reordering within each phrase, the module uses a phrase-indexed representation of the monolingual corpus. This phrase index is created based on four criteria: (i) phrase type, (ii) phrase head lemma, (iii) phrase head PoS tag and (iv) number of tokens in the phrase.

Indexing is performed by extracting all phrases from the monolingual corpus, each of which is transformed into the Java object instance used within the PRESEMT system. The phrases are then organised in a hash map that allows multiple values for each key, using as the key the 4 aforementioned criteria. Statistical information about the number of occurrences of each phrase in the corpus is also included. Finally, each map is serialised and stored in the appropriate file in the PRESEMT path, with each file being given a suitable name for easy retrieval. For example, for the English monolingual corpus, all verb phrases with head lemma "read" (verb) and PoS tag "VV" containing 2 tokens in total are stored in the file "Corpora\EN\Phrases\VC\read_VV". If any of these criteria has a different value, then a separate file is created (for instance for verb phrases with head "read" that contain 3 tokens).

5. Main translation engine

The PRESEMT translation process entails first the establishment of the sentence phrasal structure and then the resolution of the intra-phrasal arrangements, i.e. specifying the correct word order and deciding upon the appropriate candidate translation. Both phases involve searching for suitable matching patterns at two different levels of granularity, the first (coarse-grained) aiming at defining a TL-compatible ordering of phrases in the sentence and the second (fine-grained) determining the internal structure of phrases. While the first phase utilises the small bilingual corpus, the second phase makes use of the large monolingual corpus. To reduce the translation time required, both corpora are processed in advance and the processed resources are stored in such a form as to be retrieved as rapidly as possible during translation.
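The phrase index of Section 4 can be sketched as follows (a simplified in-memory illustration; the paper names Java as the implementation language, but the class and method names here are ours):

```java
import java.util.HashMap;
import java.util.Map;

public class PhraseIndex {
    // The four indexing criteria named above.
    record Key(String phraseType, String headLemma, String headPos, int tokens) {}

    private final Map<Key, Map<String, Integer>> index = new HashMap<>();

    void add(String phraseType, String headLemma, String headPos, String surface) {
        Key k = new Key(phraseType, headLemma, headPos, surface.split("\\s+").length);
        index.computeIfAbsent(k, x -> new HashMap<>())
             .merge(surface, 1, Integer::sum);   // keep corpus frequency per phrase
    }

    Map<String, Integer> lookup(String phraseType, String headLemma,
                                String headPos, int tokens) {
        return index.getOrDefault(new Key(phraseType, headLemma, headPos, tokens),
                                  Map.of());
    }

    public static void main(String[] args) {
        PhraseIndex idx = new PhraseIndex();
        // Cf. the stored file Corpora\EN\Phrases\VC\read_VV for 2-token phrases.
        idx.add("VC", "read", "VV", "read it");
        idx.add("VC", "read", "VV", "read it");
        idx.add("VC", "read", "VV", "reads books");
        System.out.println(idx.lookup("VC", "read", "VV", 2));
        // e.g. {read it=2, reads books=1}
    }
}
```

Serialising one such map per key to a suitably named file, as described above, then makes retrieval during translation a simple file lookup.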

5.1 Translation Phase 1: Structure selection module

Each SL sentence input for translation is tagged and lemmatised, and then segmented into phrases by the Phrasing model generator on the basis of the SL phrasing model previously created. For establishing the correct phrase order according to the TL, the parallel corpus needs to be pre-processed using the Phrase aligner module to identify word and phrase alignments between the equivalent SL and TL sentences.

During structure selection, the SL sentence is aligned to each SL sentence of the parallel corpus, as processed by the PAM, and assigned a similarity score using an algorithm from the dynamic programming paradigm. The similarity score is calculated by taking into account the edit operations (replacement, insertion or removal) that need to be performed on the input sentence in order to transform it into the corpus SL sentence. Each of these operations has an associated cost, considered a system parameter. The aligned corpus sentence that achieves the highest similarity score is the one most similar to the input source sentence. This comparison process relies on a set of similarity parameters (e.g. phrase type, phrase head etc.), the values of which are optimised by employing the optimisation module. The implementation is based on the Smith-Waterman algorithm (Smith and Waterman, 1981), initially proposed for determining similar regions between two protein or DNA sequences. The algorithm is guaranteed to find the optimal local alignment between the two input sequences at clause level.

5.2 Translation Phase 2: Translation equivalent selection module

After establishing the order of phrases within each sentence, the second phase of the translation process is initiated, comprising two distinct tasks. The first task is to resolve the lexical ambiguity, by picking one lemma from each set of possible translations (as provided by a bilingual dictionary). In doing so, this module makes use of the semantic similarities between words, which have been determined by the Corpus modelling module through a co-occurrence analysis on the monolingual TL corpus. That way, the best combination of lemmas from the sets of candidate translations is determined for a given context.

In the second task, the phrases most similar to the TL structure phrases are retrieved from the monolingual corpus to provide local structural information such as word reordering. A matching algorithm selects the most similar from the set of retrieved TL phrases through a comparison process, which is viewed as an assignment problem, using the Gale-Shapley algorithm (Gale and Shapley, 1962).

6. Experiments & evaluation results

To date, MT systems based on the PRESEMT methodology have been created for a total of 8 languages, indicating the flexibility of the proposed approach. Due to space restrictions, Table 1 illustrates an indicative set of results obtained by running automatic evaluation metrics on test data translated by the 1st PRESEMT prototype for a selection of language pairs.

In the case of the language pair English-to-German, these results are contrasted with the ones obtained when translating the same test set with Moses (Koehn et al., 2007). It is observed that for the English-to-German pair, PRESEMT achieved approximately 50% of the Moses BLEU score and 80% of the Moses scores with respect to Meteor and TER. These are reasonably competitive results compared to an established system such as Moses. Furthermore, it should be taken into consideration that (a) the PRESEMT results were obtained by the 1st system prototype, (b) PRESEMT is still under development and (c) only one reference translation was used per sentence.

Newer versions of the PRESEMT system, incorporating more advanced versions of the different modules, are expected to result in substantially improved translation accuracies. In particular, the second translation phase will be further researched. In addition, experiments have indicated that the language modelling module can provide additional improvement in the performance. Finally, refinements in PAM and PMG may lead to increased translation accuracies.

7. Links

Find out more about the project on the PRESEMT website: www.presemt.eu. Also, the PRESEMT prototype may be tried at: presemt.cslab.ece.ntua.gr:8080/presemt_interface_test

Acknowledgments

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 248307.
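As a concrete illustration of the dynamic-programming comparison used in structure selection (Section 5.1), the following is a minimal Smith-Waterman-style sketch over phrase-type sequences; the cost values are invented placeholders, not the optimised PRESEMT parameters:

```java
public class StructureSimilarity {
    // Local alignment between the phrase-type sequence of the input sentence
    // and that of a corpus sentence; replacement, insertion and removal each
    // carry a parameterised cost, as described in Section 5.1.
    static double score(String[] input, String[] corpus,
                        double match, double replace, double gap) {
        double[][] h = new double[input.length + 1][corpus.length + 1];
        double best = 0;
        for (int i = 1; i <= input.length; i++)
            for (int j = 1; j <= corpus.length; j++) {
                double diag = h[i - 1][j - 1]
                        + (input[i - 1].equals(corpus[j - 1]) ? match : -replace);
                h[i][j] = Math.max(0, Math.max(diag,
                          Math.max(h[i - 1][j] - gap, h[i][j - 1] - gap)));
                best = Math.max(best, h[i][j]);
            }
        return best;
    }

    public static void main(String[] args) {
        String[] input  = {"NP", "VP", "NP", "PP"};  // phrase types of the SL sentence
        String[] corpus = {"NP", "VP", "PP", "NP"};
        System.out.println(score(input, corpus, 2.0, 1.0, 1.0)); // 5.0
    }
}
```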

References

Gale, D. and L. S. Shapley. 1962. College Admissions and the Stability of Marriage. American Mathematical Monthly, Vol. 69, pp. 9-14.

Koehn, P., H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin and E. Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the ACL-2007 Demo and Poster Sessions.

Kuhn, H. W. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, Vol. 2, pp. 83-97.

Lafferty, J., A. McCallum and F. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labelling Sequence Data. In Proceedings of the ICML Conference, pp. 282-289.

Munkres, J. 1957. Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics, Vol. 5, pp. 32-38.

Schmid, H. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK.

Smith, T. F. and M. S. Waterman. 1981. Identification of Common Molecular Subsequences. Journal of Molecular Biology, 147:195-197.

Tambouratzis, G., F. Simistira, S. Sofianopoulos, N. Tsimboukakis and M. Vassiliou. 2011. A resource-light phrase scheme for language-portable MT. In Proceedings of the 15th International Conference of the European Association for Machine Translation, 30-31 May 2011, Leuven, Belgium, pp. 185-192.

Table 1 – PRESEMT evaluation results for different language pairs.

  SL        TL               Sentences  Source  BLEU    NIST    Meteor  TER
  English   German                 189  web     0.1052  3.8433  0.1939  83.233
  German    English                195  web     0.1305  4.5401  0.2058  74.804
  Greek     English                200  web     0.1011  4.5124  0.2442  79.750
  English   German (Moses)         189  web     0.2108  5.6517  0.2497  68.190

Figure 1 – PRESEMT system architecture. [Diagram: web-sourced corpora feed the pre-processing modules (Corpus creation & annotation, Corpus modelling of the TL monolingual corpus, Phrase aligner over the bilingual parallel corpus with a bilingual lexicon, and Phrasing model generator producing the bilingual aligned corpus); the translation engine then runs Phase 1, structure selection (tagging/lemmatising and phrase reordering of each sentence in the SL text), and Phase 2, translation equivalent selection (lexicon look-up, word reordering, disambiguation and token generation), to produce the final translation, tuned by the Optimisation module, with the post-processing and user adaptation modules downstream.]

PLUTO: Automated Solutions for Patent Translation

John Tinsley, Alexandru Ceausu, Jian Zhang

Centre for Next Generation Localisation
School of Computing
Dublin City University, Ireland
{jtinsley; aceausu; jzhang}@computing.dcu.ie

1 Introduction

PLUTO is a commercial development project supported by the European Commission as part of the FP7 programme which aims to eliminate the language barriers that exist worldwide in the provision of multilingual access to patent information. The project consortium comprises four partners: the Centre for Next Generation Localisation at Dublin City University,1 ESTeam AB,2 CrossLang,3 and the Dutch Patent Information User Group (WON).4 Research and development is carried out in close collaboration with user groups and intellectual property (IP) professionals to ensure that the solutions and software delivered meet actual user needs.

1.1 The need for patent translation

The number of patent applications filed worldwide is continually increasing, with over 1.8 million new filings in 2010 alone. Yet despite the fact that patents are filed in dozens of different languages, language barriers are no excuse in the case of infringement. When carrying out prior-art and other searches, IP professionals must ensure they include collections which encompass all potentially relevant patents. Such searches will typically return results – a set of patent documents – 30% of which will be in a foreign language.

As professional translation for patents is such a specialist task, translators command a premium fee for this service, often up to €0.50 per word for Asian languages. This often results in high or unworkable translation costs for innovators. While free machine translation (MT) tools such as Google Translate have unquestionably been beneficial in helping to reduce the need to resort to expensive human translation, the quality is still often inadequate, as the models are too general to cope with the intricacies of patent text.

In what follows, we will provide an overview of some of the technologies being developed in PLUTO to address the need for higher-quality MT solutions for patents and how these are deployed for the benefit of IP professionals.

2 Language Technology for Patents

Patent translation is a unique task given the style of language used in patent documents. This language, so-called "patentese", typically comprises a mixture of highly specific technical terminology and legal jargon and is often written with the express purpose of obfuscating the intended meaning. For example, in 2001 an innovation was granted in Australia for a "Circular Transportation Facilitation Device",5 i.e. a wheel.

Patents are also characterised by a proliferation of extremely long sentences, complex chemical formulae, and other constructs which make the task for MT more difficult.

2.1 Domain-specific machine translation

The patent translation systems used in PLUTO have been built using the MaTrEx MT framework (Armstrong et al., 2006). The systems are domain-specific in that they have been trained exclusively using parallel patent corpora. A number of experiments related to domain adaptation of the language and translation models have been carried out in the context of these systems. The principal finding from this work was that systems combining all available patent data for a given language were preferable (Ceausu et al., 2011).

1 www.cngl.ie
2 www.esteam.se
3 www.crosslang.com
4 www.won-nl.com
5 http://pericles.ipaustralia.gov.au/aub/pdf/nps/2002/0808/2001100012A4/2001100012.pdf

Significant pre-processing techniques are also applied to the input text to account for specific features of patent language. For instance, sentence splitting based on the marker hypothesis (Green, 1979) is used to reduce long sentences to more manageable lengths, while named-entity recognition is applied to isolate certain structures, such as chemical compounds and references to figures, in order to treat them in a specific manner.

Additionally, various language-specific techniques are used for the relevant MT systems. For example, a technique called word packing (Ma et al., 2007) is exploited for Chinese–English. This is a bilingually motivated task which improves the precision of word alignment by "packing" together several consecutive words which correspond to a single word in the corresponding language.

Japanese–English is a particularly challenging pair due to the divergent word ordering between the two languages. To overcome this, we employ preordering of the input text (Talbot et al., 2011) in order to harmonise the word ordering between the two languages and reduce the likelihood of ordering errors. This is done using a rule-based technique called head-finalisation (Isozaki et al., 2010), which moves the English syntactic head towards the end of the phrase to emulate the Japanese word order.

Finally, we use compound splitting and truecasing modules for our English–German MT systems in order to reduce the occurrence of out-of-vocabulary words.

2.2 Translation memory integration

In order to further improve the translation quality, we are developing an engine to automatically combine the outputs of the MT system and a translation memory (TM). The engine works by taking a patent document as input and searching for full matches at paragraph, sentence, and segment (sub-sentential) level in the TM. If no full matches are found, fuzzy matches are sought above a predetermined threshold and combined with the output of the MT system using phrase- and word-level alignment information.

For patents, most leverage from the TM is seen at segment level, particularly as the patent claims are often written using quite a rigid structure. This is due to the fact that, as patents typically describe something novel which may never have been written about previously, there is often little repetition of full sentences.

2.3 Evaluation

The performance of the patent MT systems in PLUTO is evaluated using a range of methods aimed not only at gauging general quality, but also at identifying areas for improvement and relative performance against similar systems.

In addition to assessing the MT systems using automatic evaluation metrics such as BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005), large-scale human evaluations are also carried out. MT system output is ranked from 1-5 based on the overall quality of translation, and individual translation errors are identified and classified in an error categorisation task.

On top of this standalone evaluation, the PLUTO MT systems are also benchmarked against leading commercial systems across two MT paradigms: Google Translate for statistical MT and Systran (Enterprise) for rule-based MT. A comparative analysis is carried out using both the automatic and human evaluation techniques described above. This comparison is also applied to the output of the PLUTO MT systems and the output of the integrated TM/MT system in order to quantify the improvements achieved using the translation memories.

The main findings from the first round of evaluations for our French–English and Portuguese–English systems showed that our MT systems score relatively high based on human judgments (3.8 out of 5 on average), while being ranked higher than the commercial systems approximately 75% of the time. More details on these experiments can be found in Ceausu et al. (2011).

3 Patent Translation Web Service

The PLUTO MT systems are deployed as a web service (Tinsley et al., 2010). The main entry point for end users is through a web browser plugin which allows them to access translations on-the-fly regardless of the search engine being used to find relevant patents. In addition to the browser plugin, users also have the option to input text directly or to upload patent documents in a number of formats including PDF and MS Word.

A number of further natural language processing techniques are exploited to improve the user experience. N-gram-based language identification is used to send input to the correct MT system, while frequency-based keyword extraction provides users with potentially important terms with which to carry out subsequent searches.

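The n-gram-based routing step can be pictured with a toy sketch; the profiles here are tiny and invented, whereas a production system would train character n-gram statistics on patent text:

```java
import java.util.Map;
import java.util.Set;

public class LangId {
    // Minimal per-language character-trigram profiles (illustrative only).
    static final Map<String, Set<String>> PROFILES = Map.of(
        "fr", Set.of("ion", "les", "ent", "de "),
        "pt", Set.of("ção", "os ", "ent", "de "),
        "en", Set.of("the", "ion", "ing", "of "));

    static String identify(String text) {
        String best = null;
        int bestHits = -1;
        for (var e : PROFILES.entrySet()) {
            int hits = 0;
            for (int i = 0; i + 3 <= text.length(); i++)
                if (e.getValue().contains(text.substring(i, i + 3))) hits++;
            if (hits > bestHits) { bestHits = hits; best = e.getKey(); }
        }
        return best; // language whose profile matches the most trigrams
    }

    public static void main(String[] args) {
        System.out.println(identify("the scope of the claims of the invention")); // en
    }
}
```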
Corresponding source and target segments are highlighted at both word and phrase level, while users have the option of post-editing translations, which are stored in a personal terminology database and applied to future translations.

The entire framework has been designed to facilitate the patent professional in their daily workflow. It provides them with a consistency of translation quality and features regardless of the search tools being used to locate relevant patents. This has been validated through extensive user-experience testing, which included a usability evaluation of the translation output.

4 Looking Forward

The PLUTO project has been running for just over two years and is scheduled to end in March 2013. Our goal by that time is to have established a viable commercial offering to capitalize on the state-of-the-art research and development into automated patent translation.

In the meantime, we will continue to build upon our existing work by building MT systems for additional language pairs and iteratively improving upon our baseline translation performance. Significant effort will also be spent on optimising the integration of translation memories with MT using techniques such as those described in He et al. (2011).

Acknowledgements

The PLuTO project (ICT-PSP-250416) is funded under the European Union's ICT Policy Support Programme as part of the Competitiveness and Innovation Framework Programme.

References

Armstrong, S., M. Flanagan, Y. Graham, D. Groves, B. Mellebeek, S. Morrissey, N. Stroppa and A. Way. 2006. MaTrEx: Machine Translation Using Examples. TC-STAR OpenLab on Speech Translation, Trento, Italy.

Banerjee, S. and A. Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization at the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05), Ann Arbor, MI.

Ceausu, A., J. Tinsley, A. Way, J. Zhang and P. Sheridan. 2011. Experiments on Domain Adaptation for Patent Machine Translation in the PLuTO project. In Proceedings of the 15th Annual Conference of the European Association for Machine Translation (EAMT-2011), Leuven, Belgium.

Green, T. 1979. The necessity of syntax markers: two experiments with artificial languages. Journal of Verbal Learning and Behavior, 18:481-496.

Isozaki, H., K. Sudoh, H. Tsukada and K. Duh. 2010. Head finalization: A simple reordering rule for SOV languages. In Proceedings of the 5th Workshop on Machine Translation (WMT), Uppsala, Sweden.

Ma, Y., N. Stroppa and A. Way. 2007. Bootstrapping Word Alignment via Word Packing. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007), Prague, Czech Republic, pp. 304-311.

Papineni, K., S. Roukos, T. Ward and W.-J. Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02), pp. 311-318, Philadelphia, PA.

Talbot, D., H. Kazawa, H. Ichikawa, J. Katz-Brown, M. Seno and F. Och. 2011. A Lightweight Evaluation Framework for Machine Translation Reordering. In Proceedings of the Sixth Workshop on Statistical Machine Translation, Edinburgh, Scotland, pp. 12-21.

Tinsley, J., A. Way and P. Sheridan. 2010. PLuTO: MT for Online Patent Translation. In Proceedings of the 9th Conference of the Association for Machine Translation in the Americas (AMTA 2010), Denver, CO, USA.

i This paper is an extended abstract intended to accompany an oral presentation. It is not intended to be a standalone scientific article.

ATLAS - Human Language Technologies integrated within a Multilingual Web Content Management System

Svetla Koeva
Department of Computational Linguistics, Institute for Bulgarian Language
Bulgarian Academy of Sciences
[email protected]

Abstract

The main purpose of the project ATLAS (Applied Technology for Language-Aided CMS) is to facilitate multilingual web content development and management. Its main innovation is the integration of language technologies within a web content management system. The language processing framework, integrated with web content management, provides automatic annotation of important words, phrases and named entities, suggestions for categorisation of documents, automatic summary generation, and machine translation of summaries of documents. A machine translation approach, as well as methods for obtaining and constructing training data for machine translation, are under development.

1 Introduction

The main purpose of the European project ATLAS (Applied Technology for Language-Aided CMS)1 is to facilitate multilingual web content development and management. Its main innovation is the integration of language technologies within a web content management system. ATLAS combines a language processing framework with a content management component (i-Publisher)2 used for creating, running and managing dynamic content-driven websites. Examples of such sites are i-Librarian,3 a free online library of digital documents that may be personalised according to the user's needs and requirements, and EUDocLib,4 a free online library of European legal documents. The language processing framework of these websites provides automatic annotation of important words, phrases and named entities, suggestions for categorisation of documents, automatic summary generation, and machine translation of a summary of a document (Karagyozov et al. 2012). Six European Union languages (Bulgarian, German, Greek, English, Polish, and Romanian) are supported.

2 Brief overview of existing content management systems

The most frequently used open-source multilingual web content management systems (WordPress, Joomla, Joom!Fish, TYPO3, Drupal)5 offer a relatively low level of multilingual content management. None of the platforms supports multiple languages in their native states.

1 http://www.atlasproject.eu

2 http://i-publisher.atlasproject.eu/

3 http://www.i-librarian.eu/

4 http://eudoclib.atlasproject.eu/

5 http://wordpress.com/, http://www.joomla.org/, http://www.joomfish.net/, http://typo3.org/, http://drupal.org/

Instead, they rely on plugins to handle this: WordPress uses the WordPress Multilingual Plugin, Drupal needs a module called Locale, and Joomla needs a module called Joomfish. There are modules, like those provided by ICanLocalize6, that can facilitate the selection within Drupal and WordPress of the material to be translated, but the actual translation is done by human translators. To the best of our knowledge, none of the existing content management systems exploits language technologies to provide more sophisticated text content management. This is borne out by the data published at CMS Critic7, an online medium providing news, reviews, articles and interviews covering about 60 content management systems.

Taking into account that online data are in many cases multilingual, and that documents stored in a content management system are usually related by means of sharing similar topics or domains, it can be claimed that web content management systems need the power of modern language technologies. In comparison, ATLAS offers the advantage of the integration of natural language processing in multilingual content management.

3 Selection of "core" words

ATLAS suggests "core" words (plus phrases and named entities), i.e., the most essential words that capture the main topic of a given document. Currently the selection of core words is carried out in a two-stage process: identification of candidates and ranking. For the identification stage a language processing chain is applied that consists of the following tools: sentence splitter, tokenizer, PoS tagger, lemmatizer, word sense disambiguator (assigns a unique sense to a word), NP extractor (marks up noun phrases in the text) and NE extractor (marks up named entities in the text). After this stage, the candidate core words are ranked according to their importance scores, which are estimated from features such as frequency, linguistic correlation, phrase length, etc., combined by heuristics to obtain the final ranking strategy. The core words are displayed in several groups: named entities (locations, names, etc.), both single words and phrases, and noun phrases, i.e. terms, multiword expressions or noun phrases with a high frequency. For example, among the "core" noun phrases extracted from the Cocoa Fundamentals Guide8 are the following: Object-Oriented Programming, Objective-C language, Cocoa application, Cocoa program, etc. Even though the language processing chains that are applied differ from language to language, this approach offers a common ground for language processing, and its results can be comfortably used by advanced language components such as document classification, clause-based summarisation, and statistical machine translation. Content navigation (such as lists of similar documents) based on interlinked text annotations is also provided.

4 Automatic categorisation

Automatic document classification (assigning a document to one or more domains or categories from a set of labels) is of great importance to a modern multilingual web content management system. ATLAS provides automatic multi-label categorisation of documents into one or more predefined categories. This starts with a training phase, in which a statistical model is created based on a set of features from already labelled documents. There are currently four classifiers, two of which exploit the Naïve Bayesian algorithm, the two others Relative entropy and Class-featured centroid, respectively. In the classifying phase, the model is used to assign one or more labels to unlabelled documents. The results from the different classifiers are combined, and the final classification result is determined by a majority voting system. The automatic text categorisation is at the present stage able to handle documents in Bulgarian and English. For example, the Cocoa Fundamentals Guide is automatically categorised under the domain Computer science, and under the Topics Computer science, Graphics and Design, Database Management, and Programming.
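The voting step can be pictured with a short sketch (classifier internals are omitted; the names are ours):

```java
import java.util.*;

public class VotingClassifier {
    // Each classifier proposes a set of labels; a label is kept if a majority
    // of the classifiers agrees on it.
    static Set<String> classify(List<Set<String>> votes) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> v : votes)
            for (String label : v) counts.merge(label, 1, Integer::sum);
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() > votes.size() / 2) result.add(e.getKey());
        return result;
    }

    public static void main(String[] args) {
        List<Set<String>> votes = List.of(
            Set.of("Computer science", "Programming"),
            Set.of("Computer science", "Graphics and Design"),
            Set.of("Computer science", "Programming"),
            Set.of("Database Management", "Programming"));
        System.out.println(classify(votes)); // [Computer science, Programming]
    }
}
```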

6 http://www.icanlocalize.com/

7 http://www.cmscritic.com/

8 https://developer.apple.com/library/mac/documentation/Cocoa/Conceptual/CocoaFundamentals/CocoaFundamentals.pdf

5 Text summarization

Two different strategies for obtaining summaries are used in ATLAS. The strategy for short texts is based on identification of the discourse structure and produces a summary that can be classified as a type of excerpt; it is thus possible to indicate the length of the summary as a percentage of the original text. Summarisation of short texts in ATLAS draws on the whole language processing chain and also adds a couple of other modules to the chain: clause splitting, anaphora resolution, discourse parsing and summarization. The method used for short texts (Cristea et al. 2005) exploits cohesion and coherence properties of the text to build intermediate structures. Currently, the short text summarisation modules are implemented for English and Romanian.

The strategy for long texts assembles a template summary based on extraction of relevant information specific to different genres; it is for the time being still under development.

6 Machine translation

For i-Publisher, machine translation serves as a translation aid for publishing multilingual content. The ability to display content in multiple languages is combined with computer-aided localization of the templates. Text for a localization is submitted to the translation engine and the output is subject to human post-processing.

For i-Librarian and EuDocLib, and for any website developed with i-Publisher, the machine translation engine provides a translation of the document summary produced earlier in the chain. This gives the user rough clues about documents in different languages, and a basis to decide whether they are to be stored.

6.1 Obtaining training corpora

The development of a translation engine is particularly challenging, as the translation should be usable in different domains and within different text genres. In addition, most of the language pairs in question belong to the less-resourced group, for which bilingual training and test material is available only in limited amounts (Gavrila and Vertan 2011). For instance, parallel corpora incorporating Bulgarian are relatively small and usually domain-specific, with mostly literary or administrative texts. ATLAS’ administrative subcorpus contains texts from EU legislation created between the years 1958 and 2011, available as online repositories, i.e., the EuroParl Corpus (Koehn 2005) and the JRC-Acquis (Steinberger et al. 2006), and includes all the accessible texts in the target languages. The scientific/administrative subcorpus consists of administrative texts published by the European Medicines Evaluation Agency (EMEA) in the years between 1978 and 2009; it is part of the OPUS collection (Tiedemann 2009). The mass media subcorpus contains news reports as well as some other journalistic texts published in nine Balkan languages and English from October 2002 until the present day on the East Europe information website9. The fiction subcorpus was compiled manually by harvesting freely available texts on the Internet, by scanning, and from donations by authors; so far, it consists of texts in Bulgarian, English, and German. The subcorpus of informal texts consists of film subtitles: feature films, documentaries, and animations, all part of the OPUS collection (Tiedemann 2009). Automatic collection of corpora is preferred to manual collection, and for that purpose a set of simple crawlers was designed; they are modified for each source to ensure efficiency. Figure 1 presents some statistical data for the Bulgarian-English parallel corpus, the largest in the collection (the vertical axis shows the number of words, the horizontal axis the domain distribution).

[Figure 1: Bulgarian-English parallel corpus. The original bar chart plots the number of words (0 to 40,000,000) per domain (Administrative, Science, Massmedia, Fiction, Informal) for Bulgarian and English.]

Two basic methods are used to enlarge the existing parallel corpora. In the first, the available training data for statistical machine translation are extended by means of generated paraphrases (e.g. compound nouns are paraphrased into (semi-)equivalent phrases with a preposition, and vice versa). The paraphrases can be classified as morphological (where the difference is between the forms of the phrase constituents), lexical (based on semantic similarity between constituents) and phrasal (based on syntactic transformations). Paraphrase generation methods that operate on a single monolingual corpus or on a parallel corpus are discussed by Madnani and Dorr (2010). For instance, one of the methods for paraphrase generation from a monolingual corpus considers as paraphrases all words and phrases that are distributionally similar, that is, occurring with the

9 http://setimes.com/

same sets of anchors (Paşca and Dienes 2005). An approach using phrase-based alignment techniques shows how paraphrases in one language can be identified using a phrase in a second language as a pivot (Bannard and Callison-Burch 2005).

The second method performs automatic generation of parallel corpora (Xu and Sun 2011) by means of automatic translation. It can be applied for language pairs for which parallel corpora are still limited in quantity. If, say, a Bulgarian-English parallel corpus exists, a Bulgarian-Polish parallel corpus can be constructed by automatically translating from English to Polish. To control the quality of the automatically generated data, multiple translation systems can be used and the compatibility of the translated outputs calculated. Thus, both methods can fill gaps in the available data, the first by extending existing parallel corpora and the second by automatic construction of parallel corpora.
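As a toy illustration of the pivot technique of Bannard and Callison-Burch (2005) mentioned above, the following sketch accumulates paraphrase probabilities through a second language; the phrase-table fragments and probabilities are invented for the example, and this is not code from the ATLAS project:

    from collections import defaultdict

    def pivot_paraphrases(e2f, f2e):
        """Paraphrase scores via a pivot language:
        p(e2 | e1) = sum over pivot phrases f of p(f | e1) * p(e2 | f).

        e2f: maps an English phrase e1 to {f: p(f | e1)}
        f2e: maps a pivot phrase f to {e2: p(e2 | f)}
        Both tables would normally be read from a phrase table.
        """
        paraphrases = defaultdict(lambda: defaultdict(float))
        for e1, pivots in e2f.items():
            for f, p_f_given_e1 in pivots.items():
                for e2, p_e2_given_f in f2e.get(f, {}).items():
                    if e2 != e1:  # a phrase is not its own paraphrase
                        paraphrases[e1][e2] += p_f_given_e1 * p_e2_given_f
        return paraphrases

    # Invented phrase-table fragments:
    e2f = {"under control": {"unter kontrolle": 0.8}}
    f2e = {"unter kontrolle": {"under control": 0.7, "in check": 0.3}}
    print(dict(pivot_paraphrases(e2f, f2e)["under control"]))
    # {'in check': 0.24}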
6.2 Accepted approach

Given that the ATLAS platform deals with languages from different language families and that the engine should support several domains, an interlingua approach is not suitable. Building transfer systems for all language pairs is also time-consuming and does not make the platform easily portable to other languages. When all requirements and limitations are taken into account, corpus-based machine translation paradigms are the best option that can be considered (Karagyozov et al. 2012). For the ATLAS translation engine it was decided to use a hybrid architecture combining example-based and statistical machine translation at the word-based level (i.e., no syntactic trees are used).

The ATLAS translation engine interacts with other modules of the system. For example, the document categorisation module assigns one or more domains to each document, and if no specifically trained translation model for the respective domain exists, the user gets a warning that the translation may be inadequate with respect to lexical coverage. Each input item to the translation engine is then processed by the example-based machine translation component. If the input as a whole, or important chunks of it, are found in the translation database, the translation equivalents are used and, if necessary, combined (Gavrila 2011). In all other cases the input is sent further to the Moses-based machine translation component, which uses a part-of-speech and domain-factored model (Niehues and Waibel 2010). Like the architecture of the categorization engine, the translation system in ATLAS is able to accommodate and use different third-party translation engines, such as those of Google, Bing, and Yahoo.

The ATLAS machine translation module is still under development. Some experiments in translation between English, German, and Romanian have been performed in order to determine what parameter settings are suitable for language pairs with a rich morphology, what tuning steps lead to significant improvements, and whether the PoS-factored models significantly improve the quality of the results (Karagyozov et al. 2012).

7 Conclusion

To conclude, ATLAS enables users to create, organise and publish various types of multilingual documents. ATLAS reduces the manual work by using automatic classification of documents and helps users to decide about a document by providing summaries of documents and their translations. Moreover, the user can easily find the most relevant texts within large document collections and get a brief overview of their content. A modern web content management system should help users come to grips with the growing complexity of today’s multilingual websites. ATLAS answers this need.

Acknowledgments

ATLAS (Applied Technology for Language-Aided CMS) is a European project funded under the CIP ICT Policy Support Programme, Grant Agreement 250467.

References

Bannard and Callison-Burch 2005: Bannard, Colin and Chris Callison-Burch. Paraphrasing with bilingual parallel corpora. In Proceedings of ACL, pages 597-604, Ann Arbor, MI.

Cristea et al. 2005: Cristea, D., Postolache, O., Pistol, I. (2005). Summarisation through Discourse Structure. Computational Linguistics and Intelligent Text Processing, 6th International Conference CICLing 2005 (pp. 632-644). Mexico City, Mexico: Springer LNCS, vol. 3406.

Gavrila 2011: Gavrila, M. Constrained recombination in an example-based machine translation system. In M. L. Vincent Vondeghinste (Ed.), 15th Annual Conference of the European Association for Machine Translation, Leuven, Belgium, pp. 193-200.

Gavrila and Vertan 2011: Gavrila, Monica and Cristina Vertan. Training data in statistical machine translation - the more, the better? In Proceedings of the RANLP-2011 Conference, September 2011, Hissar, Bulgaria, pp. 551-556.

Karagyozov et al. 2012: Diman Karagiozov, Anelia Belogay, Dan Cristea, Svetla Koeva, Maciej Ogrodniczuk, Polivios Raxis, Emil Stoyanov and Cristina Vertan. i-Librarian - Free online library for European citizens. In Infotheca, Belgrade, to appear.

Koehn 2005: Koehn, Ph. Europarl: A Parallel Corpus for Statistical Machine Translation. Proceedings of MT Summit, pp. 79-86.

Madnani and Dorr 2010: Nitin Madnani and Bonnie Dorr. 2010. Generating phrasal and sentential paraphrases: A survey of data-driven methods. Computational Linguistics, 36(3), pp. 341-388.

Niehues and Waibel 2010: Niehues, Jan and Alex Waibel. Domain Adaptation in Statistical Machine Translation using Factored Translation Models. Proceedings of EAMT 2010, Saint-Raphael.

Paşca and Dienes 2005: Paşca, Marius and Péter Dienes. 2005. Aligning needles in a haystack: Paraphrase acquisition across the Web. In Proceedings of IJCNLP, Jeju Island, pp. 119-130.

Steinberger et al. 2006: Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D., Varga, D. (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. Proceedings of LREC 2006. Genoa, Italy.

Tiedemann 2009: Tiedemann, J. News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces. In: N. Nicolov, K. Bontcheva, G. Angelova, R. Mitkov (eds.) Recent Advances in Natural Language Processing (vol. V), John Benjamins, Amsterdam/Philadelphia, pp. 237-248.

Xu and Sun 2011: Jia Xu and Weiwei Sun. Generating virtual parallel corpus: A compatibility centric method. In Proceedings of the Machine Translation Summit XIII.

Tree-based Hybrid Machine Translation

Andreas Kirkedal
Centre for Computational Modelling of Language
Institute for International Language Studies and Computational Linguistics
Copenhagen Business School
[email protected]

Abstract

I present an automatic post-editing approach that combines translation systems which produce syntactic trees as output. The nodes in the generation tree and the target-side SCFG tree are aligned and form the basis for computing structural similarity. Structural similarity computation aligns subtrees and, based on this alignment, subtrees are substituted to create more accurate translations. Two different techniques have been implemented to compute structural similarity: leaves and tree-edit distance. I report on the translation quality of a machine translation (MT) system where both techniques are implemented. The approach shows significant improvement over the baseline for MT systems with limited training data and structural improvement for MT systems trained on Europarl.

1 Introduction

Statistical MT (SMT) and rule-based MT (RBMT) have complementary strengths, and combining their output can improve translation quality. The underlying models in SMT lack linguistic sophistication when compared to RBMT systems, and there is a trend towards incorporating more linguistic knowledge by creating hybrid systems that can exploit the linguistic knowledge contained in hand-crafted rules and the knowledge extracted from large amounts of text.

Hierarchical phrases (Chiang, 2005) are encoded in a tree structure just as linguistic trees are. Most RBMT systems also encode the analysis of a sentence in a tree. The rules generating hierarchical trees are inferred from unlabeled corpora, and RBMT systems use hand-crafted rules grounded in linguistic knowledge. While the trees are generated differently, alignments between nodes and subtrees in the generation phase can be computed. Based on the computed alignments, substitution can be performed between the trees.

The automatic post-editing approach proposed in this paper is based on structural similarity. The tree structures are aligned and subtree substitution based on the similarity of subtrees is performed. This knowledge-poor approach is compatible with the surface-near nature of SMT systems, does not require information other than what is available in the output, and ensures that the approach is generic so it can, in principle, be applied to any language pair.

2 Hybrid Machine Translation

Hybrid machine translation (HMT) is a paradigm that seeks to combine the strengths of SMT and RBMT. The different approaches have complementary strengths and weaknesses (Thurmair, 2009), which have led to the emergence of HMT as a subfield in machine translation research.

The strength of SMT is robustness - i.e. it will always produce an output - and fluency, due to the use of language models. A weakness of SMT is the lack of explicit linguistic knowledge, which makes translation phenomena requiring such information, e.g. long-distance dependencies, difficult to handle.

RBMT systems translate more accurately in cases without parse failure, since they can take more information into account, e.g. morphological, syntactic or semantic information, where SMT only uses surface forms. RBMT systems often suffer from lack of robustness when parsing fails and in lexical selection in transfer. RBMT systems are also very costly to build, and maintenance and development can be very complex, e.g. due to the interdependency of rules.

The post-editing approach attempts to incorporate the linguistic knowledge encoded in target-side dependency trees into hierarchical trees produced by an SMT system.

2.1 Related work

System combinations by coupling MT systems serially or in parallel have been attempted before, e.g. via hypothesis selection (Hildebrand and Vogel, 2008), by combining translation hypotheses locally using POS tags (Federmann et al., 2010) or by statistical post-editing (SPE) (Simard et al., 2007). In hypothesis selection approaches, a number of MT systems produce translations for an n-best list and use a re-ranking module to rescore the translations. Using this approach, the best improvements are achieved with a large number of systems running in parallel, and this is not feasible in a practical application, mostly due to the computational resources required by the component systems. The translations will also not be better than the one produced by the best component system. Tighter integrations of rule-based and statistical approaches have also been proposed: adding probabilities to parse trees, pre-translation word reordering, enriching the phrase table with output phrases from a rule-based system (Eisele et al., 2008), creating training data from RBMT systems etc. The factored translation models also present a way to integrate rule-based parsing systems.

The automatic post-editing approach proposed here does not exactly fit the classification of parallel coupling approaches in Thurmair (2009). Other coupling architectures with post-editing work on words or phrases and generate confusion networks or add more information to identify substitution candidates, while the units focused on here are graphs, and no additional information is added to the MT output. This approach does select a skeleton upon which transformations are conducted, as in Rosti et al. (2007), and requires the RBMT system to generate a target-side language analysis which must be available to the post-editing systems, but it does not require a new syntactic analysis of noisy MT output. The architecture of the hybrid system used in this paper is parallel coupling with post-editing. A diagram of the implemented systems can be seen in Figure 1. The dark grey boxes represent pre-existing modules and open source software, and the light grey boxes represent the additional modules developed to implement the post-editing approach.

[Figure 1: Hybrid system architecture.]

2.2 RBMT Component

The Danish to English translation engine in GramTrans (Bick, 2007) is called through an API. The output is a constraint grammar (CG) analysis on the target language side after all transfer and target-side transformation rules have been applied. Example output is shown in Figure 2. In the analysis, dependency information is provided, and it forms the basis for creating the tree used for structural similarity computation. Part-of-speech tags, source and target surface structure, sentence position and dependency information are extracted from the CG analysis.

    Jeg [jeg] 1S NOM @SUBJ #1->2
    arbejder [arbejde] V PR AKT @FS-STA #2->0
    hjemme [hjemme] ADV LOC @2
    . [.] PU @PU #4->0

Figure 2: Disambiguated CG representation for "I work at home". Dependency annotation is indicated by the #-character.

GramTrans is created to be robust and to produce as many dependency markings as possible to be used in later translation stages.
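The extraction step described above is not shown in the paper; a minimal sketch of pulling forms, lemmas, tags and the #-style dependency markers out of CG lines like those in Figure 2 might look as follows (the regular expression and node representation are assumptions, not GramTrans code):

    import re

    # Match "form [lemma] TAGS #child->head" on one CG output line.
    CG_LINE = re.compile(r"^(\S+)\s+\[([^\]]*)\]\s+(.*?)\s*#(\d+)->(\d+)\s*$")

    def parse_cg(analysis):
        """Return a list of nodes (position, form, lemma, tags, head_position)."""
        nodes = []
        for line in analysis.splitlines():
            m = CG_LINE.match(line.strip())
            if not m:
                continue  # skip lines without a dependency marker
            form, lemma, tags, pos, head = m.groups()
            nodes.append((int(pos), form, lemma, tags.split(), int(head)))
        return nodes

    example = """Jeg [jeg] 1S NOM @SUBJ #1->2
    arbejder [arbejde] V PR AKT @FS-STA #2->0
    . [.] PU @PU #4->0"""
    for node in parse_cg(example):
        print(node)
    # (1, 'Jeg', 'jeg', ['1S', 'NOM', '@SUBJ'], 2) ...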
Errors in the assignment of functional tags propagate to the dependency level and can result in markings that will produce a dependency tree and a number of unconnected subgraphs with circularities. This presents a problem if the dependency markings are the basis for creating a dependency tree, because it is not straightforward to reattach a subgraph correctly when the grammatical tags cannot be relied upon.

2.3 SMT Component

A CKY+ algorithm for chart decoding is implemented in Moses (Koehn et al., 2007) for tree-based models and is used as the SMT component system in this paper.

Hierarchical phrases are phrases that can contain subphrases, i.e. a hierarchical phrase contains non-terminal symbols. An example rule from Danish to English:

    X1 i øvrigt X2 → moreover, X1 X2

Xn is a nonterminal and the subscript identifies how the nonterminals are aligned. The hierarchical phrases are learned from bitext with unannotated data and are formally productions from a synchronous context-free grammar (SCFG); they can be viewed as a move towards syntax-based SMT (Chiang, 2005). Since hierarchical phrases are not linguistic, Chiang makes a distinction between linguistically syntax-based MT and formally syntax-based MT, where hierarchical models fall in the latter category because the structures they are defined over are not linguistically informed, i.e. unannotated bitexts.

A hierarchical model is based on an SCFG and the elementary structures are rewrite rules:

    X → ⟨γ, α, ~⟩

As above, X is a nonterminal, γ and α are both strings of terminals and nonterminals, and ~ is a 1-to-1 correspondence between nonterminals in γ and α. As shown previously, the convention is to use subscripts to represent ~.

To maintain the advantage of the phrase-based approach, glue rules are added to the rules that are otherwise learned from raw data:

    S → ⟨S1 X2, S1 X2⟩
    S → ⟨X1, X1⟩

Only these rewrite rules contain the nonterminal S. These rules are added to give the model the option of combining partial hypotheses serially, and they make the hierarchical model as robust as the traditional phrase-based approaches.
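Purely for illustration (this is not the Moses decoder), the following sketch shows how the target side of such a synchronous rule can be instantiated once target strings have been derived for the linked nonterminals:

    # A rule is a pair of right-hand sides; "X1"/"X2" mark linked nonterminals.
    rule = (("X1", "i", "øvrigt", "X2"), ("moreover", ",", "X1", "X2"))

    def apply_rule(rule, fillers):
        """fillers maps each linked nonterminal to the target-side string
        already derived for it, e.g. {"X1": "we agree", "X2": "completely"}."""
        _, target_rhs = rule
        return " ".join(fillers.get(symbol, symbol) for symbol in target_rhs)

    print(apply_rule(rule, {"X1": "we agree", "X2": "completely"}))
    # moreover , we agree completely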
The Moses chart decoder was modified to output trace information from which the n-best hierarchical trees can be reconstructed. The trace information contains the derivations which produce the translation hypotheses.

The sentence-aligned Danish-English part of Europarl (Koehn, 2005) was used for training, and to tune parameters with MERT, the test set from the NAACL WMT 2006 was used (Koehn and Monz, 2006). GIZA++ aligns hierarchical phrases, which were extracted by Moses to train a translation model, and a language model was trained with SRILM (Stolcke, 2002). Moses was trained using the Experimental Management System (EMS) (Koehn, 2010) and the configuration followed the standard guidelines in the syntax tutorial.1 To train SRILM, the English side of Europarl was used.

1 http://www.statmt.org/moses/?n=Moses.SyntaxTutorial

3 Matching Approach

The post-editing approach relies on structures output by the component systems. It is necessary to find similar structures to perform subtree substitution. Matching structures is a problem in several application areas such as the semantic web, schema and ontology integration, query mediation etc. Structures include database schemas, directories, diagrams and graphs. Shvaiko and Euzenat (2005) provide a comprehensive survey of matching techniques.

The matching operation determines an alignment between two structures, and an alignment is a set of matching elements. A matching element is a quintuple ⟨id, e, e′, n, R⟩:

id: Unique id.
e, e′: Elements from different structures.
n: Confidence measure.
R: The relation holding between the elements.
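One possible in-memory representation of this quintuple, with field names of my own choosing (this is not the paper's code):

    from dataclasses import dataclass
    from typing import Any

    @dataclass(frozen=True)
    class MatchingElement:
        id: int          # unique id
        e: Any           # node from the first structure (dependency tree)
        e_prime: Any     # node from the second structure (hierarchical tree)
        n: float         # confidence measure
        relation: str    # relation between the elements; here always "="

    m = MatchingElement(0, "katten", "the cat", 0.9, "=")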

[Figure 4: The refined alignment from dependency tree to hierarchical tree. The original diagram, showing a hierarchical S/X tree and a dependency tree over the words "the cat sat on the mat" with alignment links between their nodes, did not survive text extraction.]

The resources that can be used in the matching process are shown in Figure 3. o and o′ are the structures to be matched, A is an optional existing alignment, r stands for external resources, p for parameters, weights and thresholds, and A′ is the set of matching elements created by the process. In this paper, only matching elements with an equivalence relation (=) are used.

[Figure 3: The matching process.]

The returned alignment can be a new alignment or a refinement of A. o will be a dependency tree and o′ the hierarchical trees from the SMT component system. To compute the initial alignment A between hierarchical and dependency trees, the source to target language phrase alignment output by the component systems is used. So the initial alignment between leaf nodes in target-side trees is computed over the alignment to the source language.

An important decision regarding this hybrid approach is how to compute the alignment and the size of the substituted subtrees. Irrespective of which technique is chosen to compute structural similarity, the resulting alignment should be refined to contain matching elements between internal nodes, as shown in Figure 4.

3.1 Alignment Challenges

The change made to the chart decoder to output the n-best trace information is simple and does not output the alignment information. Currently, the tree extraction module computes an alignment between the source and target language phrases.

The segmentation of words into phrases done by Moses does not always correspond to the word-based segmentation required by the CG parser; phrases recognised by the CG parser rarely correspond to phrases in Moses, and the hierarchical phrase alignment is not easy to handle.

Aligning hierarchical phrases like (a) in Figure 5 is not complicated. The ordering is identical and the Danish word offentliggøres is aligned to will be published. The numbers 1-3 refer to the alignment of non-terminal nodes based on phrase positions.

It is more complicated to align (b) in Figure 5. There are two methods of handling this type of alignment appropriate for the component systems. Because there are an equal number of tokens in the English phrase and the Danish phrase, aligning the tokens 1-1 monotonically would be a solution that, in this case, results in a correct alignment. Another approach relies on weak word reordering between Danish and English and would align findes with there are. This reduces the alignment problem to aligning vi der with we. In this case, the alignment is noisy, but usable for creating matching elements. Both approaches are implemented in the hybrid system and the first approach supersedes the second due to the advantage of correlating with the CG approach.

    (a) offentliggøres X : X → will be published X : 1-3
    (b) vi X der X findes : X → X, we X there are : 1-3 3-0

Figure 5: Simplified example of a simple alignment.

An initial element-level alignment between nodes in a dependency tree and a hierarchical tree is computed over the source language and creates a set of matching elements containing aligned nodes.

3.2 Alignment Refinement

Between a dependency and a hierarchical tree, an element-level alignment needs to be refined to a structure-level alignment similar to the one in Figure 4.
Not all matching elements in an initial alignment should be refined; e.g. if both nodes in a matching element are leaf nodes, no refinement is needed. Criteria for selecting initial matching elements for refinement are needed.

In the RBMT output, there are no indications of where the parser encountered problems. If a surface form is an out-of-vocabulary (OOV) word, the morphological analyser is used to assign a lexical category based on the word form, hypothesise additional tags based on the analysis and proceed with parsing. In the SMT output, an OOV marker is appended to a surface form to indicate that the word has not been translated. The marker gives an indication of where enriching a hierarchical tree with RBMT output can result in an improvement of translation quality.

Based on these observations, hierarchical trees are chosen to function as skeletons. Substituting dependency subtrees into a hierarchical tree is more straightforward than using dependency trees as skeletons. It was not possible to identify head-dependent relations based solely on the information contained in hierarchical subtrees, while removing subtrees from hierarchical trees and inserting dependency subtrees does not destroy linguistic information in the tree, and dependency subtrees can easily be transformed into a hierarchical-style subtree.

Leaves: Based on the OOV marker, a matching technique based on leaf nodes is implemented to refine matching elements and, based on this alignment, substitute hierarchical subtrees with dependency subtrees.

The dependency subtree is identified by collecting all descendants of a node. The descendants are handled as leaf nodes because both leaf and nonterminal nodes contain surface forms in a dependency tree.

The dependency trees provided by GramTrans are not always projective. Subtrees may not represent a continuous surface structure, and a continuous subtree must be isolated before an alignment between subtrees can be found, because the hierarchical trees resemble phrase structure trees and discontinuous phrases are handled using glue rules.

To identify the corresponding subtree in the hierarchical tree, the matching elements that contain the nodes in the dependency subtree are collected and a path from each leaf node to the root node is computed. The intersection of nodes is retrieved and the root node of the subtree is identified as the lowest node present in all paths. It is not always possible to find a common root node besides the root node of the entire tree. To prevent the loss of a high amount of structural information, the root node cannot be replaced or deleted.
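A small sketch of this root-identification step, assuming the tree is available as a child-to-parent map (the representation is an assumption, not the paper's code):

    def path_to_root(node, parent):
        """parent maps a node id to its parent id; the root maps to None."""
        path = [node]
        while parent[path[-1]] is not None:
            path.append(parent[path[-1]])
        return path

    def subtree_root(leaves, parent):
        """Lowest node present in the root paths of all given leaves."""
        paths = [path_to_root(leaf, parent) for leaf in leaves]
        common = set(paths[0]).intersection(*map(set, paths[1:]))
        # the lowest common node is the one met first on any root path
        for node in paths[0]:
            if node in common:
                return node

    # Toy tree: 0 is the root, 1 and 2 its children, 3 and 4 children of 1.
    parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 1}
    print(subtree_root([3, 4], parent))  # 1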
3.3 Substitution based on an edit script

An algorithm for computing structural similarity is the Tree Edit Distance (TED) algorithm, which computes how many operations are necessary for transforming one tree into another tree. Following Zhang and Shasha (1989) and Bille (2005), the operations are defined on nodes and the trees are ordered, labelled trees. There are 3 different edit operations:

rename: Change the label of a node in a tree.
delete: Remove a node n from a tree. Insert the children of n as children of the parent of n so the sequence of children is preserved. The deleted node may not be the root node.
insert: Insert a node as the child of a node n in a tree. A subsequence of children of n are inserted as children of the new node so the sequence of children is preserved. An insertion is the inverse operation of a deletion.

A cost function is defined for each operation. The goal is to find the sequence of edit operations that turns a tree T1 into another tree T2 with minimum cost. The sequence of edit operations is called an edit script, and the cost of the optimal edit script is the tree edit distance.

The cost functions should return a distance metric and satisfy the following conditions:

    1. γ(i → j) ≥ 0 and γ(i → i) = 0
    2. γ(i → j) = γ(j → i)
    3. γ(i → k) ≤ γ(i → j) + γ(j → k)

γ is the cost of an edit operation.

The edit distance mapping is a representation of an edit script. A rename operation is represented as (i1 → j2), where the subscripts denote that the nodes i and j belong to different trees. (i1 → Λ) represents a deletion and (Λ → j2) an insertion. The cost of an edit distance mapping is given by:

    γ(M) = Σ_{(i,j)∈M} γ(i → j) + Σ_{i∈T1\M} γ(i → Λ) + Σ_{j∈T2\M} γ(Λ → j)

where i ∈ T1 \ M denotes the nodes in T1 left unmapped by M, and analogously for T2.

It is important to note that the trees are ordered trees. The unordered version of the tree edit distance problem is NP-hard, while polynomial algorithms based on dynamic programming exist for ordered trees.

The algorithm does not require an input alignment or external resources. The cost functions for deletion, insertion and renaming must be defined on the information present in the nodes, and a unique id must be assigned to the nodes. This id is assigned by traversing the tree depth-first and assigning an integer as id. The algorithm visits each node in the trees in post order and determines, based on the cost assigned by the cost functions, which edit operation should be performed.

To generate matching elements that align dependency nodes to nonterminal hierarchical nodes, the cost functions for edit operations are modified to assign a lower cost to rename operations where one of the nodes is a hierarchical nonterminal node. If two nodes have the same target and source phrase, a rename operation does not incur any cost, and neither does the renaming of untranslated phrases. This ensures that matching elements from the initial alignment that do not require refinement are not altered. Also, if the source is the same and the difference in sentence position is no more than five, the renaming cost is reduced. Experiments showed that a window of five words was necessary to account for differences in sentence position and prevent alignment to nodes later in the sentence with the same source phrase. This technique is independent of the OOV marker and creates a structure-level alignment.
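For concreteness, the following is a compact ordered-forest edit distance with unit costs in the spirit of Zhang and Shasha (1989); this naive memoized recursion is far less efficient than their dynamic-programming algorithm, and the tuple tree encoding is an assumption:

    from functools import lru_cache

    def ted(t1, t2, rename_cost=lambda a, b: 0 if a == b else 1):
        """Edit distance between ordered, labelled trees (label, (children...)),
        with unit-cost insertions and deletions."""
        @lru_cache(maxsize=None)
        def d(f1, f2):  # f1, f2 are forests: tuples of trees
            if not f1 and not f2:
                return 0
            if not f1:  # insert every remaining node of f2
                (_, c2), rest2 = f2[0], f2[1:]
                return 1 + d((), c2 + rest2)
            if not f2:  # delete every remaining node of f1
                (_, c1), rest1 = f1[0], f1[1:]
                return 1 + d(c1 + rest1, ())
            (l1, c1), rest1 = f1[0], f1[1:]
            (l2, c2), rest2 = f2[0], f2[1:]
            return min(
                1 + d(c1 + rest1, f2),   # delete the leftmost root of f1
                1 + d(f1, c2 + rest2),   # insert the leftmost root of f2
                rename_cost(l1, l2) + d(c1, c2) + d(rest1, rest2),
            )
        return d((t1,), (t2,))

    hier = ("S", (("X", (("the", ()), ("cat", ()))), ("sat", ())))
    dep  = ("S", (("NP", (("the", ()), ("cat", ()))), ("sat", ())))
    print(ted(hier, dep))  # 1: rename X -> NP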
The substitutions performed can be of very high quality, but some untranslated words might not be handled. If the system finds any OOV words in the hierarchical tree after substitution, a rename operation is carried out on the node.

The extracted matching elements are noisy because they rely on the noisy source to target language alignment, and the RBMT engine can also produce an inaccurate translation, making the substitution counter-productive. Further limitations on the cost functions become too restrictive and produce too few matching elements. To avoid some of the noise, all permutations of applying substitutions based on the edit script are generated and re-ranked, and the highest scoring hypothesis is chosen as the translation.

3.4 Generation

To ensure that the surface string generated from the newly created tree will have the correct word ordering, the dependency subtree is transformed before being inserted into the hierarchical tree. To create the insertion tree, the dependency nodes are inserted as leaf nodes of a dummy node. The dummy node is inserted before the root node of the aligned hierarchical subtree, and the information on the root node is copied to the new node. Subsequently, the hierarchical nodes are removed from the tree. If both nodes in a matching element are leaf nodes, the hierarchical node is relabeled with information from the dependency node.

4 Experiments

The experiments have been conducted between Danish and English. The language model trained with EMS is used to re-rank translation alternatives. BLEU (Papineni et al., 2002), TER (Snover et al., 2006) and METEOR (Banerjee and Lavie, 2005) scores will be reported.

4.1 Experimental Setup

Two sets of five experiments have been conducted. The first set of experiments uses the initial 100,000 lines from Europarl for training Moses, and the second set uses the full Europarl corpus of ca. 1.8 million sentences. The SMT baseline is the hierarchical version of Moses.

TED Skeleton Selection: The impact of choosing the translation hypothesis with a minimal edit distance to the dependency tree from the rule-based system is investigated. In one setting, the cost functions adhere to the constraints of computing a distance metric. Two settings test the impact of biasing the insertion and deletion cost functions to assign a lower cost to inserting/deleting nonterminals, i.e. turning the dependency tree into the hierarchical tree and vice versa. TED is computed for 20 translation hypotheses and the best performing setting is reported.

Leaves: An experiment using the leaves technique has been conducted. The experiment is performed using the best hypothesis from Moses and also using TED to choose the most structurally similar skeleton. The best setting will be reported.

Lexical substitution: To be able to compare a more naive approach, subtree substitution based on the initial element-level alignment between leaf nodes is used. In this approach, a subtree is one node. The technique is identical to using the RBMT lexicon to look up untranslated words and inserting them in the translation.
TED-R: An experiment where the mappings that represent a rename operation, which are produced during TED computation, are extracted and used as matching elements is conducted. Mapping elements containing only punctuation or the root node of either tree are discarded. All combinations of substitutions based on the extracted matching elements are performed, and the highest ranking hypothesis according to a language model is chosen as the final translation.

The extracted matching elements may not incorporate all the untranslated nodes. All untranslated nodes are subsequently translated using lexical substitution as mentioned above. The subtrees inserted into the hierarchical tree undergo the same transformation as the subtrees inserted using the leaves technique.

This experiment is evaluated both using the 1-best hierarchical tree as skeleton and choosing the skeleton using TED. All three settings are tested and the best performing experiment is reported.

4.2 Evaluation

The results of the automatic evaluation can be seen in Table 1. Skeleton indicates that TED was used to pick the hierarchical tree. The best evaluations are in bold.

    System                      BLEU            TER             METEOR
    RBMT baseline               19.35           64.54           53.19
    SMT baseline                30.16 (22.63)   57.16 (63.10)   59.51 (50.72)
    Lexical substitution        30.53 (25.28)   56.40 (60.56)   61.22 (57.24)
    Leaves technique            29.06 (21.96)   57.96 (64.80)   60.09 (54.32)
    TED skeleton (any bias)     30.16 (22.63)   57.08 (62.98)   59.46 (50.75)
    TED-R 1-best                29.78 (25.16)   57.25 (60.51)   59.87 (57.31)
    TED-R skeleton (any bias)   29.99 (25.18)   56.72 (60.44)   60.79 (57.34)

Table 1: Automatic evaluation. 100k experiments in parentheses.

100k: The RBMT baseline is outperformed by all hybrid configurations, though it does have a higher METEOR score than the SMT baseline and skeleton selection. Lexical substitution and TED-R obtain an increase of ca. 2.5 BLEU, 4 TER and 4 METEOR points over the best baseline scores. The leaves technique decreases the metrics except for METEOR, and the skeleton selection only shows an insignificant improvement.

Europarl: Only lexical substitution improves all metrics over the baseline. Using the leaves technique again results in a decrease in BLEU and TER, but improves METEOR. The impact of skeleton selection is similar to the previous experiments, but the effect of skeleton selection in TED-R has become larger.

Manual Evaluation: The evaluators rank 20 sentences randomly extracted from the test set on a scale from 1-5, with 5 being the best, and it is possible to assign the same score to multiple translation alternatives. This evaluation was inspired by the sentence ranking evaluation in Callison-Burch et al. (2007). The five sentences to be evaluated come from the RBMT and SMT baselines, lexical substitution, the leaves technique and TED-R skeleton, and the evaluators are 5 Danes who have studied translation with English as a second language and 3 native English speakers.

    System      1   2   3   4   5   Avg. rank
    SMT         53  64  30  12  1   2.025
    RBMT        14  48  61  29  8   2.806
    Lex. sub.   3   33  63  58  3   3.156
    Leaves      6   33  61  55  5   3.125
    TED-R       3   35  46  55  21  3.35

Table 2: Rankings from the manual evaluation of the second set of experiments.

The baseline systems make up 85% of the lowest ranking. The distribution between systems is more even for the second lowest ranking, with the baselines only accounting for 52.6%. In the middle ranking, the top scorer is lexical substitution, with a small margin to the RBMT baseline and the leaves technique. The many assignments of rank 3 could indicate that many of the translations produced can be used for gisting, i.e. getting an impression of what information the source text conveys, but not enough to give a complete understanding; it can also be a result of 3 being the middle value, chosen when the evaluators are in doubt. Lexical substitution is also the top scorer in the second-best ranking, followed closely by the other hybrid configurations, and the hybrid systems account for 80.3% of the second-best rankings. TED-R receives more top rankings than the other systems combined (55.3%). The RBMT baseline achieves the second-most top rankings.
This can be attributed to the cases where the rules did not encounter unknown words and created very accurate translations, as is the hallmark of RBMT.

5 Discussion

It is not surprising that lexical substitution achieves a significant increase in all metrics. The approach only translates untranslated words using the RBMT lexicon. This can improve the translation or, because of noisy matching elements, introduce wrong words, but the penalty incurred for untranslated words and wrongly translated words is the same if the number of tokens is similar. Further, lexical substitution does not rely on structural similarity and can avoid the potential sources of errors encountered at a later processing stage.

Skeleton selection has little impact on the metrics, and distinct derivations can result in the same surface structure, giving the same scores, but it is evident that finding the most similar tree improves substitution.

The improvements observed in the 100k experiments are not evident in the metrics when the full Europarl data is used. The more powerful SMT system is able to handle more translations, but manual evaluation reveals a distribution where the majority of rankings for the baseline systems are in the lower half and rankings for the hybrid systems tend more towards the mid-to-upper rankings, with TED-R having more distribution around the second-best and highest score. This indicates that the approach creates more accurate translations.

The leaves technique consistently underperforms lexical substitution, but manual evaluation shows a high correlation between the two methods and their average ranks are similar. TED-R is ranked higher than the leaves technique in the metrics, and manual evaluation also ranks TED-R higher than lexical substitution. This suggests that the extra surface structure removed is not present in the reference translation and that TED-R is a better implementation of the post-editing approach.

Subtree substitution, whether using leaves or TED, does not handle parentheses, hyphens and numbers well. The structure severely degrades when performing substitution near these environments. The example in Table 3 shows the errors made by the substitution algorithm. An entire subphrase is duplicated using the leaves technique, which introduces an opening parenthesis with no closing counterpart and includes the erroneous translation came, while TED-R duplicates / 1999 - 1999 / 2208.

    SMT     ( COM ( 1999 ) 493 - C5-0320 / 1999 - 1999 / 2208 ( COS ) )
    Leaves  ( came ( 1999 ) 493 - C5-0320/1999-1999/2208 ( COM COS ) ) - C5-0320 / 1999 - 1999 / 2208 (
    TED-R   ( COM ( 1999 ) 493 - C5-0320/1999-1999/2208 / 1999 - 1999 / 2208 ( COS ) )

Table 3: Substitution of numbers.

The reason for these wayward substitutions can be found in the dependency tree. The matching parentheses are not part of the same subtree, and this is the root cause of the problem. The leaves technique is very sensitive to these errors, and there is no easy way to prevent spurious parentheses from being introduced. Re-ranking in TED-R could filter these hypotheses out, but because the re-ranking module cannot model this dependency, the sentences with these errors are not always discarded. In the manual evaluation campaign, the sentence from Table 3 was included in the sample sentences. It would seem that many evaluators did not view this error as important, or it was ignored. It would be impossible to find the referenced Council decision based on the translations, and dates or monetary amounts might change drastically, which would not be acceptable if the translated text should be ready for publishing after translation. For gisting, where the user knows that the translation is not perfect, this may constitute less of a problem.

6 Future work

The initial alignment is based on the source to target language alignment.
In the RBMT module, it is mostly word-based, while in Moses the alignment must be recomputed due to the simplicity of the modification and because the Moses chart decoder cannot output word alignment. The modelling only handles alignment crossing one nonterminal and reduces alignment problems to these cases by assuming weak reordering.

Future work should include extracting the word alignment from the SMT system to improve source to target language alignment. The MT decoder Joshua can output complete derivations including word-based alignment, which would eliminate the need to recompute source to target language alignment, which currently produces noisy matching elements. Experiments using a different RBMT engine should also be conducted. The RBMT module does not always produce one complete tree structure for a sentence, and the reattachment algorithm handles this by adding any additional graphs to the root node of the tree structure. An RBMT engine that produces complete derivations is likely to improve the translation quality. This will require different tree extraction modules for Joshua and the RBMT engine, but otherwise the system can be reused as is.

6.1 Languages and formalisms

The chosen languages are closely related Germanic languages. While the results seem promising, the applicability of the approach should be tested on a more distant language pair, e.g. Chinese-English or Russian-English if you wish to preserve the possibility of using METEOR for evaluation, but any distant pair for which an RBMT system exists can be used, provided a tree output is available.

The implementation substitutes dependency subtrees into a hierarchical CFG-style tree. A second test of the hybridisation approach is to combine systems where the structures are not as diverse. Hierarchical systems are derived from an SCFG, so an RBMT system based on a CFG formalism, such as LUCY, could be used to test the generality of the hybridisation approach.

As the TED-R approach does not rely on markers for OOV words, an implementation where hierarchical subtrees are inserted into the RBMT output should also be conducted. The problem of inserting CFG-style subtrees into a dependency tree and generating the correct surface structure must be resolved, or a different RBMT system which produces CFG-style trees must be implemented.

The implementation of the leaves technique relies on the diversity of the tree structures, i.e. that there are element-level similarities between hierarchical leaf nodes and both terminal and nonterminal dependency nodes, and that the subtree rooted in a dependency node can be aligned to a hierarchical subtree. The refinement method would have to be altered. The relations and children techniques (Shvaiko and Euzenat, 2005) are good candidates for similar tree structures.

A change of formalism would not require alterations of the tree edit distance approach, as long as the structures are in fact tree structures.

7 Conclusion

The post-editing approach proposed in this paper combines the strengths of statistical and rule-based machine translation and improves translation quality, especially for the least accurate translations. The structural and knowledge-poor approach is novel and has not been attempted before. It exploits structural output to create hybrid translations and uses the linguistic knowledge encoded in structure and on nodes to improve the translation candidates of hierarchical phrase-based MT systems.

Automatic evaluation shows a significant increase over the baselines when training data is limited, and also improvement in TER and METEOR for lexical substitution and TED-R with an SMT system trained on the Europarl corpus.

Manual evaluation on test data shows that hybrid translations were generally ranked higher, indicating that the hybrid approach produces more accurate translations.

References
S. Banerjee and A. Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, page 65.

E. Bick. 2007. Dan2eng: Wide-coverage danish-english machine translation. Proceedings of Machine Translation Summit XI, pages 37-43.

P. Bille. 2005. A survey on tree edit distance and related problems. Theoretical computer science, 337(1-3):217-239.

C. Callison-Burch, C. Fordyce, P. Koehn, C. Monz, and J. Schroeder. 2007. (Meta-) evaluation of machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 136-158. Association for Computational Linguistics.

D. Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 263-270. Association for Computational Linguistics.

A. Eisele, C. Federmann, H. Uszkoreit, H. Saint-Amand, M. Kay, M. Jellinghaus, S. Hunsicker, T. Herrmann, and Y. Chen. 2008. Hybrid machine translation architectures within and beyond the EuroMatrix project. In Proceedings of the 12th annual conference of the European Association for Machine Translation (EAMT 2008), pages 27-34.

C. Federmann, A. Eisele, H. Uszkoreit, Y. Chen, S. Hunsicker, and J. Xu. 2010. Further experiments with shallow hybrid mt systems. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 77-81. Association for Computational Linguistics.

A.S. Hildebrand and S. Vogel. 2008. Combination of machine translation systems via hypothesis selection from combined n-best lists. In MT at work: Proceedings of the Eighth Conference of the Association for Machine Translation in the Americas, pages 254-261. Citeseer.

P. Koehn and C. Monz. 2006. Manual and automatic evaluation of machine translation between european languages. In Proceedings of the Workshop on Statistical Machine Translation, pages 102-121. Association for Computational Linguistics.

P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics.

P. Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT summit, volume 5. Citeseer.

P. Koehn. 2010. An experimental management system. The Prague Bulletin of Mathematical Linguistics, 94:87-96.

K. Papineni, S. Roukos, T. Ward, and W.J. Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311-318. Association for Computational Linguistics.

A.V.I. Rosti, N.F. Ayan, B. Xiang, S. Matsoukas, R. Schwartz, and B. Dorr. 2007. Combining outputs from multiple machine translation systems. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 228-235.

P. Shvaiko and J. Euzenat. 2005. A survey of schema-based matching approaches. Journal on Data Semantics IV, pages 146-171.

M. Simard, N. Ueffing, P. Isabelle, R. Kuhn, et al. 2007. Rule-based translation with statistical phrase-based post-editing. In ACL 2007 Second Workshop on Statistical Machine Translation.

M. Snover, B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of Association for Machine Translation in the Americas, pages 223-231. Citeseer.

A. Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proceedings of the international conference on spoken language processing, volume 2, pages 901-904. Citeseer.

Gregor Thurmair. 2009. Comparing different architectures of Hybrid Machine Translation systems. In Proceedings of the MT Summit XII, pages 340-347.

K. Zhang and D. Shasha. 1989. Simple fast algorithms for the editing distance between trees and related problems. SIAM J. Comput., 18(6):1245-1262.
Were the clocks striking or surprising? Using WSD to improve MT performance

Špela Vintar, Darja Fišer, Aljoša Vrščaj
University of Ljubljana, Dept. of Translation Studies
Aškerčeva 2, SI-1000 Ljubljana
[email protected], [email protected], [email protected]

Abstract

We report on a series of experiments aimed at improving the machine translation of ambiguous lexical items by using wordnet-based unsupervised Word Sense Disambiguation (WSD) and comparing its results to three MT systems. Our experiments are performed for the English-Slovene language pair using UKB, a freely available graph-based word sense disambiguation system. Results are evaluated in three ways: a manual evaluation of WSD performance from the MT perspective, an analysis of agreement between the WSD-proposed equivalent and those suggested by the three systems, and finally by computing BLEU, NIST and METEOR scores for all translation versions. Our results show that WSD performs with an MT-relevant precision of 71% and that 21% of sense-related MT errors could be prevented by using unsupervised WSD.

1 Introduction

Ambiguity continues to be a tough nut to crack in MT. In most known languages certain lexical items can refer to more than a single concept, meaning that MT systems need to choose between several translation equivalents representing different senses of the source word. Wrong choices often result in grave translation errors, as words often refer to several completely unrelated concepts. The adjective striking can mean beautiful or surprising, delivering a hard blow, or indicating a certain time, and the noun "course" can be something we give, take, teach or eat.

Our aim was to assess the performance of three MT systems for the English-Slovene language pair and to see whether wordnet-based Word Sense Disambiguation (WSD) could improve performance and assist in avoiding grave sense-related translation errors.

For WSD we use UKB (Agirre and Soroa 2009), a graph-based algorithm that uses wordnet (Fellbaum 1998) and computes the probability of each sense of a polysemous word by taking into account the senses of context words. In our experiment we use Orwell's notorious novel 1984 as the source and its translation into Slovene by Alenka Puhar as the reference translation. We then disambiguate the English source with UKB, assign each disambiguated English word a Slovene equivalent from sloWNet (Fišer 2009) and compare these with the equivalents proposed by Google, Bing and Presis. Results are evaluated in several ways:

• by manually evaluating WSD performance from the MT perspective,
• by analysing the agreement between each of the MT systems and the UKB/wordnet-derived translation,
• by comparing BLEU, NIST and METEOR scores achieved with each translation version.

Our results show that the ad hoc WSD strategies used by the evaluated MT systems can definitely be improved by a proper WSD algorithm, but also that wordnet is not the ideal semantic resource to help resolve translation dilemmas, mainly due to its fine sense granularity.

2 Word Sense Disambiguation and Machine Translation

Wordnet-based approaches to improving MT have been successfully employed by numerous authors, on the one hand as a semantic resource to help resolve ambiguity, and on the other hand as a rich source of domain-specific translation equivalents. As early as 1993 (Knight 1993), wordnet was used as the lower ontology within

the PANGLOSS MT system. Yuseop et al. (2002) have employed LSA and the semantic similarity of wordnet literals to translate collocations, while Salam et al. (2009) used wordnet for disambiguation and the choice of the correct translation equivalent in an English to Bengali SMT system.

WSD for machine translation purposes slightly differs from traditional WSD, because distinct source language senses which share the same translation equivalent need not be differentiated (Vickrey et al. 2005). This phenomenon is known as parallel ambiguities and is particularly common among related languages (Resnik and Yarowsky 2000). Although early experiments failed to provide convincing proof that WSD can improve SMT, Carpuat and Wu (2007), Chan et al. (2007) and Ali et al. (2009) clearly demonstrate that incorporating a word sense disambiguation system on the lexical level brings significant improvement according to all common MT evaluation metrics.

Still, using wordnet as the source of sense inventories has been heavily criticized, not just in the context of MT (Apidianaki 2009) but also within other language processing tasks. The most notorious arguments against wordnet are its high granularity and, as a consequence, the high similarity between some senses, but its global availability and universality seem to be advantages that prevail in many cases (Edmonds and Kilgarriff 2002).

Our experiments lie somewhat in between; on the one hand we demonstrate the potential of WSD in MT, especially for cases where different MT systems disagree, and on the other hand we attribute most WSD errors to the inadequacy of the sense splitting in wordnet (see Discussion).

3 Experimental setup

3.1 Corpus and MT systems

Our corpus consists of George Orwell's novel 1984, first published in English in 1949, and its translation into Slovene by Alenka Puhar, first published in 1967. While it may seem unusual to be using a work of fiction for the assessment of MT systems, literary language is usually richer in ambiguity and thus provides a more complex semantic space than non-fiction.

We translated the entire novel into Slovene with Google Translate1, Bing2 and Presis3, the first two belonging to the family of freely available statistical systems and the latter being a rule-based MT system developed by the Slovenian company Amebis.

For the purposes of further analysis and comparison with our disambiguated corpus, all texts - original and translations - have been PoS-tagged and lemmatized using the JOS web service (Erjavec et al. 2010) for Slovene and ToTaLe (Erjavec et al. 2005) for English. Because we can only disambiguate content words, we retained only nouns, verbs, adjectives and adverbs and discarded the rest. After all these preprocessing steps our texts end up looking as follows:

    English:
    It was a bright cold day in April and the clocks were striking thirteen.
    English-preprocessed:
    be bright cold day April clock be strike
    Slovene-reference:
    Bil je jasen, mrzel aprilski dan in ure so bile trinajst.
    Slovene-reference-preprocessed:
    biti biti jasen mrzel aprilski dan ura biti biti
    Slovene-Google:
    Bilo je svetlo mrzel dan v aprilu, in ure so bile trinajst presenetljiv.
    Slovene-Google-preprocessed:
    biti biti svetlo mrzel dan april ura biti biti presenetljiv
    Slovene-Bing:
    Je bil svetlo hladne dan aprila in v ure so bili presenetljivo trinajst.
    Slovene-Bing-preprocessed:
    biti biti svetlo hladen dan april ura biti biti presenetljivo
    Slovene-Presis:
    Svetal hladen dan v aprilu je bilin so ure udarjale trinajst.
    Slovene-Presis-preprocessed:
    svetel hladen dan april biti bilin biti ura udarjati

Figure 1. Corpus preprocessing
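The stripping step illustrated in Figure 1 could be sketched as follows; the tag names are hypothetical placeholders (the real tags come from ToTaLe and the JOS web service):

    # Keep only content words (nouns, verbs, adjectives, adverbs),
    # replacing each by its lemma, as in the preprocessed lines above.
    CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}

    def strip_to_content(tagged):
        """tagged: list of (form, lemma, pos) triples for one sentence."""
        return " ".join(lemma for _, lemma, pos in tagged if pos in CONTENT_POS)

    sentence = [
        ("It", "it", "PRON"), ("was", "be", "VERB"), ("a", "a", "DET"),
        ("bright", "bright", "ADJ"), ("cold", "cold", "ADJ"),
        ("day", "day", "NOUN"), ("in", "in", "ADP"), ("April", "April", "NOUN"),
    ]
    print(strip_to_content(sentence))  # be bright cold day April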
The aim of semantic annotation and disambig- uation is to identify polysemous lexical items in 3 Experimental setup the English text and assign them the correct 3.1 Corpus and MT systems sense in accordance with the context. Once the sense of the word has been determined, we can Our corpus consists of George Orwell's novel exploit the cross-language links between word- 1984, first published in English in 1949, and its nets of different languages and propose a Slo- translation into Slovene by Alenka Puhar, first vene translation equivalent from the Slovene published in 1967. While it may seem unusual to wordnet. be using a work of fiction for the assessment of We disambiguated the English corpus with MT systems, literary language is usually richer in UKB, which utilizes the relations between ambiguity and thus provides a more complex synsets and constructs semantic graphs for each semantic space than non-fiction. candidate sense of the word. The algorithm then We translated the entire novel into Slovene with Google Translate1, Bing2 and Presis3, the first 2 http://www.microsofttranslator.com/ (available for Slo- vene since 2010) 1 http://translate.google.com (translation from and into 3 http://presis.amebis.si (available for English-Slovene since Slovene has been available as of September 2008) 2002)
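Although the actual tagging and lemmatization are done by the external ToTaLe and JOS services, the stripping step itself reduces to a small filter. The sketch below is illustrative only: the triple format and the tag prefixes are assumptions for the example, not the tools' real output conventions.

```python
# A minimal sketch of the stripping in Figure 1: given externally tagged and
# lemmatized tokens, keep only the lemmas of content words. Tag prefixes here
# are illustrative assumptions, not the papers' actual tagsets.

CONTENT_PREFIXES = ("N", "V", "A", "R")  # noun, verb, adjective, adverb

def strip_to_content_lemmas(tagged_tokens):
    """tagged_tokens: list of (surface, lemma, tag) triples."""
    return [lemma for _, lemma, tag in tagged_tokens
            if tag.startswith(CONTENT_PREFIXES)]

# The English sentence from Figure 1 (tags illustrative only):
tagged = [("It", "it", "P"), ("was", "be", "V"), ("a", "a", "D"),
          ("bright", "bright", "A"), ("cold", "cold", "A"), ("day", "day", "N"),
          ("in", "in", "S"), ("April", "April", "N"), ("and", "and", "C"),
          ("the", "the", "D"), ("clocks", "clock", "N"), ("were", "be", "V"),
          ("striking", "strike", "V"), ("thirteen", "thirteen", "M")]
print(" ".join(strip_to_content_lemmas(tagged)))
# -> "be bright cold day April clock be strike"
```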

3.2 Disambiguation with UKB and wordnet

The aim of semantic annotation and disambiguation is to identify polysemous lexical items in the English text and assign them the correct sense in accordance with the context. Once the sense of a word has been determined, we can exploit the cross-language links between wordnets of different languages and propose a Slovene translation equivalent from the Slovene wordnet.

We disambiguated the English corpus with UKB, which utilizes the relations between synsets and constructs semantic graphs for each candidate sense of a word. The algorithm then computes the probability of each graph based on the number and weight of edges between the nodes representing semantic concepts. Disambiguation is performed in a monolingual context for single- and multiword nouns, verbs, adjectives and adverbs, provided they are included in the English wordnet.

Figure 2 shows the result of the disambiguation algorithm for the word face, which has as many as 13 possible senses in wordnet. We are given the probability of each sense in the given context (e.g. 0.173463) and the ID of the synset (e.g. eng-30-05600637-n), and for the purposes of clarity we also added the literals (words) associated with this particular synset ID in the English (face, human face) and Slovene (fris, obraz, faca) wordnets respectively. As can be seen from this example, wordnet is in most cases a very fine-grained sense inventory, and looking at the Slovene equivalents clearly shows that many of these senses may partly or entirely overlap, at least in the context of translation.

WSD: ctx_Oen.1.1.2 24 !! face
• W: 0.173463 ID: eng-30-05600637-n ENGWN: face, human face, (the front of the human head from the forehead to the chin and ear to ear) SLOWN: fris, obraz, faca, človeški obraz, (EMPTYDEF)
• W: 0.116604 ID: eng-30-08510666-n ENGWN: side, face, (a surface forming part of the outside of an object) SLOWN: stranica, ploskev, (EMPTYDEF)
• W: 0.0956895 ID: eng-30-03313602-n ENGWN: face, (the side upon which the use of a thing depends (usually the most prominent surface of an object)) SLOWN: sprednja stran, prava stran, zgornja stran, lice, (EMPTYDEF)
• W: 0.0761554 ID: eng-30-04679738-n ENGWN: expression, look, aspect, facial expression, face, (the feelings expressed on a person's face) SLOWN: izraz, pogled, obraz, izraz na obrazu, (EMPTYDEF)
• W: 0.0709513 ID: eng-30-03313456-n ENGWN: face, (a vertical surface of a building or cliff) SLOWN: stena, fasada, (EMPTYDEF)
• W: 0.0653514 ID: eng-30-06825399-n ENGWN: font, fount, typeface, face, case, (a specific size and style of type within a type family) SLOWN: font, pisava, črkovna družina, vrsta črk, črkovna podoba, črkovni slog, (EMPTYDEF)
• W: 0.0629878 ID: eng-30-04838210-n ENGWN: boldness, nerve, brass, face, cheek, (impudent aggressiveness) SLOWN: predrznost, nesramnost, (EMPTYDEF)
• W: 0.0610286 ID: eng-30-06877578-n ENGWN: grimace, face, (a contorted facial expression) SLOWN: spaka, grimasa, (EMPTYDEF)
• W: 0.0605221 ID: eng-30-03313873-n ENGWN: face, (the striking or working surface of an implement) SLOWN: čelo, podplat, udarna površina, (EMPTYDEF)
• W: 0.0579952 ID: eng-30-05601198-n ENGWN: face, (the part of an animal corresponding to the human face) SLOWN: obraz, (EMPTYDEF)
• W: 0.0535548 ID: eng-30-05168795-n ENGWN: face, (status in the eyes of others) SLOWN: ugled, dobro ime, (EMPTYDEF)
• W: 0.05303 ID: eng-30-09618957-n ENGWN: face, (a part of a person that is used to refer to a person) SLOWN: obraz, (EMPTYDEF)
• W: 0.0526668 ID: eng-30-04679419-n ENGWN: face, (the general outward appearance of something) SLOWN: podoba, (EMPTYDEF)

Figure 2. Disambiguation result for the word face with probabilities for each of its thirteen senses

As can be seen in Table 1, almost half of all the tokens in the corpus are considered to be ambiguous according to the English wordnet. Since the Slovene wordnet is considerably smaller than the English one, almost half of the different ambiguous words occurring in our corpus have no equivalent in sloWNet. This could affect the results of our experiment, because we cannot evaluate the potential benefit of WSD if we cannot compare the translation equivalent from sloWNet with the solutions proposed by the different MT systems. We therefore restricted ourselves to the words and sentences for which an equivalent exists in sloWNet.

Corpus size in tokens:                       103,769
Corpus size in types:                         10,982
Ambiguous tokens:                             48,632
Ambiguous types:                               7,627
Synsets with no equivalent in sloWNet:         3,192

Table 1. Corpus size and number of ambiguous words

One method of evaluating the performance of WSD in the context of machine translation is through metrics for automatic evaluation (BLEU, NIST, METEOR etc.). We thus generated our own translation version, in fact a stripped version similar to those in Figure 1, consisting only of content words in their lemmatized form. We translated the disambiguated words with wordnet, exploiting the cross-language universality of the synset ID. However, since we can only propose translation equivalents for the words which are included in wordnet, we had to come up with a translation solution for those which were not. Such words include proper names (Winston, Smith, London, Oceania), hyphenated compounds (pig-iron, lift-shaft, gorilla-faced) and Orwellian neologisms (Minipax, Newspeak, thoughtcrime). We translated these words with three alternative methods:

• using a general bilingual dictionary,
• using the English-Slovene Wikipedia and Wiktionary,
• using the automatically constructed bilingual lexicon from the English-Slovene parallel Orwell corpus.

The fourth option was to leave them untranslated and simply add them to the generated Slovene version.
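The step of exploiting the cross-language universality of the synset ID can be pictured with a short sketch. The data structures below are illustrative stand-ins for UKB's ranked output and for sloWNet access, not the actual interfaces; following the strict comparison used later in Section 4.2, only the first literal of the chosen synset is returned.

```python
# A hedged sketch of the transfer step: because sloWNet shares the Princeton
# synset IDs, the best-ranked sense with a Slovene synset can be mapped to a
# translation equivalent by a direct lookup.

def best_slovene_equivalent(ranked_senses, slownet):
    """ranked_senses: list of (weight, synset_id); slownet: id -> literals."""
    for weight, synset_id in sorted(ranked_senses, reverse=True):
        literals = slownet.get(synset_id)
        if literals:              # skip senses with no sloWNet equivalent
            return literals[0]    # first literal only, as in the strict setup
    return None                   # word has to be handled by the fallbacks

# Toy data following Figure 2 for "face":
ranked = [(0.173463, "eng-30-05600637-n"), (0.116604, "eng-30-08510666-n")]
slownet = {"eng-30-05600637-n": ["fris", "obraz", "faca"]}
print(best_slovene_equivalent(ranked, slownet))   # -> "fris"
```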

4 Evaluation

The number of meanings a word can have, the degree of translation equivalence or the quality of the target text are all extremely disputable and vague notions. For this reason we wished to evaluate our results from as many angles as possible, both manually and automatically.

4.1 Manual evaluation of WSD precision in the context of MT

Firstly, we were interested in the performance of the UKB disambiguation tool in the context of MT. Since UKB uses wordnet as a sense inventory, the algorithm assigns a probability to each sense of a lexical item according to its context in an unsupervised way. The precision of UKB for unsupervised WSD is reported at around 58% for all words and around 72% for nouns, but of course these figures measure the number of cases where the algorithm selected the correct wordnet synset from a relatively fine-grained network of possible senses (Agirre and Soroa 2009).

We adjusted the evaluation task to an MT scenario by manually checking 200 disambiguated words and their suggested translation equivalents; if the equivalent was acceptable we counted it among the positive instances regardless of the selected sense. For example, the English word breast has four senses in wordnet: (1) the upper frontal part of a human chest, (2) one of the two soft milk-secreting glands of a woman, (3) meat carved from the breast of a fowl and (4) the upper front part of an animal corresponding to the human chest. For the English sentence Winston nuzzled his chin into his breast... UKB suggested the second sense, which is clearly wrong, but since the ambiguity is preserved in Slovene and the word prsi can be used for all four meanings, we consider this a case of successful disambiguation for the purposes of MT.

The precision of WSD using this relaxed criterion was 71%, with 6% so-called borderline cases. These include cases where the equivalent was semantically correct but had the wrong part of speech (e.g. glass door -> *steklo instead of steklen).

Translation equivalent   correct      incorrect    borderline
Number / %               142 (71%)    46 (23%)     12 (6%)

Table 2. Manual evaluation of WSD performance for MT

4.2 Agreement between each of the MT systems and the disambiguated equivalent

It is interesting to compare the equivalents we propose through our wordnet-based WSD procedure with those suggested by the three MT systems: Presis, Google and Bing.

Total no. of disambiguated tokens        13,737
WSD = reference                           3,933
WSD = Presis                              4,290
WSD = Google                              4,464
WSD = Bing                                4,377
WSD = ref = Presis = Google = Bing        2,681
WSD = ref ≠ Presis ≠ Google ≠ Bing          269

Table 3. Comparison of the WSD/wordnet-based equivalent and the translations proposed by Presis, Google, Bing and the reference translation

The comparison was strict in the sense that we only took into account the first Slovene equivalent proposed within the same synset. Of the over 48k ambiguous tokens we obviously considered only those which had an equivalent in sloWNet, otherwise comparison with the MT systems would have been impossible. We can see from Table 3 that the WSD/wordnet-based equivalents most often agree with the Google translation, and that for approximately every fifth ambiguous word all systems agree with each other and with the reference translation.

If we also look at the number of cases where our WSD/wordnet-based equivalent is the only one to agree with the reference translation, it is safe to assume that these are the cases where WSD could clearly improve MT. Of all the instances where WSD agrees with the reference translation we can subtract the instances where all systems agree, because these need no improvement. Of the remaining 1,252 ambiguous words, 269 or 20% were such that only the WSD/wordnet equivalent corresponded to the reference translation.
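Once the equivalents of all systems are aligned to the same disambiguated tokens, the counts in Table 3 amount to simple per-token comparisons. A minimal sketch, assuming such aligned rows are already available (the alignment itself is the nontrivial part and is not shown):

```python
# Illustrative tallying of the agreement figures behind Table 3.
from collections import Counter

def agreement_counts(rows):
    """rows: dicts with keys 'wsd', 'ref', 'presis', 'google', 'bing'."""
    c = Counter()
    for r in rows:
        for system in ("ref", "presis", "google", "bing"):
            if r["wsd"] == r[system]:
                c["WSD = " + system] += 1
        others = (r["presis"], r["google"], r["bing"])
        if all(r["wsd"] == v for v in (r["ref"],) + others):
            c["all systems agree"] += 1
        if r["wsd"] == r["ref"] and all(r["wsd"] != v for v in others):
            c["only WSD matches reference"] += 1
    return c
```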

4.3 Evaluation with metrics

Finally, we wanted to see how the WSD/wordnet-based translation compares with the three MT systems using the BLEU, NIST and METEOR scores. For the purposes of this comparison we pre-processed all five versions of our corpus - original, reference translation, Presis, Google and Bing translation - by lemmatization, removal of all function words, removal of sentences where the alignment was not 1:1, and finally by removal of the sentences which contained lexical items for which there was no equivalent in sloWNet.

We then generated the sixth version by translating all ambiguous words with sloWNet (see Section 3), and for the words not included in the English wordnet we used four alternative translation strategies: a general bilingual dictionary (dict), Wiktionary (wikt), a word-alignment lexicon (align) and amending untranslated words to the target language version (amend).

              BLEU (n=1)   NIST    METEOR
Bing          0.506        3.594   0.455
Google        0.579        4.230   0.481
Presis        0.485        3.333   0.453
WSD           0.440        3.258   0.429
WSD-amend     0.410        3.308   0.430
WSD-dict      0.405        3.250   0.427
WSD-align     0.448        3.588   0.434
WSD-wikt      0.442        3.326   0.429

Table 4. Evaluation with metrics

Table 4 shows the results of the automatic evaluation; the corpus consisted of 2,428 segments. We can see that our generated version using disambiguated equivalents does not outperform any of the MT systems on any metric, except once, when the WSD-align version outperforms Presis on the NIST score and comes fairly close to the Bing score.

It is possible that the improvement we are trying to achieve is difficult to measure with these metrics because our method operates on the level of single words, while the metrics typically evaluate entire sentences and corpora. We are using a stripped version of the corpus, i.e. only content words which can potentially be ambiguous, whereas the metrics are normally used to calculate the similarity between two versions of running text. Finally, the corpus we are using for automatic evaluation is very small.
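For concreteness, the unigram BLEU reported in Table 4 can be sketched as follows. Standard scoring tools were presumably used for the paper's figures; this hand-rolled version (clipped unigram precision times the usual brevity penalty) is for illustration only and operates directly on the stripped, lemmatized segments described above.

```python
import math
from collections import Counter

def bleu1(references, hypotheses):
    """Corpus-level BLEU with n=1: clipped unigram precision * brevity penalty."""
    match, hyp_len, ref_len = 0, 0, 0
    for ref, hyp in zip(references, hypotheses):
        ref_counts, hyp_counts = Counter(ref), Counter(hyp)
        match += sum(min(n, ref_counts[w]) for w, n in hyp_counts.items())
        hyp_len += len(hyp)
        ref_len += len(ref)
    precision = match / hyp_len
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * precision

refs = [["biti", "jasen", "mrzel", "aprilski", "dan", "ura"]]
hyps = [["biti", "svetlo", "mrzel", "dan", "april", "ura"]]
print(round(bleu1(refs, hyps), 3))   # -> 0.667 on this toy pair
```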
5 Discussion

Although employing WSD and comparing wordnet-based translation equivalents to those proposed by MT systems scored no significant improvement with standard MT evaluation metrics, we remain convinced that the other two evaluation methods show the potential of using WSD, particularly with truly ambiguous words and not those where sense distinctions are slight or vague. A manual inspection of the examples where the MT systems disagreed and our WSD-based equivalent was the only one to agree with the reference translation shows that these are indeed examples of grave MT errors. For example, the word hand in the sentence The clock's hands said six meaning eighteen can only be translated correctly with a proper WSD strategy and was indeed mistranslated as roka (body part) by all three systems. If a relatively simplistic and unsupervised technique such as the one we propose can prevent 20% of these mistakes, it is certainly worth employing at least as a post-processing step.

The fact that we explore the impact of WSD on a work of fiction rather than domain-specific texts may also play a role in the results we obtained, although it is not entirely clear in what way. We believe that in general there is more ambiguity in literary texts, meaning that a single word will appear in a wider range of senses in a work of fiction than it would in a domain-specific corpus. This might mean that WSD for literary texts is more difficult; however, our own experiments so far show no significant difference in WSD performance.

A look at the cases where WSD goes wrong shows that these are typically words with a high number of senses which are difficult to differentiate even for a human. The question from the title of this paper is actually a translation blunder made by both Google and Bing, since striking was interpreted in its more expressive sense and translated into Slovene as presenetljiv [surprising]. However, UKB also got it wrong and chose the sense defined as deliver a sharp blow, as with the hand, fist, or weapon instead of indicate a certain time by striking. While these meanings may seem quite easy to tell apart, especially if the preceding word in a sentence is clock, strike as a verb has as many as 20 senses in Princeton WordNet, and many of these seem very similar. In this case the Slovene translation we propose is "less wrong" than the surprising solution offered by Google or Bing, because udarjati may actually be used in the clock sense as well.

We might also assume that statistical MT systems will perform worse on fiction; the results in Table 4 show that both statistical systems outperform the rule-based Presis. Then again, Orwell's 1984 has been freely available as a parallel corpus for a very long time and it is therefore possible that both Google and Bing have used it as training data for their SMT models.

6 Conclusion

We described an experiment in which we explore the potential of WSD to improve the machine translation of ambiguous words for the English-Slovene language pair. We utilized the output of UKB, a graph-based WSD tool using wordnet, to select the appropriate equivalent from sloWNet. Manual evaluation showed that the correct equivalent was proposed in 71% of the cases. We then compared these equivalents with the output of three MT systems. While the benefit of WSD could not be proven with the BLEU, NIST and METEOR scores, the correspondence of the WSD/wordnet-based equivalent with the reference translation was high. Furthermore, it appears that in cases where MT systems disagree WSD can help choose the correct equivalent.

As future work we plan to redesign the experiment so as to directly use WSD as a post-processing step to machine translation instead of generating our own stripped translation version. This would provide better comparison grounds. In order to improve WSD precision we intend to combine two different algorithms and use the result only in cases where both agree. Also, we intend to experiment with different text types and context lengths to be able to evaluate WSD performance in the context of MT on a larger scale.

References

Eneko Agirre and Aitor Soroa. 2009. Personalizing PageRank for Word Sense Disambiguation. Proceedings of the European Association of Computational Linguistics conference (EACL09).

Ola Mohammad Ali, Mahmoud Gad Alla and Mohammad Said Abdelwahab. 2009. Improving machine translation using hybrid dictionary-graph based word sense disambiguation with semantic and statistical methods. International Journal of Computer and Electrical Engineering, 1/5.

Marianna Apidianaki. 2009. Data-driven semantic analysis for multilingual WSD and lexical selection in translation. Proceedings of the 12th Conference of the European Chapter of the ACL, pages 77-85, Athens, Greece. Association for Computational Linguistics.

Marine Carpuat and Dekai Wu. 2007. Improving statistical machine translation using word sense disambiguation. Proceedings of Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

Yee Seng Chan, Hwee Tou Ng and David Chiang. 2007. Word Sense Disambiguation Improves Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, Prague, Czech Republic, pages 33-40.

Philip Edmonds and Adam Kilgarriff. 2002. Introduction to the Special Issue on Evaluating Word Sense Disambiguation Systems. Natural Language Engineering 8(4): 279-291.

Tomaž Erjavec, Darja Fišer, Simon Krek and Nina Ledinek. 2010. The JOS Linguistically Tagged Corpus of Slovene. Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC'10), Malta.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.

Darja Fišer. 2009. Leveraging Parallel Corpora and Existing Wordnets for Automatic Construction of the Slovene Wordnet. Human Language Technology: Challenges of the Information Society (LNCS 5603). Berlin; Heidelberg: Springer, pages 359-368.

Kevin Knight. 1993. Building a large ontology for machine translation. Proceedings of the ARPA Human Language Technology Workshop, Plainsboro, New Jersey.

Philip Resnik and David Yarowsky. 2000. Distinguishing Systems and Distinguishing Senses: New Evaluation Methods for Word Sense Disambiguation. Natural Language Engineering, 5(2): 113-133.

Khan Md. Anwarus Salam, Mumit Khan and Tetsuro Nishino. 2009. Example based English-Bengali machine translation using wordnet. Proceedings of TriSA'09, Japan.

David Vickrey, Luke Biewald, Marc Teyssier and Daphne Koller. 2005. Word-Sense Disambiguation for Machine Translation. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Kim Yuseop, Jeong-Ho Chang and Byoung-Tak Zhang. 2002. Target Word Selection Using WordNet and Data-Driven Models in Machine Translation. Proceedings of the Conference PRICAI'02: Trends in Artificial Intelligence.

Bootstrapping Method for Chunk Alignment in Phrase Based SMT

Santanu Pal and Sivaji Bandyopadhyay
Department of Computer Science and Engineering, Jadavpur University
[email protected], [email protected]

Abstract

The processing of the parallel corpus plays a very crucial role in improving the overall performance of Phrase Based Statistical Machine Translation (PB-SMT) systems. In this paper the automatic alignment of different kinds of chunks has been studied, which boosts the word alignment as well as the machine translation quality. Single-tokenization of noun-noun MWEs, phrasal prepositions (source side only) and reduplicated phrases (target side only) and the alignment of named entities and complex predicates provide the best SMT model for bootstrapping. Automatic bootstrapping on the alignment of various chunks makes significant gains over the previous best English-Bengali PB-SMT system. The source chunks are translated into the target language using the PB-SMT system and the translated chunks are compared with the original target chunks. The aligned chunks increase the size of the parallel corpus. The process is run in a bootstrapping manner until all the source chunks have been aligned with the target chunks or no new chunk alignment is identified by the bootstrapping process. The proposed system achieves significant improvements (2.25 BLEU points over the best system and 8.63 BLEU points absolute over the baseline system, a 98.74% relative improvement over the baseline) on an English-Bengali translation task.

1 Introduction

The objective of the present research work is to analyze the effects of chunk alignment in an English-Bengali parallel corpus in a Phrase Based Statistical Machine Translation system. The initial sentence-level aligned English-Bengali corpus is cleaned and filtered using a semi-automatic process. More effective chunk-level alignments are carried out by bootstrapping on the training corpus of the PB-SMT system.

The objective in the present task is to align the chunks in a bootstrapping manner using a single-tokenized, MWE-aligned SMT model and then to modify the model by inserting the aligned chunks into the parallel corpus after each iteration of the bootstrapping process, thereby enhancing the performance of the SMT system. In turn, this method deals with the many-to-many word alignments in the parallel corpus. Several types of MWEs like phrasal prepositions and verb-object combinations are automatically identified on the source side, while named entities and complex predicates are identified on both sides of the parallel corpus. On the target side only, identification of noun-noun MWEs and reduplicated phrases is carried out. Simple rule-based and statistical approaches have been used to identify these MWEs. The parallel corpus is modified by considering the MWEs as single tokens. Source and target language NEs are aligned using a statistical transliteration technique. These automatically aligned NEs and complex predicates are treated as translation examples, i.e., as additional entries in the phrase table (Pal et al. 2010, 2011). Using this augmented phrase table, each individual source chunk is translated into a target chunk and then validated against the target chunks on the target side. The validated source-target chunks are considered as further parallel examples, which in effect are instances of atomic translation pairs added to the parallel corpus.
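The bootstrapping procedure just outlined can be summarised in a short sketch. All names below are illustrative placeholders for the heavy components (PB-SMT training, chunk translation, validation), not code from the actual system.

```python
def bootstrap_chunk_alignment(corpus, src_chunks, train, translate, validate):
    """Grow `corpus` with validated chunk pairs until nothing new aligns."""
    unaligned = list(src_chunks)
    while unaligned:
        model = train(corpus)               # retrain PB-SMT on the grown corpus
        aligned, still_unaligned = [], []
        for chunk in unaligned:
            candidates = translate(model, chunk)   # chunk and its variants
            target = validate(chunk, candidates)   # match against target chunks
            if target:
                aligned.append((chunk, target))
            else:
                still_unaligned.append(chunk)
        if not aligned:                     # no new chunk alignment: stop
            break
        corpus.extend(aligned)              # aligned pairs become new examples
        unaligned = still_unaligned
    return corpus
```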

This is a well-known practice in domain adaptation in SMT (Eck et al., 2004; Wu et al., 2008). The preprocessing of the parallel corpus results in improved MT quality in terms of automatic MT evaluation metrics.

The remainder of the paper is organized as follows. Section 2 briefly elaborates on the related work. The PB-SMT system is described in Section 3. The resources used in the present work are described in Section 4. The various experiments carried out and the corresponding evaluation results are reported in Section 5. The conclusions are drawn in Section 6, along with a roadmap for future work.

2 Related work

A multilingual filtering algorithm generates bilingual chunk alignments from a Chinese-English parallel corpus (Zhou et al., 2004). The algorithm has three steps: first, the most frequent bilingual chunks are extracted from the parallel corpus; second, a clustering algorithm is used for combining the chunks which participate in alignment; and finally one English chunk is generated corresponding to a Chinese chunk by analyzing the highest co-occurrences of English chunks. Bilingual knowledge can also be extracted using chunk alignment (Huang et al., 2004). The alignment strategies include the comparison of dependency relations between source and target sentences. The dependency-related candidates are then compared with the bilingual dictionary, and finally the chunk is aligned using the extracted dependency-related words. Ma et al. (2007) simplified the task of automatic word alignment in which several consecutive words together correspond to a single word in the opposite language by using the word aligner itself, i.e., by bootstrapping on its output. Zhu and Chang (2008) extracted a dictionary from the aligned corpus, used the dictionary to re-align the corpus and then extracted a new dictionary from the new alignment result. The process goes on until a threshold is reached.

An automatic extraction of bilingual MWEs was carried out by Ren et al. (2009), using a log-likelihood-ratio-based hierarchical reducing algorithm to investigate the usefulness of bilingual MWEs in SMT by integrating bilingual MWEs into the Moses decoder (Koehn et al., 2007). The system observed the highest improvement with an additional feature that identifies whether or not a bilingual phrase contains bilingual MWEs. This approach was generalized in Carpuat and Diab (2010), where the binary feature is replaced by a count feature representing the number of MWEs in the source language phrase.

MWEs on the source and the target sides should both be aligned in the parallel corpus and translated as a whole. However, in state-of-the-art PB-SMT systems, the constituents of an MWE are marked and aligned as parts of consecutive phrases, since PB-SMT (or any other approach to SMT) does not generally treat MWEs as special tokens. Another problem with SMT systems is the wrong translation of some phrases; sometimes some phrases are not found in the output sentence at all. Moreover, the source and target phrases are mostly many-to-many, particularly so for the English-Bengali language pair. The main objective of the present work is to see whether prior automatic alignment of chunks can bring any improvement in the overall performance of the MT system.

3 PB-SMT System Description

The system follows three steps: the first step prepares an SMT system with improved word alignment that produces the best SMT model for bootstrapping; the second step produces a chunk-level parallel corpus using this best SMT model. The chunk-level parallel corpus is added to the training corpus to generate the new SMT model in the first iteration. Finally, the whole process repeats to achieve better chunk-level alignments as well as a better SMT model.

3.1 SMT System with improved Word Alignment

The initial English-Bengali parallel corpus is cleaned and filtered using a semi-automatic process. Complex predicates are first extracted on both sides of the parallel corpus. The analysis and identification of various complex predicates like compound verbs (Verb + Verb), conjunct verbs (Noun/Adjective/Adverb + Verb) and serial verbs (Verb + Verb + Verb) in Bengali are done following the strategy in Das et al. (2010). Named entities and complex predicates are aligned following a similar technique as reported in Pal et al. (2011). Reduplicated phrases do not occur very frequently in the English corpus; some of them (like correlatives and semantic reduplications) are not found in English at all (Chakraborty and Bandyopadhyay, 2010). But reduplication plays a crucial role on the target Bengali side, where it occurs with high frequency. These reduplicated phrases are considered as a single token so that they may map to a single word on the source side. Phrasal prepositions and verb-object combinations are also treated as single tokens. Once the compound verbs and the NEs are identified on both sides of the parallel corpus, they are converted into single tokens. When converting these MWEs into single tokens, the spaces are replaced with underscores ('_'). Since there are already some hyphenated words in the corpus, hyphenation is not used for this purpose. Besides, the use of a special word separator (underscore in this case) facilitates the job of deciding which single-token MWEs are to be de-tokenized into their constituent words before evaluation, as sketched below.

94 and Bandyopadhyay, 2010). But reduplication and the candidates having scores above the plays a crucial role on the target Bengali side as threshold value have been considered as MWEs. they occur with high frequency. These redupli- The repetition of noun, pronoun, adjective and cated phrases are considered as a single-token so verb are generally classified as two categories: that they may map to a single word on the source repetition at the (a) expression level and at the side. Phrasal prepositions and verb object combi- (b) contents or semantic level. In case of Bengali, nations are also treated as single tokens. Once the The expression-level reduplication are classified compound verbs and the NEs are identified on into five fine-grained subcategories: (i) Ono- both sides of the parallel corpus, they are assem- matopoeic expressions (khat khat, knock knock), bled into single tokens. When converting these (ii) Complete Reduplication (bara-bara, big big), MWEs into single tokens, the spaces are replaced (iii) Partial Reduplication (thakur-thukur, God), with underscores (‘_’). Since there are already (iv) Semantic Reduplication (matha-mundu, some hyphenated words in the corpus, hyphena- head) and (v) Correlative Reduplication tion is not used for this purpose. Besides, the use (maramari, fighting). of a special word separator (underscore in this For identifying reduplications, simple rules case) facilitates the job of deciding which single- and morphological properties at lexical level token MWEs to be de-tokenized into its constitu- have been used (Chakraborty and Bandyop- ent words, before evaluation. adhyay, 2010). The Bengali monolingual dic- tionary has been used for identification of seman- 3.1.1 MWE Identification on Source Side tic reduplications. The UCREL1 Semantic analysis System An NE and Complex Predicates parallel cor- (USAS) developed by Lancaster University pus is created by extracting the source and the (Rayson.et al, 2004) has been adopted for MWE target (single token) NEs from the NE-tagged identification. The USAS is a software tool for parallel corpus and aligning the NEs using the the automatic semantic analysis of English spo- strategies as applied in (Pal.et al, 2010, 2011). ken and written data. Various types of Multi- Word Units (MWU) that are identified by the 3.1.3 Verb Chunk / Complex Predicate Alignment USAS software include: verb-object combina- tions (e.g. stubbed out), noun phrases (e.g. riding Initially, it is assumed that all the members of the boots), proper names (e.g. United States of English verb chunk in an aligned sentence pair America), true idioms (e.g. living the life of Ri- are aligned with the members of the Bengali ley) etc. In English, Noun-Noun (NN) com- complex predicates. Verb chunks are aligned pounds, i.e., noun phrases occur with high fre- using a statistical aligner. A pattern generator quency and high lexical and semantic variability extracts patterns from the source and the target (Tanaka.et al, 2003). The USAS software has a side based on the correct alignment list. The root reported precision value of 91%. form of the main verb, auxiliary verb present in the verb chunk and the associated tense, aspect 3.1.2 MWE Identification on Target Side and modality information are extracted for the Compound nouns are identified on the target source side token. Similarly, root form of the side. 
3.1.3 Verb Chunk / Complex Predicate Alignment

Initially, it is assumed that all the members of the English verb chunk in an aligned sentence pair are aligned with the members of the Bengali complex predicate. Verb chunks are aligned using a statistical aligner. A pattern generator extracts patterns from the source and the target side based on the correct alignment list. The root form of the main verb, any auxiliary verb present in the verb chunk, and the associated tense, aspect and modality information are extracted for the source side token. Similarly, the root form of the Bengali verb and the associated vibhakti (inflection) are identified for the target side token. Similar patterns are extracted for each alignment in the doubtful alignment list.

Each pattern alignment for the entries in the doubtful alignment list is checked against the patterns identified in the correct alignment list. If both the source and the target side patterns of a doubtful alignment match the source and the target side patterns of a correct alignment, then the doubtful alignment is considered a correct one.

The doubtful alignment list is checked again to look for sentence pairs with a single remaining doubtful alignment. Such doubtful alignments are also considered as correct alignments.
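The pattern check can be sketched as a set lookup. Here `pattern_of` stands in for the pattern generator (root plus tense/aspect/modality on the English side, root plus vibhakti on the Bengali side); it is an assumed interface, not the paper's implementation.

```python
def promote_doubtful(correct, doubtful, pattern_of):
    """correct/doubtful: lists of (eng_chunk, ben_chunk) alignment pairs."""
    known = {(pattern_of(e, "en"), pattern_of(b, "bn")) for e, b in correct}
    promoted, remaining = [], []
    for e, b in doubtful:
        if (pattern_of(e, "en"), pattern_of(b, "bn")) in known:
            promoted.append((e, b))     # pattern pair seen among correct ones
        else:
            remaining.append((e, b))    # stays doubtful for the next pass
    return promoted, remaining
```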

The above alignment lists as well as the NE-aligned lists are added to the parallel corpus for creating the SMT model for chunk alignment. This system has reported a BLEU score of 15.12 on the test corpus, a 6.38 point (73% relative) improvement over the baseline system (Pal et al., 2011).

3.2 Automatic chunk alignment

3.2.1 Source chunk extraction

The source corpus is preprocessed after identifying the MWEs using the UCREL tool and single-tokenizing the extracted MWEs. The source sentences of the parallel corpus are parsed using the Stanford POS tagger, and the chunks of the sentences are then extracted using the CRF chunker (http://crfchunker.sourceforge.net/). The CRF chunker detects the boundaries of noun, verb, adjective, adverb and prepositional chunks in the sentences. After detection of the individual chunks, the boundary of each prepositional phrase chunk is expanded by examining a series of noun chunks separated by conjunctions such as comma or 'and', or a single noun chunk followed by a preposition. For each individual chunk, the head word is identified. A synonymous bag of words is generated for each head word. These bags of words produce additional alternative chunks, which are decoded using the best SMT-based system (Section 3.1). In this way, additional translated target chunks are generated for a single source chunk. The following example shows the processing of a sentence; a sketch of the boundary expansion follows the example.

CRF Chunker output:
bodies/NNS/B-NP of/IN/B-PP all/DT/B-NP ages/NNS/I-NP ,/,/O colors/NNS/I-NP and/CC/O sizes/NNS/I-NP don/VB/B-VP the/DT/B-NP very/JJ/I-NP minimum/NN/I-NP in/IN/B-PP beachwear/NN/B-NP and/CC/O idle/VB/B-VP away/RP/B-PRT the/DT/B-NP days/NNS/I-NP on/IN/B-PP the/DT/B-NP sun/NN/I-NP kissed/VBN/I-NP copacabana/NN/I-NP and/CC/O ipanema/NN/I-NP beaches/NNS/I-NP ././O

Noun chunk expansion and boundary detection:
(bodies/NNS/B-NP) (of/IN/B-PP) (all/DT/B-NP ages/NNS/I-NP ,/,/I-NP colors/NNS/I-NP and/CC/I-NP sizes/NNS/I-NP) (don/VB/B-VP) (the/DT/B-NP very/JJ/I-NP minimum/NN/I-NP) (in/IN/B-PP) (beachwear/NN/B-NP) (and/CC/B-O) (idle/VB/B-VP) (away/RP/B-PRT) (the/DT/B-NP days/NNS/I-NP) (on/IN/B-PP) (the/DT/B-NP sun/NN/I-NP kissed/VBN/I-NP copacabana/NN/I-NP and/CC/I-NP ipanema/NN/I-NP beaches/NNS/I-NP) (././B-O)

Prepositional phrase expansion and extraction:
bodies
of all ages, colors and sizes
don
the very minimum
in beachwear
and idle away
the days
on the sun kissed copacabana and ipanema beaches

Figure 1. System architecture of the automatic chunk alignment model
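A rough sketch of the source-side boundary expansion illustrated above: a preposition chunk absorbs a following series of noun chunks separated by conjunctions. The chunk representation and the conjunction set are simplifying assumptions.

```python
CONJ = {",", "and", "or"}

def expand_pp_chunks(chunks):
    """chunks: list of (label, tokens) pairs, e.g. ('PP', ['on'])."""
    out, i = [], 0
    while i < len(chunks):
        label, tokens = chunks[i]
        if label == "PP":
            merged, j = list(tokens), i + 1
            while j < len(chunks):
                nxt_label, nxt_tokens = chunks[j]
                if nxt_label == "NP":
                    merged.extend(nxt_tokens)
                elif (nxt_label == "O" and set(nxt_tokens) <= CONJ
                        and j + 1 < len(chunks) and chunks[j + 1][0] == "NP"):
                    merged.extend(nxt_tokens)   # keep ',' / 'and' inside the PP
                else:
                    break
                j += 1
            out.append(("PP", merged))
            i = j
        else:
            out.append((label, tokens))
            i += 1
    return out

chunks = [("NP", ["bodies"]), ("PP", ["of"]), ("NP", ["all", "ages"]),
          ("O", [","]), ("NP", ["colors"]), ("O", ["and"]), ("NP", ["sizes"]),
          ("VP", ["don"])]
print(expand_pp_chunks(chunks)[1])
# -> ('PP', ['of', 'all', 'ages', ',', 'colors', 'and', 'sizes'])
```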

3.2.2 Target chunk extraction

The target side of the parallel corpus is cleaned and parsed using the shallow parser developed in the consortium-mode project "Development of Indian Language to Indian Language Machine Translation (IL-ILMT) System Phase II", funded by the Department of Information Technology, Government of India. The individual chunks are extracted from the parsed output. An individual chunk boundary is expanded if a noun chunk contains only a single word and several noun chunks occur consecutively. The contents of the individual chunks are examined by checking their POS categories. At the time of boundary expansion, if the system detects words of POS categories other than noun or conjunction, the expansion process stops immediately and the beginning of a new chunk boundary is identified. The IL-ILMT system generates the head word for each individual chunk. The chunks for each sentence are stored in a separate list. This list is used as a validation resource for validating the output of the statistical chunk aligner.

3.2.3 Source-Target chunk Alignment

The extracted source chunks are translated using the generated SMT model. The translated chunks as well as their alternatives are validated against the original target chunks. During validation checking, if any match is found between a translated chunk and a target chunk, the source chunk is directly aligned with the original target chunk. Otherwise, the source chunk is ignored for any possible alignment in the current iteration and is considered again in the next one. After the current iteration is completed, two lists are produced: a chunk-level alignment list and an unaligned source chunk list. The produced alignment lists are added to the parallel corpus as additional training data to produce a new SMT model for the next iteration. The next iteration translates the source chunks in the unaligned list produced by the previous iteration. This process continues until the unaligned source chunk list is empty or no further alignment is identified.

3.2.4 Source-Target chunk Validation

The translated target chunks are validated against the original target chunk list of the same sentence. The extracted noun, verb, adjective, adverb and prepositional chunks of the source side may not have a one-to-one correspondence with the target side, except for the verb chunks. There is no concept of prepositional chunks on the target side, and adjective or adverb chunks may sometimes be treated as noun chunks there. So chunk-level validation for the individual categories of chunks is not possible. Source side verb chunks are compared with the target side verb chunks, while all the other chunks on the source side are compared with all the other chunks on the target side. Head words are extracted for each source chunk, and the translated head words are actually compared on the target side, taking into consideration the synonymous target words. When the validation system returns positive, the source chunk is aligned with the identified original target chunk.
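A hedged sketch of this validation rule; the chunk labels, the synonym lookup and the head-word interface are assumptions made for illustration, not the system's actual data structures.

```python
def validate_chunk(src_label, translated_heads, tgt_chunks, synonyms):
    """tgt_chunks: (label, head) pairs from the target-side shallow parse;
    synonyms: maps a translated head word to an acceptable synonym set."""
    for tgt_label, tgt_head in tgt_chunks:
        # verb chunks only match verb chunks; all other categories pool together
        if (tgt_label == "VP") != (src_label == "VP"):
            continue
        for head in translated_heads:
            if tgt_head == head or tgt_head in synonyms.get(head, ()):
                return (tgt_label, tgt_head)   # aligned original target chunk
    return None                                # chunk stays unaligned this round
```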

4 Tools and Resources used

A sentence-aligned English-Bengali parallel corpus containing 14,187 parallel sentences from the travel and tourism domain has been used in the present work. The corpus has been collected from the consortium-mode project "Development of English to Indian Languages Machine Translation (EILMT) System Phase II". (The EILMT and IL-ILMT projects are funded by the Department of Information Technology (DIT), Ministry of Communications and Information Technology (MCIT), Government of India.) The Stanford Parser (http://nlp.stanford.edu/software/lex-parser.shtml), Stanford NER, the CRF chunker (http://crfchunker.sourceforge.net/) and WordNet 3.0 (http://wordnet.princeton.edu/) have been used for identifying complex predicates on the source English side of the parallel corpus.

The sentences on the target side (Bengali) are parsed and POS-tagged using the tools obtained from the consortium-mode project "Development of Indian Language to Indian Language Machine Translation (IL-ILMT) System Phase II". NEs in Bengali are identified using the NER system of Ekbal and Bandyopadhyay (2008).

The effectiveness of the MWE-aligned and chunk-aligned parallel corpus is demonstrated by using the standard log-linear PB-SMT model as our baseline system: the GIZA++ implementation of IBM word alignment model 4, the phrase-extraction heuristics described in (Koehn et al., 2003), minimum-error-rate training (Och, 2003) on a held-out development set, a target language model trained using the SRILM toolkit (Stolcke, 2002) with Kneser-Ney smoothing (Kneser and Ney, 1995), and the Moses decoder (Koehn et al., 2007).

5 Experiments and Evaluation Results

We randomly identified 500 sentences each for the development set and the test set from the initial parallel corpus. The rest are considered as the training corpus. The training corpus was filtered with a maximum allowable sentence length of 100 words and a sentence length ratio of 1:2 (either way). The final training corpus contains 13,176 sentences. In addition to the target side of the parallel corpus, a monolingual Bengali corpus containing 293,207 words from the tourism domain was used for the target language model. The experiments were carried out with different n-gram settings for the language model and different maximum phrase lengths; we found that a 4-gram language model and a maximum phrase length of 4 produce the optimum baseline result. The rest of the experiments have been carried out using these settings.

The system continues with the various kinds of preprocessing of the corpus. The hypothesis is that as more and more MWEs and chunks are identified and aligned properly, the system shows an improvement in the translation procedure. Table 1 shows the MWE statistics of the parallel training corpus. It is observed from Table 1 that NEs occur with high frequency on both sides compared to other types of MWEs. This suggests that prior alignment of the NEs and complex predicates plays a role in improving the system performance.

Training set            English (T / U)   Bengali (T / U)
CPs                     4874 / 2289       14174 / 7154
Reduplicated words      - / -             85 / 50
Noun-noun compounds     892 / 711         489 / 300
Phrasal prepositions    982 / 779         - / -
Phrasal verbs           549 / 532         - / -
Total NE words          22931 / 8273      17107 / 9106

Table 1. MWE statistics (T - total occurrences, U - unique, CP - complex predicates, NE - named entities)

Single-tokenization of NEs and MWEs of any length on both sides, followed by GIZA++ alignment, has given a huge impetus to system performance (6.38 BLEU points absolute, 73% relative improvement over the baseline). On the source side, the system treats the phrasal prepositions, verb-object combinations and noun-noun compounds as single tokens. On the target side, single-tokenization of reduplicated phrases and noun-noun compounds has been done, followed by alignment using the GIZA++ tool.

Training set    English (T / U)   Bengali (T / U)
Iteration 1     81821 / 70321     65429 / 59627
Iteration 2     67287 / 62575     50895 / 47139
Final           32325 / 31409     15933 / 15654

Table 2. Chunk statistics (T - total occurrences, U - unique)

As can be observed from Table 2, during the first iteration 81,821 chunks are identified in the source corpus, of which 14,534 are aligned by the system. For iteration 2, 67,287 source chunks remain to be aligned. By the final iteration almost 65% of the source chunks have been aligned.

The system performance improves when the alignment lists of NEs and complex predicates as well as the sentence-level aligned chunks are incorporated into the baseline best system. It achieves a BLEU score of 17.37 after the final iteration. This is the best result obtained so far with respect to the baseline system (8.63 BLEU points absolute, a 98.74% relative improvement; see Table 3). It may be observed from Table 3 that baseline Moses without any preprocessing of the dataset produces a BLEU score of 8.74.

Experiments                                             Exp   BLEU    NIST
Baseline                                                 1     8.74    3.98
Best system (alignment of NEs and complex
  predicates and single-tokenization of various MWEs)    2    15.12    4.48
Baseline best system     Iteration 1                     3    15.87    4.49
+ chunk alignment        Iteration 2                     4    16.28    4.51
                         Iteration 3                     5    16.40    4.51
                         Iteration 4                     6    16.68    4.52
                         Final iteration†                7    17.37    4.55

Table 3. Evaluation results for different experimental setups (the '†' marked system produces statistically significant improvements in BLEU over the baseline system)

Intrinsic evaluation of the chunk alignment could not be performed, as gold-standard word alignment was not available. Thus, extrinsic evaluation was carried out on the MT quality using the well-known automatic MT evaluation metrics BLEU (Papineni et al., 2002) and NIST (Doddington, 2002). Bengali is a morphologically rich language and has relatively free phrase order. Proper evaluation of English-Bengali MT ideally requires multiple sets of reference translations. Moreover, the training set was small in size.
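The improvement figures quoted above follow directly from the scores in Table 3 and can be recomputed in two lines:

```python
baseline, final = 8.74, 17.37
print(round(final - baseline, 2))                     # 8.63 BLEU points absolute
print(round(100 * (final - baseline) / baseline, 2))  # 98.74 % relative
```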

6 Conclusions and Future work

A methodology has been presented in this paper to show how simple yet effective preprocessing of various types of MWEs and the alignment of NEs, complex predicates and chunks can boost the performance of a PB-SMT system on an English-Bengali translation task. The best system yields an 8.63 BLEU point improvement over the baseline, a 98.74% relative increase. A subset of the output from the best system has been compared with that of the baseline system, and the output of the best system almost always looks better in terms of either lexical choice or word ordering. It is observed that only 28.5% of the test set NEs appear in the training set, yet prior automatic alignment of the NEs, complex predicates and chunks improves the translation quality. This suggests that not only the NE alignment quality in the phrase table but also the word alignment and phrase alignment quality improves significantly. At the same time, single-tokenization of MWEs makes the dataset sparser, but improves the quality of the MT output to some extent. Data-driven approaches to MT, specifically for scarce-resource language pairs for which very little parallel text is available, should benefit from these preprocessing methods. Data sparseness is perhaps the reason why single-tokenization of NEs and compound verbs, both individually and in combination, did not add significantly to the scores. However, a significantly larger parallel corpus can take care of the data sparseness problem introduced by the single-tokenization of MWEs.

Acknowledgement

The work has been carried out with support from the consortium-mode project "Development of English to Indian Languages Machine Translation (EILMT) System" funded by the Department of Information Technology, Government of India.

References

Agarwal, Aswini, Biswajit Ray, Monojit Choudhury, Sudeshna Sarkar and Anupam Basu. 2004. Automatic Extraction of Multiword Expressions in Bengali: An Approach for Miserly Resource Scenario. In Proceedings of the International Conference on Natural Language Processing (ICON), pp. 165-174.

Baldwin, Timothy and Su Nam Kim. 2010. Multiword Expressions. In Nitin Indurkhya and Fred J. Damerau (eds.), Handbook of Natural Language Processing, Second Edition, CRC Press, Boca Raton, USA, pp. 267-292.

Banerjee, Satanjeev, and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL-2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, Ann Arbor, Michigan, pp. 65-72.

Carpuat, Marine, and Mona Diab. 2010. Task-based Evaluation of Multiword Expressions: a Pilot Study in Statistical Machine Translation. In Proceedings of the Human Language Technology Conference and the North American Chapter of the Association for Computational Linguistics Conference (HLT-NAACL 2010), Los Angeles, CA.

Chakraborty, Tanmoy and Sivaji Bandyopadhyay. 2010. Identification of Reduplication in Bengali Corpus and their Semantic Analysis: A Rule Based Approach. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), Workshop on Multiword Expressions: from Theory to Applications (MWE 2010), Beijing, China.

Das, Dipankar, Santanu Pal, Tapabrata Mondal, Tanmoy Chakraborty and Sivaji Bandyopadhyay. 2010. Automatic Extraction of Complex Predicates in Bengali. In Proceedings of the Workshop on Multiword Expressions: from Theory to Applications (MWE 2010), COLING 2010, Beijing, China, pp. 37-46.

Doddington, George. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research (HLT-2002), San Diego, CA, pp. 128-132.

Eck, Matthias, Stephan Vogel, and Alex Waibel. 2004. Improving statistical machine translation in the medical domain using the Unified Medical Language System. In Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004), Geneva, Switzerland, pp. 792-798.

Ekbal, Asif, and Sivaji Bandyopadhyay. 2009. Voted NER system using appropriate unlabeled data. In Proceedings of the ACL-IJCNLP-2009 Named Entities Workshop (NEWS 2009), Suntec, Singapore, pp. 202-210.

Huang, Young-Sook, Kyonghee Paik and Yutaka Sasaki. 2004. Bilingual Knowledge Extraction Using Chunk Alignment. PACLIC 18, Tokyo, pp. 127-138.

Kneser, Reinhard, and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, Detroit, MI, pp. 181-184.

Koehn, Philipp, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of HLT-NAACL 2003, Edmonton, Canada, pp. 48-54.

Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007): demo and poster sessions, Prague, Czech Republic, pp. 177-180.

Koehn, Philipp. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP-2004), Barcelona, Spain, pp. 388-395.

Ma, Yanjun, Nicolas Stroppa, and Andy Way. 2007. Bootstrapping word alignment via word packing. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic, pp. 304-311.

Moore, Robert C. 2003. Learning translations of named-entity phrases from parallel corpora. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2003), Budapest, Hungary, pp. 259-266.

Och, Franz J. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL-2003), Sapporo, Japan, pp. 160-167.

Pal, Santanu, Sudip Kumar Naskar, Pavel Pecina, Sivaji Bandyopadhyay and Andy Way. 2010. Handling Named Entities and Compound Verbs in Phrase-Based Statistical Machine Translation. In Proceedings of the Workshop on Multiword Expressions: from Theory to Applications (MWE 2010), COLING 2010, Beijing, China, pp. 46-54.

Pal, Santanu, Tanmoy Chakraborty and Sivaji Bandyopadhyay. 2011. Handling Multiword Expressions in Phrase-Based Statistical Machine Translation. In Machine Translation Summit XIII, Xiamen, China, pp. 215-224.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-2002), Philadelphia, PA, pp. 311-318.

Rayson, Paul, Dawn Archer, Scott Piao, and Tony McEnery. 2004. The UCREL Semantic Analysis System. In Proceedings of the LREC-04 Workshop: Beyond Named Entity Recognition - Semantic Labelling for NLP Tasks, Lisbon, Portugal, pp. 7-12.

Ren, Zhixiang, Yajuan Lü, Jie Cao, Qun Liu, and Yun Huang. 2009. Improving statistical machine translation using domain bilingual multiword expressions. In Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, Suntec, Singapore, pp. 47-54.

Stolcke, Andreas. 2002. SRILM - An Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing, vol. 2, Denver, pp. 901-904.

Tanaka, Takaaki and Timothy Baldwin. 2003. Noun-Noun Compound Machine Translation: A Feasibility Study on Shallow Processing. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, Sapporo, Japan, pp. 17-24.

Wu, Hua, Haifeng Wang, and Chengqing Zong. 2008. Domain adaptation for statistical machine translation with domain dictionary and monolingual corpora. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), Manchester, UK, pp. 993-1000.

Xuan-Hieu Phan. 2006. CRFChunker: CRF English Phrase Chunker. http://crfchunker.sourceforge.net/.

Zhou, Yu, Chengqing Zong and Bo Xu. 2004. Bilingual Chunk Alignment in Statistical Machine Translation. In IEEE International Conference on Systems, Man and Cybernetics, pp. 1401-1406.

Design of a hybrid high quality machine translation system

Kurt Eberle, Johanna Geiß, Mireia Ginestí-Rosell
Lingenio GmbH
Karlsruher Straße 10, 69126 Heidelberg, Germany
[k.eberle,j.geiss,m.ginesti-rosell]@lingenio.de

Bogdan Babych, Anthony Hartley, Reinhard Rapp, Serge Sharoff, Martin Thomas
Centre for Translation Studies, University of Leeds
Leeds, LS2 9JT, UK
[B.Babych,A.Hartley,R.Rapp,S.Sharoff,M.Thomas]@leeds.ac.uk

Abstract

This paper gives an overview of the ongoing FP7 project HyghTra (2010-2014). The HyghTra project is conducted in a partnership between academia and industry involving the University of Leeds and Lingenio GmbH. It adopts a hybrid and bootstrapping approach to the enhancement of MT quality by applying analysis and statistical evaluation techniques to both parallel and comparable corpora in order to extract linguistic information and enrich the lexical and syntactic resources of the underlying (rule-based) MT system that is used for analysing the corpora. The project places special emphasis on the extension of systems to new language pairs and the corresponding rapid, automated creation of high-quality resources. The techniques are fielded and evaluated within an existing commercial MT environment.

1 Motivation

Statistical Machine Translation (SMT) has been around for about 20 years, and for roughly half of this time SMT and 'traditional' Rule-Based Machine Translation (RBMT) have been seen as competing paradigms. During the last decade, however, there has been a trend and growing interest in combining the two methodologies. In our approach these two paradigms are viewed as complementary.

Advantages of SMT are low cost and robustness, but definite disadvantages of (pure) SMT are that it needs huge amounts of data, which for many language pairs are not available and are unlikely to become available in the future. Also, SMT tends to disregard important classificatory knowledge (such as morphosyntactic, categorical and lexical class features), which can be provided and used relatively easily within non-statistical, rule-based analysis and representations.

On the other hand, advantages of RBMT are that its (grammar and lexicon) rules and information are understandable by humans and can be exploited for a lot of applications outside of translation (dictionaries, text understanding, dialogue systems, etc.).

The slot grammar approach used in Lingenio systems (cf. McCord 1989, Eberle 2001) is a prime example of such linguistically rich representations that can be used for a number of different applications. Fig. 1 shows this by a visualization of (an excerpt of) the entry for the ambiguous German verb einstellen in the database that underlies (a) the Lingenio translation products, where it links up with the corresponding set of transfer rules, and (b) Lingenio's dictionary product TranslateDict, which is primarily intended for human translators.

Fig. 1 a) Database entry for einstellen ('translation' represents links between SL and TL entries)

Fig. 1 b) Product entry for einstellen

The obvious disadvantages of RBMT are high cost and weaknesses in dealing with incorrect input and in making correct choices with respect to ambiguous words, structures, and transfer equivalents. SMT output is often surprisingly good with respect to short-distance collocations, but correct choices are often missed in cases where selectional restrictions take effect on distant words. RBMT output is generally good if the parser assigns the correct analysis to a sentence and if the target words can be correctly chosen from the set of alternatives. However, in the presence of ambiguous words and structures, and where linguistic information is lacking, the decisions may be wrong.

Given the complementarity of SMT and RBMT and their very different strengths and weaknesses, we take the view that an optimized MT architecture must comprise elements of both paradigms. The key issue therefore lies in the identification of such elements and how to connect them to each other. We propose a specific type of hybrid translation - hybrid high quality translation (HyghTra) - where core RBMT systems are created and enhanced by a range of reliable statistical techniques.

2 Development Methodology

Many hybrid systems described in the literature have attempted to put some analytical abstraction on top of an SMT kernel. (A prominent early example is Frederking and colleagues (Frederking & Nirenburg, 1994). For an overview of hybrid MT till the late nineties see Streiter et al. (1999). More recent approaches include Groves & Way (2006a, 2006b). Commercial implementations include AppTek (http://www.apptek.com) and Language Weaver (http://www.languageweaver.com). An important ongoing project investigating hybrid MT methods is EuroMatrixPlus (http://www.euromatrixplus.net/).) In our view this is not the best option because, according to the underlying philosophy, SMT is linguistically ignorant at the beginning and learns all linguistic rules automatically from corpora. However, the extracted information is typically represented in huge datasets which are not readable by humans in a natural way. This means that this type of architecture does not easily provide interfaces for incorporating linguistic knowledge in a canonical and simple way.

Thus we approach the problem from the other end, integrating information derived from corpora using statistical methods into RBMT systems. Provided the underlying RBMT systems are linguistically sound and sufficiently modular in structure, we believe this to have greater potential for generating high-quality output.

We currently use and carry out the following work plan:

(I) Creation of MT systems (with a rule-based core MT component and statistical extension and training):

(a) We start out with declarative analysis and generation components for the languages considered, and with basic bilingual dictionaries connecting to one another the entries of relatively small vocabularies comprising the most frequent words of each language in a given translation pair (cf. Fig. 1a).

(b) Having completed this phase, we extend the dictionaries and train the analysis, transfer and generation components of the rule-based core systems using monolingual and bilingual corpora.
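As a purely speculative sketch of step (b), dictionary candidates can be harvested from a word-aligned parallel corpus (alignment links e.g. from GIZA++) by thresholding link counts and lexical translation probabilities. The thresholds and the interface are assumptions made for illustration; the project's actual extraction into Lingenio's lexicon resources is certainly more involved.

```python
from collections import Counter

def dictionary_candidates(aligned_pairs, existing, min_count=5, min_prob=0.5):
    """aligned_pairs: iterable of (src_word, tgt_word) alignment links;
    existing: set of (src, tgt) pairs already in the basic dictionary."""
    links, src_totals = Counter(aligned_pairs), Counter()
    for (s, _), c in links.items():
        src_totals[s] += c
    return {(s, t): c / src_totals[s]          # rough p(t | s) estimate
            for (s, t), c in links.items()
            if (s, t) not in existing
            and c >= min_count
            and c / src_totals[s] >= min_prob}
```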

(II) Error detection and improvement cycle:
(a) We automatically discover the most frequent problematic grammatical constructions and multiword expressions for commercial RBMT and SMT systems, using automatic construction-based evaluation as proposed in (Babych and Hartley, 2009), and develop a framework for fixing the corresponding grammar rules and extending the grammatical coverage of the systems in a semi-automatic way. This shortens development time for commercial MT and contributes to yielding significantly higher translation quality.

(III) Extension to other languages:
Structural similarity and translation by pivot languages are used to obtain extension to further languages. High-quality translation between closely related languages (e.g., Russian and Ukrainian or Portuguese and Spanish) can be achieved with relatively simple resources (using linguistic similarity, but also homomorphism assumptions with respect to parallel text, if available), while greater efforts are put into ensuring better quality translation between more distant languages (e.g. German and Russian). According to our prior research (Babych et al., 2007b), pipelining between languages of different similarity results in improved translation quality for a larger number of language pairs (e.g., MT from Portuguese or Ukrainian into German is easier if there are high-quality analysis and transfer modules for Spanish and Russian into German, respectively). Of course, (III) draws heavily on the detailed analysis and MT systems that the industrial partner in HyghTra provides for a number of languages.

In the following sections we give more details of the work currently done with regard to (I) and with regard to parts of (II): the creation of a new MT system following the strategy sketched. We cannot go further into detail with (II) and (III) here, which will become a priority for future research.
3 Creation of a new system

Early pilot studies covering some aspects of the strategy described here (using information from pivot languages and similarity) showed promising results (Rapp, 1999; Rapp & Martín-Vide, 2007; see also Koehn & Knight, 2002).

We expect that the proposed semi-automatic creation of a new MT system as sketched above will work best if one of the two languages involved is already 'known' by modules to which the system has access. Against the background of the pipeline approach mentioned above in (III), this means that we assume an analysis and translation system that continuously grows by 'learning' new languages, where 'learning' is facilitated by information about the languages already 'known' and by exploiting similarity assumptions – and, of course, by being fed with information prepared and provided by the human 'companion' of the system.

From this perspective, we assume the following steps of extending the system (with work done by the 'companion' and work done by the system):

1. Acquire parallel and comparable corpora.
2. Define a core of the morphology of the new language and compile a basic dictionary for the most frequent words and translations. Morphological representations and features for new languages are derived both manually and automatically, as proposed in (Babych et al., 2012 (in preparation)).
3. Using established alignment technology (e.g. Giza++) and parallel corpora, generate a first extension of this dictionary.
4. Expand the dictionary of step 3 using comparable corpora as proposed in a study by Rapp (1999). This is applicable mainly to single-word units.
5. Expand coverage of multiword units using novel technology.
6. Cross-validate the new dictionary with respect to available ones by transitivity.
7. Integrate the new dictionary into the new MT system as it develops, reusing components and adding new components as in 8.
8. Complete the morphology and spell out a declarative analysis and generation grammar for the new language.
9. Automatically evaluate the translations of the most frequent grammatical constructions and multiword expressions in a machine-translated corpus, prioritising support for these constructions with a type of risk-assessment framework proposed in Babych and Hartley (2008).
10. Extend support for high-priority constructions semi-automatically by mining correct translations from parallel corpora.

11. Train and evaluate the new grammar and transfer of the new MT system using the new dictionary on the basis of available parallel corpora.

The following sections give an overview of the different steps.

Step 1: Acquire parallel and comparable corpora

As our parallel corpus, we use the Europarl. The size of the current version is up to 40 million words per language, and several of the languages we are currently considering are covered. Also, we make use of other parallel corpora such as the Canadian Hansards (Proceedings of the Canadian Parliament) for the English–French language pair. For non-EU languages (mainly Russian), we intend to conduct a pilot study to establish the feasibility of retrieving parallel corpora from the web, a problem for which various approaches have been proposed (Resnik, 1999; Munteanu & Marcu, 2005; Wu & Fung, 2005).

In addition to the parallel corpora, we will need large monolingual corpora in the future (at least 200 million words) for each of the six languages. Here, we intend to use newspaper corpora supplemented with text collections downloadable from the web.

The corpora are stored in a database that allows for assigning analyses of different depth and nature to the sentences, and for alignment between the sentences and their analyses. The architecture of this database and the corresponding analysis and evaluation frontend is described in (Eberle et al. 2010, 2012). The section Results contains examples of such representations.

Step 2: Compile a basic dictionary for the most frequent words

A prerequisite of the suggested hybrid approach with rule-based kernel is to define morphological classifications for the new language(s). This is done by exploiting similarities to the classifications available for the existing languages. Currently, this has been carried out for Dutch (on the basis of German) and for Spanish (on the basis of French/other Romance languages). The most frequent words (the basic vocabulary of a language) are typically also the most ambiguous ones. Since the Lingenio systems are lexically driven transfer systems (cf. Eberle 2001), we define (a) structural conditions, which inform the choice of the possible target words (single words or multiword expressions), and (b) restructuring conditions, as necessary (cf. Fig. 1a: attributes 'transfer conditions' and 'structural change'). In order to ensure quality this must be done by human lexicographers and is therefore costly for a large dictionary. However, we manually create only very small basic dictionaries and extend these (semi-automatically) in step 3 and those which follow.

Some important morphosyntactic features of the language are derived from a monolingual corpus annotated with publicly available part-of-speech taggers and lemmatisers. However, these tools often do not explicitly represent linguistic features needed for the generation stage in RBMT. In (Babych et al., 2012) we propose a systematic approach to recovering such missing generation-oriented representations from grammar models and statistical combinatorial properties of annotated features.
Step 3: Generating dictionary extensions from parallel corpora

Based on parallel corpora, dictionaries can be derived using established techniques of automatic sentence alignment and word alignment. For sentence alignment, the length-based Gale & Church aligner (1993) can be used, or – alternatively – Dan Melamed's GSA algorithm (Geometric Sentence Alignment; Melamed, 1999). For segmentation of text we use corresponding Lingenio tools (unpublished). [2] For word alignment, Giza++ (Och & Ney, 2003) is the standard tool. Given a word alignment, the extraction of an (SMT) dictionary is relatively straightforward. With the exception of sentence segmentation, these algorithms are largely language independent and can be used for all of the languages that we consider.

[2] If these cannot be applied because of lack of information about a language, we intend to use the algorithm by Kiss & Strunk (2006). An open-source implementation of parts of the Kiss & Strunk algorithm is available from Patrick Tschorn at http://www.denkselbst.de/sentrick/index.html.
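As an illustration of how such a dictionary can be read off a word alignment, the following sketch counts aligned word pairs and keeps the dominant translations. It is a minimal reading of the procedure, not the Lingenio implementation; the alignment input format and the thresholds are assumptions.

```python
from collections import Counter

def extract_dictionary(bitext, alignments, min_count=3, min_share=0.3):
    """Derive a raw bilingual dictionary from word-aligned sentence pairs.

    bitext     -- list of (source_tokens, target_tokens) pairs
    alignments -- per sentence pair, a list of (i, j) index links such as
                  those produced by a word aligner like Giza++ (the exact
                  I/O format is an assumption of this sketch)
    """
    pair_counts, src_counts = Counter(), Counter()
    for (src, tgt), links in zip(bitext, alignments):
        for i, j in links:
            pair_counts[(src[i], tgt[j])] += 1
            src_counts[src[i]] += 1
    dictionary = {}
    for (s, t), c in pair_counts.items():
        # keep a candidate only if it is frequent enough and accounts for
        # a substantial share of all alignments of the source word
        if c >= min_count and c / src_counts[s] >= min_share:
            dictionary.setdefault(s, []).append(t)
    return dictionary
```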

We did this for a number of language pairs on the basis of the Europarl texts considered (as stored in our database). In order to optimize the results we use the dictionaries of step 2 as a set of cognates (cf. Simard et al. 1992, Gough & Way 2004), as well as other words easily obtainable from the internet that can be used for this purpose (like company names and other named entities with cross-language identity, and terminology translations). Using the morphology component of the new language and the categorial information from the transfer relation, we compute the basic forms of the inflected words found. Later, we intend to further improve the accuracy of word alignment by exploiting chunk-type syntactic information from the narrow context of the words (cf. Eberle & Rapp 2008). An early-stage variant of this is already used in Lingenio products: the corresponding function AutoLearn extracts new word relations on the basis of existing dictionaries and (partial) syntactic analyses (Fig. 2 gives an example).

[Fig. 2: AutoLearn: new entries using transfer links and syntactic analysis]

Given the relatively small size of the available parallel corpora, we expect that the automatically generated dictionaries will comprise about 20,000 entries each (this corresponds to first results on the basis of German↔English). This is far too small for a serious general-purpose MT system. Note that, in comparison, the English↔German dictionary used in the current Lingenio MT product comprises more than 480,000 keywords and phrases.
Step 4: Expanding dictionaries using comparable corpora (word equations)

In order to expand the dictionaries using a set of monolingual comparable corpora, the basic approach pioneered by Fung & McKeown (1997) and Rapp (1995, 1999) is to be further developed and refined in the second phase of the project, so as to obtain a practical tool that can be used in an industrial context.

The basic assumption underlying the approach is that across languages there is a correlation between the co-occurrences of words that are translations of each other. If – for example – in a text of one language two words A and B co-occur more often than expected by chance, then in a text of another language those words that are the translations of A and B should also co-occur more frequently than expected. It is further assumed that a small dictionary (as generated in step 2) is available at the beginning, and that the aim is to expand this basic lexicon. Using a corpus of the target language, first a co-occurrence matrix is computed whose rows are all word types occurring in the corpus and whose columns are all target words appearing in the basic lexicon. Next, a word of the source language is considered whose translation is to be determined. Using the source-language corpus, a co-occurrence vector for this word is computed. Then all known words in this vector are translated to the target language. As the basic lexicon is small, only some of the translations are known. All unknown words are discarded from the vector and the vector positions are sorted in order to match the vectors of the target-language matrix. Using standard measures for vector similarity, the resulting vector is compared to all vectors in the co-occurrence matrix of the target language. The vector with the highest similarity is considered to be the translation of our source-language word.

From a previous pilot study (Rapp, 1999) it can be expected that this methodology achieves an accuracy in the order of 70%, which means that only a relatively modest amount of manual post-editing is required.

The automatically generated results are improved, and the amount of post-editing is reduced, by exploiting sense (disambiguation) information as available from the analysis component for the 'known' language of the new language pair. Also, we try to exploit categorial and underspecified syntactic information from the contexts of the words, similar to what has been suggested for improving word alignment in the previous step (see also Fig. 2). Also, as the frequent words are already covered by the basic lexicon (whose production from parallel corpora on the basis of a manually compiled kernel does not show an ambiguity problem of similar significance), and as experience shows that most low-frequency words in a full-size lexicon tend to be unambiguous, the ambiguity problem is reduced further for the words investigated and extracted by this comparison method.
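The core of this vector-comparison step can be written down compactly. The sketch below is a simplified rendering of the published method under stated assumptions (one-to-one seed entries, pre-computed co-occurrence counts); it is not the project's actual tool.

```python
import numpy as np

def translate_by_cooccurrence(word, src_cooc, tgt_matrix, tgt_words, seed_dict):
    """Pick the target word whose co-occurrence vector is most similar.

    src_cooc   -- dict: source word -> {context word: co-occurrence count}
    seed_dict  -- small source->target dictionary; its entry order fixes
                  the column order of both vector spaces
    tgt_matrix -- one row per target word type, one column per seed entry
                  (co-occurrence counts with the entry's target side)
    tgt_words  -- row labels of tgt_matrix
    """
    # build the source vector over the seed entries; context words that
    # are not in the seed dictionary are discarded, as described above
    vec = np.array([src_cooc[word].get(s, 0) for s in seed_dict], dtype=float)
    # cosine similarity between the translated vector and every row
    denom = np.linalg.norm(tgt_matrix, axis=1) * (np.linalg.norm(vec) or 1.0)
    sims = (tgt_matrix @ vec) / np.where(denom == 0.0, 1.0, denom)
    return tgt_words[int(np.argmax(sims))]
```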

Step 5: Expanding dictionaries using comparable corpora (multiword units)

In order to account for technical terms, idioms, collocations, and typical short phrases, an important feature of an MT lexicon is a high coverage of multiword units. Very recent work conducted at the University of Leeds (Sharoff et al., 2006) shows that dictionary entries for such multiword units can be derived from comparable corpora if a dictionary of single words is available. It could even be shown that this methodology can be superior to deriving multiword units from parallel corpora (Babych et al., 2007). This is a major breakthrough, as comparable corpora are far easier to acquire than parallel corpora. It even opens up the possibility of building domain-specific dictionaries by using texts from different domains.

The outline of the algorithm is as follows (a sketch is given at the end of this step):

• Extract collocations from a corpus of the source language (Smadja, 1993)
• To translate a collocation, look up all its words using any dictionary
• Generate all possible permutations (sequences) of the word translations
• Count the occurrence frequencies of these sequences in a corpus of the target language and test for significance
• Consider the most significant sequence to be the translation of the source-language collocation

Of course, in later steps of the project, we will experiment on filtering these sequences by exploiting structural knowledge, similarly to what was described in the two previous steps. This can be obtained on the basis of the declarative analysis component of the new language, which is developed in parallel.
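A minimal reading of this outline in code, assuming a hypothetical tgt_count lookup for target n-gram frequencies and a single translation candidate per word (real dictionaries return several, whose Cartesian product would be scored as well):

```python
from itertools import permutations

def translate_collocation(colloc, dictionary, tgt_count, min_count=2):
    """Translate a source collocation by scoring permutations of its
    word-by-word translations against a target corpus."""
    # look up each word; words missing from the dictionary are skipped
    words = [dictionary[w][0] for w in colloc if w in dictionary]
    best, best_count = None, 0
    for seq in permutations(words):
        count = tgt_count(" ".join(seq))
        if count > best_count:
            best, best_count = seq, count
    # a proper significance test against unigram frequencies would go here
    return best if best_count >= min_count else None
```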
Step 6: Cross-validate dictionaries

The combination of the corpus-based methods for automatic dictionary generation as described in steps 3 to 5 will lead to high-coverage dictionaries, as the availability of very large monolingual corpora is no major problem for our languages. However, as all steps are error-prone, it can be expected that a considerable number of dictionary entries (e.g. 50%) are not correct. To facilitate (but not eliminate) the manual verification of the dictionary, we will perform an automatic cross-check which utilizes the dictionaries' property of transitivity. What we mean by this is that if we have two dictionaries, one translating from language A to language B, the other from language B to language C, then we can also translate from language A to C by use of the intermediate language (or interlingua) B. That is, the property of transitivity, although having some limitations due to ambiguity problems, can be exploited to automatically generate a raw dictionary for A to C. Lingenio has some experience with this method, having exploited it for extending and improving its English↔French dictionaries using French↔German and German↔English.

As the corpus-based approach (steps 3 to 5) allows us to also generate this type of dictionary via comparable corpora, we have two different ways to generate a dictionary for a particular language pair. This means that we can validate one with the other. Furthermore, with an increasing number of language pairs created, there are more and more languages that can serve as interlingua or 'pivot': this, step by step, gives an increasing potential for mutual cross-validation.

Specific attention will be paid to automating as far as possible the creation of selectional restrictions to be assigned to the transfer relations of the new dictionaries in all steps of dictionary creation (2–6). We will try to do this on the basis of the analysis components as available for the languages considered. These are: a completely worked-out analysis component for the 'old' language, and a declarative (chunk parsing) component for the new one (compare the two following steps for this).
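The transitivity check itself reduces to composing two dictionaries and flagging entries without a pivot path. A sketch, assuming dictionaries map a word to a list of translations; because of pivot ambiguity, a missing path is only a hint for manual verification, not proof of an error:

```python
def compose(dict_ab, dict_bc):
    """Compose A->B and B->C dictionaries into a raw A->C dictionary."""
    dict_ac = {}
    for a, bs in dict_ab.items():
        targets = {c for b in bs for c in dict_bc.get(b, [])}
        if targets:
            dict_ac[a] = sorted(targets)
    return dict_ac

def flag_unsupported(candidate_ac, dict_ab, dict_bc):
    """Return the A->C entries that cannot be reached via the pivot B."""
    composed = compose(dict_ab, dict_bc)
    return {(a, c)
            for a, cs in candidate_ac.items()
            for c in cs
            if c not in composed.get(a, [])}
```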

Step 7: Integrate dictionaries in existing machine translation systems

Lingenio has a relatively rich infrastructure for the automatic importation of various kinds of lexical information into the database used by the analysis and translation systems. If necessary, the information on hand (for instance from conventional dictionaries of publishing houses) is completed and normalized during or before importation. This may be executed completely automatically – by using the existing analysis components and resources respectively as databases – or interactively – by asking the lexicographer for additional information, if needed.

For example, there may be a list of multiword expressions to be imported into the database. In order to have correct syntactic and semantic information available for these expressions, they are analysed by the parser of the corresponding language. From the analysis found, the information necessary to describe the new lemma in the lexicon with respect to semantic type and syntactic structure is obtained. The same information is used to automatically create correct restructuring constraints for translation relations which use the new lemma as target. If the parser does not find a sound syntactic description, for example because some basic information or part of the expression is missing in the lexical database, the lexicographer is asked for the missing information or is handed the expression to code it manually. Using these tools, the importation of new lexical information, as provided in the previous steps, is considerably accelerated.

Step 8: Compile rule bases for new language pairs

Although experience clearly shows that construction and maintenance of the dictionaries is by far the most expensive task in (rule-based) Machine Translation, the grammars (analysis and generation) must of course be developed and maintained also. Lingenio has long-standing experience with the development of grammars, dictionaries and all other components of RBMT.

The grammar formalism used (slot grammar, cf. McCord 1991) is unification-based and its structuring focuses on dependency, where phrases are analysed into heads and grammatical roles – so-called (complement and adjunct) slots.

The grammar formalism and basic rule types are designed in a very general way in order to allow good portability from one language to another, such that spelling out the declarative part of a grammar does not take very much time (2-4 person months approx. for relatively similar languages like Romance languages, according to our experience). The portation of linguistic rules to new languages is also facilitated by the modular design with clearly defined interfaces that makes it relatively straightforward to integrate information from corpora.

Given a parallel corpus as acquired in step 1, the following procedure defines grammar development:

1. Define a declarative grammar for the new language and train this grammar on the parallel corpus according to the following steps:
2. Use a chunk parser for the grammar on the basis of an efficient part-of-speech tagger for the new language.
3. Combine the chunk analyses of the sentence, according to suggestions for packed syntactic structures (cf. Schiehlen 2001 and others) and underspecified representation structures respectively (cf. Eberle, 2004, and others), such that the result represents a disjunction of the possible analyses of the sentence.
4. Filter the alternatives of the representation by using mapping constraints between source and target sentence as can be computed from the lexical transfer relations and the structural analysis of the sentence. For instance, if we know, as in the example of the last section, that in the source sentence there is a relative clause with lexical elements A, B, ... modifying a head H, and that there are translations TH, TA, TB, ... of H, A, B, ... in the target sentence which, among other possibilities, can be supposed to stand in a similar structural relation there, then we prefer this relation to the competing structural possibilities. (Fig. 3 in section Results shows the corresponding selection for a German-Spanish example in the project database.)
5. For each of the remaining structural possibilities of the thus revised underspecified representation, take its lexical material and underspecified structuring as a context for its successful firing.
For instance, if the possibility is left that O is the direct object of VP, where VP is an underspecified verbal phrase and O an underspecified nominal phrase (i.e. where details of the substructuring are not spelled out), take the sentence as a reference for direct-object complementation, and O and VP as contexts which accept this complementation.

6. Develop more abstract conditions from the conditions learned according to (5) and integrate the different cases.
7. Tune the results using standard methods of corpus-based linguistics. Among other things this means: distinguish between training and test corpora, adjust weights according to the results of test runs, etc.

The basic idea of the proposed learning procedure is similar to that used with respect to learning lexical transfer relations: do not define the statistical model for the 'ignorant' state, where the surface items of the bilingual corpora are considered. Instead, define it for appropriate maximally abstract analyses of the sentences (which, of course, must be available automatically), because then much smaller sets of data will do. Here, the important question is: what is the most abstract level of representation that can be reached automatically and which shows reliable results? We think that it is the level of underspecified syntactic description as used in the procedure above.

The result of training the grammar is a set of rules which assign weights and contexts to each filler rule of the declarative grammar, and thus allow us to estimate how likely it is that a particular rule is applied in a particular context in comparison with other rules (Fig. 4 and 5 in section Results give an overview of the relevance of grammar rules and their triggering conditions w.r.t. German).

We mentioned that the task of translating texts into each other does not presuppose that each ambiguity in a source sentence is resolved. On the contrary, translation should be ambiguity-preserving (cf. Kay, Gawron & Norvig 1994, compare the example above). It is obvious that underspecified syntactic representations as suggested here are also especially suited for preserving ambiguities appropriately.
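Returning to the training result described above: the weight and context estimation can be approximated by relative frequencies over the selected analyses. A sketch, under the assumption that the creation protocols have been reduced to (rule_id, context) pairs with contexts encoded as hashable feature tuples:

```python
from collections import Counter

def estimate_rule_weights(protocols):
    """Estimate P(rule | context) by relative frequency.

    protocols -- one list of (rule_id, context) pairs per selected
                 ('gold') analysis, extracted from its creation protocol
    """
    rule_in_context, context_total = Counter(), Counter()
    for protocol in protocols:
        for rule_id, context in protocol:
            rule_in_context[(rule_id, context)] += 1
            context_total[context] += 1
    return {(rule_id, context): n / context_total[context]
            for (rule_id, context), n in rule_in_context.items()}
```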
allowMTdeveloperstoworkdowntheprioritylist and correct or extend coverage for those Step 9: Automatically evaluate translations of constructions which will have the biggest impact the most frequent grammatical constructions onMTquality. and multiword expressions in a machine- translated corpus Step 10: Extend support for high-priority constructions semi-automatically by mining Inalaterworkpackageoftheproject,wewillrun correct translations from parallel corpora a large parallel corpus through available (competitive)MTengines,whichwillbeenhanced At this stage we will automate the procedure of by automatic dictionaries developed during the correcting errors and extending coverage for

Step 10: Extend support for high-priority constructions semi-automatically by mining correct translations from parallel corpora

At this stage we will automate the procedure of correcting errors and extending coverage for problematic MWEs and grammatical constructions identified in Step 9. For this we will exploit the alignment between source-language sentences and gold-standard human translations. In the target human translations we will identify linguistically motivated multiword expressions, e.g., using part-of-speech patterns or tf-idf distribution templates (Babych et al., 2007), and run standard alignment tools (e.g., GIZA++) for finding the most probable candidate MWEs that correspond to the problematic source-language expressions. Source and target MWEs paired in this way will form the basis for automatically generated grammar rules. The rules will normally generalise several pairs of MWEs, and may be underspecified for certain lexical or morphological features. Later, such rules will be manually checked and corrected by language specialists in MT development teams that work on specific translation directions.

This procedure will speed up the grammar development procedure for large-scale MT projects and will focus on the grammatical constructions with the highest impact on MT quality, establishing them as a top priority for MT developers. In HyghTra, and with respect to the languages considered there, this procedure will be integrated into the grammar development and optimization of step 8; in particular, it will be related to step 4 of the procedure sketched there. With regard to integration, we aim at an interleaved architecture in the long run.

Step 11: Bootstrap the system

In Step 11, the new grammar and transfer of the new MT system and the new dictionary may be mutually trained further, using the steps before and applying the system to additional corpora.

4 Results

Declarative slot grammars for Dutch and Spanish have been developed using the patterns of German and French – where declarative means that no relevant semantic or other information has been used in order to spell out weighting or filters for rule application, the only constraint being morphosyntactic accessibility. The necessary morphological information has been adapted similarly from the corresponding model languages.

The basic dictionaries have been compiled manually (Dutch) or extracted from a conventional electronic dictionary (translateDict Spanish).

For a subset of the Spanish corpus (reference sentences of the grammar, parts of the open-source Leeds corpus (Sharoff, 2006), and Europarl), syntactic analyses have been computed and stored in the database. As the number of analyses grows extremely with the length of sentences, only relatively short sentences (up to 15 words) have been considered. These analyses are currently compared to the analyses of the German translations of the corresponding sentences (one translation per sentence), which are taken as a kind of 'gold' standard, as the German analysis component (as part of the translation products) has proven to be sufficiently reliable. On the basis of the comparison, a preference over the competing analyses of the Spanish sentence is entailed and used for defining a statistical evaluation component for the Spanish grammar. Fig. 3 shows the corresponding representations in the database for the sentence Aumenta la demanda de energía eléctrica por la ola de calor [3] and its translation die Nachfrage nach Strom steigt wegen der Hitzewelle / the demand for electricity increases because of the heat-wave.

[3] Sentence taken from the online newspaper El Día de Concepción del Uruguay.
[Fig. 3: Selection of analyses via correspondences (prefer first Spanish analysis because of subj-congruity)]

The analyses are associated with the corresponding creation protocols, which are structured lists whose items describe, via the identifiers, which rule has been applied when, and to what structures, in the process of creating the analysis. From the selection of a best analysis for a sentence, we can entail the circumstances under which the application of particular rules is preferred. This has been carried out not yet for the 'new' language Spanish, but for the 'known' language German, in order to obtain a measure of how correctly the existing grammar evaluation component can be replaced by the results of the corresponding statistical study.

[Fig. 4: Frequency of applications of rules]

cluster applications   similarity   feas mod        feas head
383, 384, ...          0.86         sent, ...       emosentaffv, ...
557, 558, 566, ...     0.68         denselb, ...    gebv, ...

[Fig. 5: Preliminary constraints related to grammar rule clusters]

Fig. 4 shows the distribution of rule usages within the training set of analyses (of approx. 30,000 sentences). 390 different rules were used, with a total of 133,708 rule applications. The subject rule (383) and the noun-determiner rule (46) are the most used rules (35% of all applications). Fig. 5 illustrates the preliminary results of a clustering algorithm where different rule applications are grouped into clusters and the key features of the head and modifier phrases for each cluster are extracted.

Currently, we try to further determine and calibrate the linguistic features and the weighting which best model the evaluation for German. (The gold standard that is used in this test is the set of analyses mentioned above.) The investigations are not yet completed, but preliminary results on the basis of the morphosyntactic and semantic properties of the neighboring elements are promising. After consolidation, the findings will be transferred to Spanish on the basis of the selection procedure illustrated in Fig. 3. The next step of grammar training in the immediate future will consist of changing the focus to underspecified analyses as described in step 8.

5 Conclusions

The project tries to make state-of-the-art statistical methods available for dictionary development and grammar development in a rule-based-dominated industrial setting, and to exploit such methods there.

With regard to SMT dictionary creation, it goes beyond the current state of the art as it also aims at developing and applying algorithms for the semi-automatic generation of bilingual dictionaries from unrelated monolingual (i.e., comparable) corpora of the source and the target language, instead of using relatively literally translated (i.e., parallel) texts only. Comparable corpora are far easier to obtain than parallel corpora; therefore the approach offers a solution to the serious data acquisition bottleneck in SMT. This approach is also more cognitively plausible than previous suggestions on this topic, since human bilinguality is normally not based on memorizing parallel texts. Our suggestion models the human capacity to translate texts using linguistic knowledge acquired from monolingual data, so it also exemplifies many more features of a truly self-learning MT system (shared also by a human translator).

In addition, the proposal suggests a new method for spelling out grammars and parsers for languages by splitting grammars into declarative kernels and trainable decision algorithms, and by exploiting cross-linguistic knowledge for optimizing the results of the corresponding parsers.

For developing the different components and dictionaries of the system, a bootstrapping architecture is suggested that uses the acquired lexical information for training the grammar of the new language, which in turn uses the (underspecified) parser results for optimizing the lexical information in the corresponding translation dictionaries. We expect that the suggested methods will significantly improve translation quality and reduce the costs of creating new language pairs for Machine Translation. The preliminary results obtained so far in the project appear promising.
6 Acknowledgments

This research is supported by a Marie Curie IAPP project taking place within the 7th European Community Framework Programme (Grant agreement no.: 251534).

7 References

Armstrong, S.; Kempen, M.; McKelvie, D.; Petitpierre, D.; Rapp, R.; Thompson, H. (1998). Multilingual Corpora for Cooperation. Proceedings of the 1st International Conference on Linguistic Resources and Evaluation (LREC), Granada, Vol. 2, 975–980.

Babych, B.; Hartley, A.; Sharoff, S.; Mudraya, O. (2007). Assisting Translators in Indirect Lexical Transfer. Proceedings of the 45th Annual Meeting of the ACL.

Babych, B.; Hartley, A.; Sharoff, S. (2007b). Translating from under-resourced languages: comparing direct transfer against pivot translation. Proceedings of MT Summit XI, 10-14 September 2007, Copenhagen, Denmark, 29-35.

Babych, B.; Hartley, A. (2008). Automated MT Evaluation for Error Analysis: Automatic Discovery of Potential Translation Errors for Multiword Expressions. ELRA Workshop on Evaluation "Looking into the Future of Evaluation: When automatic metrics meet task-based and performance-based approaches", Marrakech, Morocco, 27 May 2008. Proceedings of LREC'08.

Babych, B.; Hartley, A. (2009). Automated error analysis for multiword expressions: using BLEU-type scores for automatic discovery of potential translation errors. Linguistica Antverpiensia, New Series (8/2009): Journal of translation and interpreting studies. Special Issue on Evaluation of Translation Technology.

Babych, B.; Babych, S.; Eberle, K. (2012). Deriving generation-oriented MT resources from corpora: case study and evaluation of de/het classification for Dutch nouns. (In preparation.)

Baroni, M.; Bernardini, S. (2004). BootCaT: Bootstrapping corpora and terms from the web. Proceedings of LREC 2004.

Callison-Burch, C.; Osborne, M.; Koehn, P. (2006). Re-evaluating the role of BLEU in machine translation research. EACL 2006: 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy, April 3-7, 2006, 249-256.

Charniak, E.; Knight, K.; Yamada, K. (2003). Syntax-based language models for statistical machine translation. Proceedings of MT Summit IX.

Eberle, K. (2001). FUDR-based MT, head switching and the lexicon. Proceedings of the Eighth Machine Translation Summit, Santiago de Compostela.

Eberle, K. (2004). Flat underspecified representation and its meaning for a fragment of German. Habilitationsschrift, Universität Stuttgart.

Eberle, K.; Rapp, R. (2008). Rapid Construction of Explicative Dictionaries Using Hybrid Machine Translation. In: Storrer, A.; Geyken, A.; Siebert, A.; Würzner, K.-M. (eds.) Text Resources and Lexical Knowledge: Selected Papers from the 9th Conference on Natural Language Processing KONVENS 2008. Berlin: Mouton de Gruyter.

Eberle, K.; Eckart, K.; Heid, U.; Haselbach, B. (2012). A tool/database interface for multi-level analyses. Proceedings of LREC-2012, Istanbul.

Eckart, K.; Eberle, K.; Heid, U. (2010). An infrastructure for more reliable corpus analysis. Proceedings of the Workshop on Web Services and Processing Pipelines in HLT of LREC-2010, Valetta.

Frederking, R.; Nirenburg, S.; Farwell, D.; Helmreich, S.; Hovy, E.; Knight, K.; Beale, S.; Domashnev, C.; Attardo, D.; Grannes, D.; Brown, R. (1994). Integrated Translation from Multiple Sources within the Pangloss MARK II Machine Translation System. Proceedings of Machine Translation of the Americas, 73–80.

Frederking, R.; Nirenburg, S. (1994). Three heads are better than one. In: Proceedings of ANLP 94, Stuttgart, Germany.

Fung, P.; McKeown, K. (1997). Finding terminology translations from non-parallel corpora. Proceedings of the 5th Annual Workshop on Very Large Corpora, Hong Kong, August 1997, 192-202.

Gale, W. A.; Church, K. W. (1993). A program for aligning sentences in bilingual corpora. Computational Linguistics, 19(1), 75–102.

González, J.; Lagarda, A. L.; Navarro, J. R.; Eliodoro, L.; Giménez, A.; Casacuberta, F.; de Val, J. M.; Fabregat, F. (2004). SisHiTra: A Spanish-to-Catalan hybrid machine translation system. Berlin: Springer LNCS.

Gough, N.; Way, A. (2004). Example-Based Controlled Translation. Proceedings of the Ninth Workshop of the European Association for Machine Translation, Valetta, Malta.

Groves, D.; Way, A. (2006a). Hybrid data-driven models of machine translation. Machine Translation, 19(3–4). Special Issue on Example-Based Machine Translation, 301–323.

Groves, D.; Way, A. (2006b). Hybridity in MT: Experiments on the Europarl Corpus. In: Proceedings of the 11th Conference of the European Association for Machine Translation, Oslo, Norway, 115–124.

Habash, N.; Dorr, B. (2002). Handling translation divergences: Combining statistical and symbolic techniques in generation-heavy machine translation. Proceedings of AMTA-2002, Tiburon, California, USA.

Kiss, T.; Strunk, J. (2006). Unsupervised multilingual sentence boundary detection. Computational Linguistics, 32(4), 485–525.

Koehn, P. (2005). Europarl: A Parallel Corpus for Statistical Machine Translation. Proceedings of MT Summit X, Phuket, Thailand.

Koehn, P.; Knight, K. (2002). Learning a translation lexicon from monolingual corpora. In: Proceedings of the ACL-02 Workshop on Unsupervised Lexical Acquisition, Philadelphia, PA.

Language Industry Monitor (1992). Statistical methods gaining ground. In: Language Industry Monitor, September/October 1992 issue.

McCord, M. (1989). A new version of the machine translation system LMT. Journal of Literary and Linguistic Computing, 4, 218–299.

McCord, M. (1991). The slot grammar system. In: Wedekind, J.; Rohrer, C. (eds.): Unification in Grammar. MIT Press.

Melamed, I. D. (1999). Bitext maps and alignment via pattern recognition. Computational Linguistics, 25(1), 107–130.

Munteanu, D. S.; Marcu, D. (2005). Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4), 477–504.

Och, F. J.; Ney, H. (2002). Discriminative training and maximum entropy models for statistical machine translation. Proceedings of the Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, 295–302.

Och, F. J.; Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19–51.

Papineni, K.; Roukos, S.; Ward, T.; Zhu, W. (2002). BLEU: A method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the ACL, Philadelphia, PA, 311–318.

Rapp, R. (1995). Identifying word translations in non-parallel texts. In: Proceedings of the 33rd Meeting of the Association for Computational Linguistics, Cambridge, MA, 320–322.

Rapp, R. (1999). Automatic identification of word translations from unrelated English and German corpora. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, College Park, Maryland, 519–526.

Rapp, R. (2004). A freely available automatically generated thesaurus of related words. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC), Lisbon, Vol. II, 395–398.

Rapp, R.; Martín-Vide, C. (2007). Statistical machine translation without parallel corpora. In: Rehm, G.; Witt, A.; Lemnitzer, L. (eds.): Data Structures for Linguistic Resources and Applications. Proceedings of the Biennial GLDV Conference 2007. Tübingen: Gunter Narr, 231–240.

Resnik, P. (1999). Mining the web for bilingual text. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics.

Sato, S.; Nagao, M. (1990). Toward memory-based translation. Proceedings of COLING 1990, 247–252.

Schiehlen, M. (2001). Syntactic Underspecification. In: Special Research Area 340 – Final report, University of Stuttgart.

Sharoff, S. (2006). Open-source corpora: using the net to fish for linguistic data. International Journal of Corpus Linguistics, 11(4), 435–462.

Sharoff, S. (2006). A uniform interface to large-scale linguistic resources. In: Proceedings of the Fifth Language Resources and Evaluation Conference, LREC 2006, Genoa.

Sharoff, S.; Babych, B.; Hartley, A. (2006). Using comparable corpora to solve problems difficult for human translators. In: Proceedings of COLING/ACL 2006, 739–746.

Simard, M.; Foster, G.; Isabelle, P. (1992). Using Cognates to Align Sentences in Bilingual Corpora. Proceedings of the International Conference on Theoretical and Methodological Issues, Montréal.

Smadja, F. (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19(1), 143–177.

Streiter, O.; Carl, M.; Haller, J. (eds.) (1999). Hybrid Approaches to Machine Translation. IAI working papers 36.

Streiter, O.; Carl, M.; Iomdin, L. L. (2000). A Virtual Translation Machine for Hybrid Machine Translation. In: Proceedings of the Dialogue'2000 International Seminar in Computational Linguistics and Applications, Tarusa, Russia.

Streiter, O.; Iomdin, L. L. (2000). Learning Lessons from Bilingual Corpora: Benefits for Machine Translation. International Journal of Corpus Linguistics, 5(2), 199–230.

Thurmair, G. (2005). Hybrid architectures for machine translation systems. Language Resources and Evaluation, 39(1), 91–108.

Thurmair, G. (2006). Using corpus information to improve MT quality. Proceedings of the LR4Trans-III Workshop, LREC, Genova.

Thurmair, G. (2007). Automatic evaluation in MT system production. MT Summit XI Workshop: Automatic procedures in MT evaluation, 11 September 2007, Copenhagen, Denmark.

Véronis, J. (2006). Technologies du Langage. Actualités – Commentaires – Réflexions. Translation: Systran or Reverso? http://aixtal.blogspot.com/2006/01/translation-systran-or-reverso.html

Wu, D.; Fung, P. (2005). Inversion transduction grammar constraints for mining parallel sentences from quasi-comparable corpora. Second International Joint Conference on Natural Language Processing (IJCNLP-2005), Jeju, Korea.

Can Machine Learning Algorithms Improve Phrase Selection in Hybrid Machine Translation?

Christian Federmann
Language Technology Lab
German Research Center for Artificial Intelligence
Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Germany
[email protected]

Abstract

We describe a substitution-based, hybrid machine translation (MT) system that has been extended with a machine learning component controlling its phrase selection. Our approach is based on a rule-based MT (RBMT) system which creates template translations. Based on the generation parse tree of the RBMT system and standard word alignment computation, we identify potential "translation snippets" from one or more translation engines which could be substituted into our translation templates. The substitution process is controlled by a binary classifier trained on feature vectors from the different MT engines. Using a set of manually annotated training data, we are able to observe improvements in terms of BLEU scores over a baseline version of the hybrid system.

1 Introduction

In recent years, the overall quality of machine translation output has improved greatly. Still, each technological paradigm seems to suffer from its own particular kinds of errors: statistical MT (SMT) engines often show poor syntax, while rule-based MT systems suffer from missing data in their vocabularies. Hybrid approaches try to overcome these typical errors by combining techniques from both (or even more) paradigms in an optimal manner.

In this paper we report on experiments with an extended version of the hybrid system we develop in our group (Federmann and Hunsicker, 2011; Federmann et al., 2010). We take the output from an RBMT engine as "translation template" for our hybrid translations and substitute noun phrases [1] by translations from one or several MT engines [2]. Even though a general increase in quality could be observed in previous work, our system introduced errors of its own during the substitution process. In an internal error analysis, these degradations could be classified in the following way:

- external translations were incorrect;
- the structure degraded through substitution;
- phrase substitution failed.

Errors of the first class cannot be corrected, as we do not have an easy way of knowing when the translation obtained from an external MT engine is incorrect. The other classes could, however, be eliminated by introducing additional steps for pre- and post-processing, as well as by improving the hybrid substitution algorithm itself. So far, our algorithm relied on many hand-crafted decision factors; in order to improve translation quality and processing speed, we decided to apply machine learning methods to our training data to train a linear classifier which could be used instead.

This paper is structured in the following way. After having introduced the topics of our work in Section 1, we give a description of our hybrid MT system architecture in Section 2. Afterwards we describe in detail the various decision factors we have defined and how these could be used in feature vectors for machine learning methods in Section 3. Our experiments with the classifier-based, hybrid MT system are reported in Section 4. We conclude by giving a summary of our work and then provide an outlook to related future work in Section 5.

[1] We are focusing on noun phrases for the moment as these worked best in previous experiments with substitution-based MT; likely because they usually form consecutive spans in the translation output.
[2] While this could be SMT systems only, our approach supports engines from all MT paradigms. If not all features inside our feature vectors can be filled using the output of some system X, we use defaults as fallback values.


2 Architecture

Our hybrid machine translation system combines translation output from:

a) the Lucy RBMT system, described in more detail in (Alonso and Thurmair, 2003), and

b) one or several other MT systems, e.g. Moses (Koehn et al., 2007), or Joshua (Li et al., 2009).

The rule-based component of our hybrid system is described in more detail in section 2.2, while we provide more detailed information on the "other" systems in section 2.3.

2.1 Basic Approach

We first identify noun phrases inside the rule-based translation and compute the most probable correspondences in the translation output from the other systems. For the resulting phrases, we apply a factored substitution method that decides whether the original RBMT phrase should be kept or rather be replaced by one of the candidate phrases. As this shallow substitution process may introduce errors at phrase boundaries, we perform several post-processing steps that clean up and finalise the hybrid translation result. A schematic overview of our hybrid system and its main components is given in figure 1.

[Figure 1: Schematic overview of the architecture of our substitution-based, hybrid MT system.]

2.2 Rule-Based Translation Templates

We obtain the "translation template" as well as any linguistic structures from the RBMT system. Previous work with these structures had shown that they are usually of a high quality, supporting our initial decision to consider the RBMT output as template for our hybrid translation approach. The Lucy translation output can include markup that allows the identification of unknown words or other phenomena.

The Lucy system is a transfer-based RBMT system that performs translation in three phases, namely analysis, transfer, and generation. Tree structures for each of the translation phases can be extracted from the Lucy system to guide the hybrid system. Only the 1-best path through the three phases is given, so no alternative translation possibilities can be extracted from the given data; a fact that clearly limits the potential for more deeply integrated hybrid translation approaches. Nonetheless, the availability of these 1-best trees already allowed us to improve the translation quality of the RBMT system, as we had shown in previous work.

2.3 Substitution Candidate Translations

We use state-of-the-art SMT systems to create statistical, phrase-based translations of our input text, together with the bidirectional word alignments between the source texts and the translations. Again, we make use of markup which helps to identify unknown words, as this will later be useful in the factored substitution method.

Translation models for our SMT systems were trained with lower-cased and tokenised Europarl (Koehn, 2005) training data. We used the LDC Gigaword corpus to train large-scale language models and tokenised the source texts using the tokenisers available from the WMT shared task website [3]. All translations are re-cased before they are sent to the hybrid system together with the word alignment information.

[3] Available at http://www.statmt.org/wmt12/

The hybrid MT system can easily be adapted to support other translation engines. If there is no alignment information available directly, a word alignment tool is needed, as the alignment is a key requirement for the hybrid system. For part-of-speech tagging and lemmatisation we used the TreeTagger (Schmid, 1994).

2.4 Aligning RBMT and SMT Output

We compute alignment in several components of the hybrid system, namely:

source-text-to-tree: we first find an alignment between the source text and the corresponding analysis tree. As Lucy tends to subdivide large sentences into several smaller units, it sometimes becomes necessary to align more than one tree structure to a source sentence.

analysis-transfer-generation: for each of the analysis trees, we re-construct the path from its tree nodes, via the transfer tree, to the corresponding generation tree nodes.

tree-to-target-text: similarly to the first alignment process, we find a connection between generation tree nodes and the corresponding translation output of the RBMT system.

source-text-to-tokenised: as the Lucy RBMT system works on non-tokenised input text and our SMT systems take tokenised input, we need to align the original source text with its tokenised form.

Given the aforementioned alignments, we can then correlate phrases from the rule-based translation with their counterparts from the statistical translations, both on the source and the target side. As our hybrid approach relies on the identification of such phrase pairs, the computation of the different alignments is critical to achieve a good system combination quality.

All tree-based alignments can be computed with a very high accuracy. However, due to the nature of statistical word alignment, the same does not hold for the alignment obtained from the SMT systems. If the alignment process produces erroneous phrase tables, it is very likely that Lucy phrases and their "aligned" SMT matches simply do not fit the "open slot" inside the translation template. Or, put the other way round: the better the underlying SMT word alignment, the greater the potential of the hybrid substitution approach.

2.5 Factored Substitution

Given the results of the alignment process, we can then identify "interesting" phrases for substitution. Following our experimental setup from the WMT10 shared task, we again decided to focus on noun phrases as these seem to be best-suited for in-place swapping of phrases.

To avoid errors or problems with non-matching insertions, we want to keep some control over the substitution process. As the substitution process proved to be a very difficult task during previous experiments with the hybrid system, we decided to use machine learning methods instead. For this, we refined our previously defined set of decision factors into values v ∈ R, which allows us to combine them in feature vectors x_i = (v_1, ..., v_p). We describe the integration of the linear classifier in more detail in Section 3.

2.6 Decision Factors

We used the following factors (a sketch of the resulting feature vector is given after this list):

1. frequency: frequency of a given candidate phrase compared to the total number of candidates for the current phrase;

2. LM(phrase): language model (LM) score of the phrase;

3. LM(phrase)+1: phrase with right-context;

4. LM(phrase)-1: phrase with left-context;

5. Part-of-speech match?: checks if the part-of-speech tags of the left/right context match the current candidate phrase's context;

6. LM(pos): LM score for part-of-speech (PoS);

7. LM(pos)+1: PoS with right-context;

8. LM(pos)-1: PoS with left-context;

9. Lemma: checks if the lemma of the candidate phrase fits the reference;

10. LM(lemma): LM score for the lemma;
11. LM(lemma)+1: lemma with right-context;

115 2.7 Post-processing Steps Algorithm 1 Decoding using linear classifier After the hybrid translation has been computed, 1: good candidates ← [] 2: for all substitution candidates C do we perform several post-processing steps to clean i 3: if CLASSIFY(Ci) == “good” then up and finalise the result: 4: good candidates ← Ci 5: end if cleanup first, we perform some basic cleanup 6: end for such as whitespace normalisation; 7: Cbest ← SELECT-BEST(good candidates) 8: SUBSTITUTE-IN(Cbest) multi-words then, we take care of multi-word expressions. Using the tree structures from the RBMT system we remove superfluous We first collect all “good” translations using the whitespace and join multi-words, even if CLASSIFY() operation, then choose the “best” they were separated in the substituted phrase; candidate for substitution with SELECT-BEST(), and finally integrate the resulting candidate prepositions finally, prepositions are checked as phrase into the generated translation using experience from previous work had shown SUBSTITUTE-IN(). SELECT-BEST() could that these contributed to a large extent to the use system-specific confidences obtained during amount of avoidable errors. the tuning phase of our hybrid system. We are still experimenting on its exact definition. 3 Machine Learning-based Selection 4 Experiments Instead of using hand-crafted decision rules in the substitution process, we aim to train a classifier on In order to obtain initial experimental results, we a set of annotated training examples which may be created a decision-tree-based variant of our hy- better able to extract useful information from the brid MT system. We implemented a decision tree various decision factors. learning module following the CART algorithm (Breiman et al., 1984). We opted for this solution 3.1 Formal Representation as decision trees represent a straightforward first Our training set D can be represented formally as step when it comes to integrating machine learn- ing into our hybrid system. p n D = {(xi, yi)|xi ∈ R , yi ∈ {−1, 1}}i=1 (1) 4.1 Generating Training Data where each xi represents the feature vector for For this, we first created an annotated data set. In sentence i while the yi value contains the anno- a nutshell, we computed feature vectors and po- tated class information. We use a binary classifi- tential substitution candidates for all noun phrases cation scheme, simply defining 1 as “good” and in our training data4 and then collected data from −1 as “bad” translations. In order to make use of human annotators which of the substitution candi- machine learning methods such as decision trees dates were “good” translations and which should (Breiman et al., 1984), SVMs (Vapnik, 1995), or rather be considered “bad” examples. We used the Perceptron (Rosenblatt, 1958) algorithm, we Appraise (Federmann, 2010) for the annotation, have to prepare our training set with a sufficiently and collected 24,996 labeled training instances large number of annotated training instances. We with the help of six human annotators. Table 1 give further details on the creation of an annotated gives an overview of the data sets characteristics. training set in section 4.1.

3.2 Creating Hybrid Translations Translation Candidates Using suitable training data, we can train a binary Total “good” “bad” classifier (using either a decision tree, an SVM, or Count 24,996 10,666 14,330 the Perceptron algorithm) that can be used in our hybrid combination algorithm. Table 1: Training data set characteristics The pseudo-code in Algorithm 1 illustrates how such a classifier can be used in our hybrid MT 4We used the WMT12 “newstest2011” development set decoder. as training data for the annotation task.

116 Hybrid Systems Baseline Systems

Baseline +Decision Tree Lucy Linguatec Moses Joshua

BLEU 13.9 14.2 14.0 14.7 14.6 15.9 BLEU-cased 13.5 13.8 13.7 14.2 13.5 14.9 TER 0.776 0.773 0.774 0.775 0.772 0.774

Table 2: Experimental results comparing baseline hybrid system using hand-crafted decision rules to a decision- tree-based variant; both applied to the WMT12 “newstest2012” test set data for language pair English→German.

4.2 Experimental Results 5 Conclusion and Outlook Using the annotated data set, we then trained a In this paper, we reported on experiments aiming decision tree and integrated it into our hybrid sys- to improve the phrase selection component of a tem. To evaluate translation quality, we created hybrid MT system using machine learning. We translations of the WMT12 “newstest2012” test described the architecture of our hybrid machine set, for the language pair English→German, with translation system and its main components. a) a baseline hybrid system using hand-crafted de- We explained how to train a decision tree based cision rules and b) an extended version of our hy- on feature vectors that emulate previously used, brid system using the decision tree. hand-crafted decision factors. To obtain training Both hybrid systems relied on a Lucy trans- data for the classifier, we manually annotated a lation template and were given additional trans- set of 24,996 feature vectors and compared the lation candidates from another rule-based sys- decision-tree-based, hybrid system to a baseline tem (Aleksic and Thurmair, 2011), a statistical version. We observed improved BLEU scores system based on the Moses decoder, and a sta- for the language pair English→German on the tistical system based on Joshua. If more than one WMT12 “newstest2012” test set. “good” translation was found, we used the hand- Future work will include experiments with crafted rules to determine the single, winning other machine learning classifiers such as SVMs. translation candidate (implementing SELECT- It will also be interesting to investigate what other BEST in the simplest, possible way). features can be useful for training. Also, we Table 2 shows results for our two hybrid sys- intend to experiment with heterogeneous feature tem variants as well as for the individual base- sets for the different source systems (resulting in line systems. We report results from automatic large but sparse feature vectors), adding system- BLEU (Papineni et al., 2001) scoring and also specific annotations from the various systems and from its case-sensitive variant, BLEU-cased. will investigate their performance in the context of hybrid MT systems. 4.3 Discussion of Results We can observe improvements in both BLEU Acknowledgments and BLEU-cased scores when comparing the This work has been funded under the Seventh decision-tree-based hybrid system to the baseline Framework Programme for Research and Tech- version relying on hand-crafted decision rules. nological Development of the European Commis- This shows that the extension of the hybrid sys- sion through the T4ME contract (grant agreement tem with a learnt classifier can result in improved no.: 249119). The author would like to thank translation quality. Sabine Hunsicker and Yu Chen for their support in On the other hand, it is also obvious, that the creating the WMT12 translations, and is indebted improved hybrid system was not able to outper- to Herve´ Saint-Amand for providing help with the form the scores of some of the individual base- automated metrics scores. Also, we are grateful to line systems; there is additional research required the anonymous reviewers for their valuable feed- to investigate in more detail how the hybrid ap- back and comments. proach can be improved further.

References

Vera Aleksic and Gregor Thurmair. 2011. Personal Translator at WMT2011. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 303–308, Edinburgh, Scotland, July. Association for Computational Linguistics.

Juan A. Alonso and Gregor Thurmair. 2003. The Comprendium Translator system. In Proceedings of the Ninth Machine Translation Summit, New Orleans, USA.

L. Breiman, J. Friedman, R. Olshen, and C. Stone. 1984. Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA.

Christian Federmann and Sabine Hunsicker. 2011. Stochastic parse tree selection for an existing RBMT system. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 351–357, Edinburgh, Scotland, July. Association for Computational Linguistics.

Christian Federmann, Andreas Eisele, Yu Chen, Sabine Hunsicker, Jia Xu, and Hans Uszkoreit. 2010. Further experiments with shallow hybrid MT systems. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 77–81, Uppsala, Sweden, July. Association for Computational Linguistics.

Christian Federmann. 2010. Appraise: An open-source toolkit for manual phrase-based evaluation of translations. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias, editors, Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta, May. European Language Resources Association (ELRA).

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the ACL Demo and Poster Sessions, pages 177–180, June.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the MT Summit 2005.

Zhifei Li, Chris Callison-Burch, Chris Dyer, Sanjeev Khudanpur, Lane Schwartz, Wren Thornton, Jonathan Weese, and Omar Zaidan. 2009. Joshua: An open source toolkit for parsing-based machine translation. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 135–139, Athens, Greece, March. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2001. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176(W0109-022), IBM.

F. Rosenblatt. 1958. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408.

Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK.

Toby Segaran. 2007. Programming Collective Intelligence: Building Smart Web 2.0 Applications. O'Reilly, Beijing.

V. N. Vapnik. 1995. The Nature of Statistical Learning Theory. Springer, New York.

Linguistically-Augmented Bulgarian-to-English Statistical Machine Translation Model

Rui Wang
Language Technology Lab, DFKI GmbH
Saarbrücken, Germany
[email protected]

Petya Osenova and Kiril Simov
Linguistic Modelling Department, IICT
Bulgarian Academy of Sciences
Sofia, Bulgaria
{petya,kivs}@bultreebank.org

Abstract

In this paper, we present our linguistically-augmented statistical machine translation model from Bulgarian to English, which combines a statistical machine translation (SMT) system (as backbone) with deep linguistic features (as factors). The motivation is to take advantage of the robustness of the SMT system and of the linguistic knowledge from morphological analysis and the hand-crafted grammar, through a system combination approach. The preliminary evaluation has shown very promising results in terms of BLEU scores (38.85), and the manual analysis also confirms the high quality of the translations the system delivers.

1 Introduction

In recent years, machine translation (MT) has achieved significant improvement in terms of translation quality (Koehn, 2010). Both data-driven approaches (e.g., statistical MT (SMT)) and knowledge-based approaches (e.g., rule-based MT (RBMT)) have achieved comparable results in the evaluation campaigns (Callison-Burch et al., 2011). However, according to human evaluation, the final outputs of MT systems are still far from satisfactory.

Fortunately, recent error analysis shows that the two trends of MT approaches tend to be complementary to each other in terms of the types of errors they make (Thurmair, 2005; Chen et al., 2009). Roughly speaking, RBMT systems often have missing lexicon entries and thus lack robustness, while handling linguistic phenomena that require syntactic information better. SMT systems, on the contrary, are in general more robust, but sometimes output ungrammatical sentences.

In fact, instead of competing with each other, there is also a line of research trying to combine the advantages of the two sides in a hybrid framework. Although many systems can be put under the umbrella of “hybrid” systems, there are various ways to do the combination/integration. Thurmair (2009) summarized several different architectures of hybrid systems using SMT and RBMT systems. Some widely used ones are: 1) using an SMT system to post-edit the outputs of an RBMT system; 2) selecting the best translations from several hypotheses coming from different SMT/RBMT systems; and 3) selecting the best segments (phrases or words) from different hypotheses.

For the language pair Bulgarian-English, there has not been much study, mainly due to the lack of resources, including corpora, preprocessors, etc. There was a system published by Koehn et al. (2009), which was trained and tested on European Union law data, but not on other domains like news. They reported a very high BLEU score (Papineni et al., 2002) for the Bulgarian-English translation direction (61.3), which inspired us to investigate this direction further.

In this paper, we focus on Bulgarian-to-English translation and mainly explore the approach of annotating the SMT baseline with linguistic features derived from the preprocessing and from hand-crafted grammars. There are three motivations behind our approach: 1) the SMT baseline trained on a decent amount of parallel corpora outputs surprisingly good results, in terms of both statistical evaluation metrics and preliminary manual evaluation; 2) the augmented model gives us more space for experimenting with different linguistic features without losing the ‘basic’ robustness; 3) the MT system can profit from continued advances in the development of the deep grammars, thereby opening up further integration possibilities.

The rest of the paper is organized as follows: Section 2 presents our work on cleaning the corpora and Section 3 briefly describes the preprocessing of the data. Section 4 introduces our factor-based SMT model, which allows us to incorporate various linguistic features into an SMT baseline; the features coming from MRS are described in detail in Section 5. We present our experiments in Section 6, together with both automatic and manual evaluation of the results. Section 7 briefly mentions some related work, and we summarize the paper in Section 8.

2 Data Preparation

In our experiments we use the SETIMES parallel corpus, which is part of the OPUS parallel corpus (an open-source parallel corpus, http://opus.lingfil.uu.se/). The data in the corpus was aligned automatically, so we first checked the consistency of the automatic alignments. It turned out that more than 25% of the sentence alignments were not correct. Since SETIMES appeared to be a noisy dataset, our effort was directed into cleaning it as much as possible before the start of the experiments. We first manually corrected more than 25,000 sentence alignments. The rest of the data set includes around 135,000 sentences; altogether the data set comprises about 160,000 sentences when the manually checked part is added. Then, two actions were taken:

1. Improving the tokenization of the Bulgarian part. The observations from the manual check of the set of 25,000 sentences showed systematic errors in the tokenized text. Hence, these cases have been detected and fixed semi-automatically.

2. Correcting and removing the suspicious alignments. Initially, the ratio of the lengths of the English and Bulgarian sentences was calculated on the set of the 25,000 manually annotated sentences. As a rule, the Bulgarian sentences are longer than the English ones; the ratio is 1.34. Then we calculated the ratio for each pair of sentences. After this, the optimal interval was manually determined, such that if the ratio for a given pair of sentences is within the interval, we assume that the pair is a good one. The interval for these experiments is set to [0.7; 1.8]. All the pairs with a ratio outside the interval have been deleted.

The resulting dataset comprises 151,718 sentence pairs for SETIMES. A similar cleaning approach was applied to another dataset from the OPUS corpus, EMEA; after the cleaning, 704,631 sentence pairs were selected from the EMEA dataset. Thus, the size of the original datasets was decreased by about 10%.

3 Linguistic Preprocessing

The data in the SETIMES dataset was analysed on the following levels:

• POS tagging. POS tagging is performed by a pipeline of several modules. First we apply an SVM POS tagger, which takes tokenised text as input and outputs tagged text; its performance is near 91% accuracy. The SVM POS tagger is implemented using SVMTool (Giménez and Màrquez, 2004). Then we apply a morphological lexicon and a set of rules: the lexicon adds all the possible tags for the known words, and the rules reduce the ambiguity for some of the sure cases. The result of this step is a tagged text with some ambiguities unresolved. The third step is the application of the GTagger (Georgiev et al., 2012), which is trained on ambiguous data and selects the most appropriate tags from the suggested ones. The accuracy of the whole pipeline is 97.83%. In this pipeline the SVM POS tagger plays the role of a guesser for the GTagger.

• Lemmatization. The lemmatization module is based on the same morphological lexicon. From the lexicon we extracted functions which convert each wordform into its basic form (as a representative of the lemma). The functions are defined via two operations on wordforms, remove and concatenate, and the rules have the following form:
if tag = Tag then {remove OldEnd; concatenate NewEnd}
where Tag is the tag of the wordform, OldEnd is the string which has to be removed from the end of the wordform, and NewEnd is the string which has to be concatenated to the end of the wordform in order to produce the lemma. These rules cover the wordforms in the lexicon; less than 2% of the wordforms are ambiguous in the lexicon (and these are very rare in real texts). Similar rules are defined for unknown words. The accuracy of the lemmatizer is 95.23%.

• Dependency parsing. We have trained the MALT Parser on the dependency version of the BulTreeBank. We did this work together with Svetoslav Marinov, who has experience in using the MALT Parser, and Johan Hall, who is involved in the development of MaltParser. The trained model achieves 85.6% labeled parsing accuracy. It is integrated in a language pipeline with the POS tagger and the lemmatizer.

After the application of the language pipeline, the result is represented in a table form following the CoNLL shared task format (http://ufal.mff.cuni.cz/conll2009-st/task-description.html).
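To make the remove/concatenate rule format concrete, here is a minimal sketch of such a suffix-rewriting lemmatizer; the rule table and the tag string are invented for illustration and do not come from the actual BulTreeBank lexicon:

```python
# Sketch of the suffix-rewriting lemmatization rules described above.
# Rule format: if tag == Tag then {remove OldEnd; concatenate NewEnd}.

RULES = {
    # (tag, old_ending) -> new_ending; hypothetical entries
    ("Vpitf-r3p", "yaha"): "ya",   # e.g. 'varvyaha' -> 'varvya'
    ("Ncmpi", "i"): "",
}

def lemmatize(wordform: str, tag: str) -> str:
    """Apply the first matching remove/concatenate rule; fall back to the wordform."""
    for (rule_tag, old_end), new_end in RULES.items():
        if tag == rule_tag and wordform.endswith(old_end):
            return wordform[: len(wordform) - len(old_end)] + new_end
    return wordform

print(lemmatize("varvyaha", "Vpitf-r3p"))  # -> 'varvya'
```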

4 Factor-based SMT Model

Our approach is built on top of the factor-based SMT model proposed by Koehn and Hoang (2007) as an extension of the traditional phrase-based SMT framework. Instead of using only the word form of the text, it allows the system to take a vector of factors to represent each token, both for the source and target languages. The vector of factors can be used for different levels of linguistic annotation, like lemma, part-of-speech (POS), or other linguistic features. Furthermore, this extension actually allows us to incorporate various kinds of features, as long as they can be (somehow) represented as annotations of the tokens.

The process is quite similar to supertagging (Bangalore and Joshi, 1999), which assigns “rich constraints in a local context”. In our case, all the linguistic features (factors) associated with a token form a supertag for that token. Singh and Bandyopadhyay (2010) had a similar idea of incorporating linguistic features, while they worked on Manipuri-English bidirectional translation. Our approach is slightly different from (Birch et al., 2007) and (Hassan et al., 2007), who mainly used the supertags on the target language side, English; we primarily experiment with the source language side, Bulgarian. This potentially huge feature space provides us with various possibilities of using the linguistic resources developed in and outside of our project.

In particular, we consider the following factors on the source language side (Bulgarian):

• WF - the word form, i.e., the original text token.

• LEMMA - the lexical invariant of the original word form. We use the lemmatizer described in Section 3, which operates on the output of the POS tagging. Thus, the 3rd person, plural, imperfect tense verb form ‘varvyaha’ (‘walking-were’, They were walking) is lemmatized as the 1st person, present tense verb ‘varvya’.

• POS - the part-of-speech of the word. We use the positional POS tagset of the BulTreeBank, where the first letter of the tag indicates the POS itself, while the following letters refer to semantic and/or morphosyntactic features, such as: Dm - where ‘D’ stands for ‘adverb’ and ‘m’ stands for ‘modal’; Ncmsi - where ‘N’ stands for ‘noun’, ‘c’ means ‘common’, ‘m’ is ‘masculine’, ‘s’ is ‘singular’, and ‘i’ is ‘indefinite’.

• LING - other linguistic features derived from the POS tag in the BulTreeBank tagset (see above).

In addition to these, we can also incorporate the syntactic structure of the sentence by breaking down the tree into dependency relations. For instance, a dependency tree can be represented as a set of triples of the form ⟨relation, head, dependent⟩; for example, ⟨subj, loves, John⟩ and ⟨obj, loves, Mary⟩ represent the sentence “John loves Mary”. Consequently, three additional factors are included for both languages:

• DEPREL - the dependency relation between the current word and its parent node.

• HLEMMA - the lemma of the current word’s parent node.

• HPOS - the POS tag of the current word’s parent node.

Here is an example of a processed sentence. The sentence is “spored odita v elektricheskite kompanii politicite zloupotrebyavat s dyrzhavnite predpriyatiya.” The glosses for the words in the Bulgarian sentence are: spored (according) odita (audit-the) v (in) elektricheskite (electrical-the) kompanii (companies) politicite (politicians-the) zloupotrebyavat (abuse) s (with) dyrzhavnite (state-the) predpriyatiya (enterprises). The translation in the original source is: “electricity audits prove politicians abusing public companies.” The result of the linguistic processing and the addition of information about head elements are presented in the first seven columns of Table 1. We extend the grammatical features to have the same size. All the information is concatenated to the word forms in the text. In the next section we extend this example to incorporate the MRS analysis of the sentence.

5 MRS Supertagging

Our work on Minimal Recursion Semantics analysis of Bulgarian text is inspired by the work on MRS and RMRS (Robust Minimal Recursion Semantics) (see (Copestake, 2003) and (Copestake, 2007)) and by the previous work on transfer of dependency analyses into RMRS structures described in (Spreyer and Frank, 2005) and (Jakob et al., 2010). In this section we first present a short overview of MRS and RMRS; then we discuss the new features added on the basis of the RMRS structures.

MRS is introduced as an underspecified semantic formalism (Copestake et al., 2005). It is used to support semantic analyses in the English HPSG grammar ERG (Copestake and Flickinger, 2000), but also in other grammar formalisms like LFG. The main idea is that the formalism avoids spelling out the complete set of readings resulting from the interaction of scope-bearing operators and quantifiers, instead providing a single underspecified representation from which the complete set of readings can be constructed. Here we present only basic definitions from (Copestake et al., 2005); for more details the cited publication should be consulted. An MRS structure is a tuple ⟨GT, R, C⟩, where GT is the top handle, R is a bag of EPs (elementary predications) and C is a bag of handle constraints, such that there is no handle h that outscopes GT. Each elementary predication contains exactly four components: (1) a handle which is the label of the EP; (2) a relation; (3) a list of zero or more ordinary variable arguments of the relation; and (4) a list of zero or more handles corresponding to scopal arguments of the relation (i.e., holes).

RMRS is introduced as a modification of MRS which captures the semantics resulting from shallow analysis. Here the following assumption is taken into account: the shallow processor does not have access to a lexicon, and thus does not have access to the arity of the relations in EPs. Therefore, the representation has to be underspecified with respect to the number of arguments of the relations. The names of the relations are constructed on the basis of the lemma of each wordform in the text, and the main argument of the relation is specified. This main argument can be of two types: a referential index for nouns and an event for the other parts of speech.

Because in this work we use only the RMRS relation and the type of the main argument as features for the translation model, we skip the explanation of the full structure of RMRS structures and how they are constructed. We first match the surface tokens to the MRS elementary predications (EPs) and then extract the following features as extra factors:

• EP - the name of the elementary predication, which usually indicates an event or an entity semantically.

• EOV - indicates whether the current EP is an event or a reference variable.

Notice that we do not take all the information provided by the MRS; e.g., we throw away the scopal information and the other arguments of the relations. This kind of information is not straightforward to represent in such ‘tagging’-style models, and will be tackled in the future. This information for the example sentence is represented for each word form in the last two columns of Table 1.

WF              Lemma            POSex  Ling   DepRel    HLemma           HPOS  EP                 EoV
spored          spored           R      -      adjunct   zloupotrebyavam  Vp    spored_r           e
odita           odit             Nc     npd    prepcomp  spored           R     odit_n             v
v               v                R      -      mod       odit             Nc    v_r                e
elektricheskite elektricheski    A      pd     mod       kompaniya        Nc    elektricheski_a    e
kompanii        kompaniya        Nc     fpi    prepcomp  v                R     kompaniya_n        v
politicite      politik          Nc     mpd    subj      zloupotrebyavam  Vp    politik_n          v
zloupotrebyavat zloupotrebyavam  Vp     tir3p  root      -                -     zloupotrebyavam_v  e
s               s                R      -      indobj    zloupotrebyavam  Vp    s_r                e
dyrzhavnite     dyrzhaven        A      pd     mod       predpriyatie     Nc    dyrzhaven_a        e
predpriyatiya   predpriyatie     Nc     npi    prepcomp  s                R     predpriyatie_n     v

Table 1: The sentence analysis with added head information — HLemma and HPOS.
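To make the factored representation concrete, the following sketch (not the authors' code) joins per-token annotations such as those in Table 1 into Moses-style factored tokens, where factors are separated by ‘|’; the factor order chosen here is our own:

```python
# Sketch: turning Table 1-style annotations into factored tokens (WF|LEMMA|POS|EP|EOV).
# The rows below copy the first two entries of Table 1.

rows = [
    # (WF, LEMMA, POS, EP, EOV)
    ("spored", "spored", "R", "spored_r", "e"),
    ("odita", "odit", "Nc", "odit_n", "v"),
]

def to_factored(tokens):
    """Concatenate the factors of each token with '|', as Moses expects."""
    return " ".join("|".join(factors) for factors in tokens)

print(to_factored(rows))
# spored|spored|R|spored_r|e odita|odit|Nc|odit_n|v
```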

All these factors encoded within the corpus provide us with a rich selection of factors for different experiments; some of them are presented in the next section. The model of encoding MRS information in the corpus as additional features does not depend on the actual semantic analysis — MRS or RMRS — because both provide enough semantic information.

6 Experiments

6.1 Experiments with the Bulgarian raw corpus

To run the experiments, we use the phrase-based translation model provided by the open-source statistical machine translation system Moses (http://www.statmt.org/moses/) (Koehn et al., 2007). For training the translation model, the parallel corpora (mentioned in Section 2) were preprocessed with the tokenizer and lowercase converter provided by Moses. The procedure is then quite standard:

• We run GIZA++ (Och and Ney, 2003) for bidirectional word alignment, and then obtain the lexical translation table and the phrase table.

• A tri-gram language model is estimated using the SRILM toolkit (Stolcke, 2002).

• Minimum error rate training (MERT) (Och, 2003) is applied to tune the set of feature weights so as to maximize the official evaluation metric on the development set.

For the rest of the parameters, we use the default settings provided by Moses.

We split the corpora into a training set, a development set and a test set. For SETIMES, the split is 100,000/500/1,000 and for EMEA it is 700,000/500/1,000. For reference, we also run tests on the JRC-Acquis corpus (http://optima.jrc.it/Acquis/). The final results under the standard evaluation metrics are shown in Table 2 in terms of BLEU (Papineni et al., 2002):

Corpora                 Test   Dev    Final  Drop
SETIMES → SETIMES       34.69  37.82  36.49  /
EMEA → EMEA             51.75  54.77  51.62  /
SETIMES → EMEA          13.37  /      /      61.5%
SETIMES → JRC-Acquis    7.19   /      /      79.3%
EMEA → SETIMES          7.37   /      /      85.8%
EMEA → JRC-Acquis       9.21   /      /      82.2%

Table 2: Results of the baseline SMT system (Bulgarian-English)

As we mentioned before, the EMEA corpus mainly contains descriptions of medicine usage, and its format is quite fixed. Therefore, it is not surprising to see high performance on the in-domain test (2nd row in Table 2). SETIMES, consisting of news articles, is a less controlled setting, and the BLEU score is lower. (Actually, the BLEU score itself is higher than for most other language pairs, cf. http://matrix.statmt.org/; as the datasets are different, the results are not directly comparable, and achieving better performance for Bulgarian-to-English translation than for other language pairs is not the focus of this paper.) The results on the out-of-domain tests are in general much lower, with a drop of more than 60% in BLEU (last column). For the JRC-Acquis corpus, in contrast to the in-domain score given by Koehn et al. (2009) (61.3), the low out-of-domain results show a very similar situation as for EMEA. A brief manual check of the results indicates that the out-of-domain tests suffer severely from missing lexicon entries, while the in-domain test on the news articles contains more interesting issues to look into. The better translation quality also makes the system outputs human-readable.

6.2 Experiments with the Linguistically-Augmented Bulgarian Corpus

Having described the factor-based model in Section 4, we also perform experiments to test the effectiveness of different linguistic annotations. The configurations we considered are shown in the first column of Table 3. These models can be roughly grouped into five categories: word form with linguistic features; lemma with linguistic features; models with dependency features; MRS elementary predicates (EP) and the type of the main argument of the predicate (EOV); and MRS features without word forms. The setting of the system is mostly the same as in the previous experiment, except for 1) increasing the training data from 100,000 to 150,000 sentence pairs; 2) specifying the factors during training and decoding; and 3) not performing MERT (mainly due to the large amount of computation required; we will perform MERT on the better-performing configurations in the future). We perform this finer-grained comparison only on the SETIMES data, as its language is more diverse than that of the other two corpora. The results are shown in Table 3.

The first model serves as the baseline here. We show all the n-gram scores besides the final BLEU, since some of the differences are very small. In terms of the numbers, POS seems to be an effective factor, as Model 2 has the highest score. Model 3 indicates that linguistic features also improve the performance. Models 4-6 show the necessity of including the word form as one of the factors, in terms of BLEU scores. Model 10 shows a significant decrease after incorporating the HLEMMA feature. This may be due to data sparsity, as we are effectively aligning and translating bi-grams instead of tokens. It may also indicate that increasing the number of factors does not guarantee performance enhancement. After replacing HLEMMA with HPOS, the result is close to the others (Model 8). The experiments with features from the MRS analyses (Models 11-16) show consistent improvements over the baseline, and using only the MRS features (Models 17-18) also delivers decent results. In future experiments we will consider including more features from the MRS analyses.

So far, incorporating additional linguistic knowledge has not shown huge improvements in terms of statistical evaluation metrics. However, this does not mean that the translations delivered are the same. In order to fully evaluate the system, manual analysis is absolutely necessary. We are still far from drawing a conclusion at this point, but the preliminary scores already indicate that the system can deliver decent translation quality consistently.

6.3 Manual Evaluation

We manually validated the output of all the models mentioned in Table 3. The guideline covers two aspects of translation quality: Grammaticality and Content. Grammaticality can be evaluated solely on the system output, and Content by comparison with the reference translation. We use a 1-5 score for each aspect as follows:

Grammaticality
1. The translation is not understandable.
2. The evaluator can somehow guess the meaning, but cannot fully understand the whole text.
3. The translation is understandable, but with some effort.
4. The translation is quite fluent, with some minor mistakes or re-ordering of the words.
5. The translation is perfectly readable and grammatical.

Content
1. The translation is totally different from the reference.
2. About 20% of the content is translated, missing the major content/topic.
3. About 50% of the content is translated, with some missing parts.
4. About 80% of the content is translated, missing only minor things.
5. All the content is translated.

For missing lexicon entries or untranslated Cyrillic tokens, we ask the evaluators to score 2 if the output translation contains one Cyrillic token and 1 if it contains more than one.

ID  Model                              BLEU   1-gram  2-gram  3-gram  4-gram
1   WF                                 38.61  69.9    44.6    31.5    22.7
2   WF,POS                             38.85  69.9    44.8    31.7    23.0
3   WF,LEMMA,POS,LING                  38.84  69.9    44.7    31.7    23.0
4   LEMMA                              37.22  68.8    43.0    30.1    21.5
5   LEMMA,POS                          37.49  68.9    43.2    30.4    21.8
6   LEMMA,POS,LING                     38.70  69.7    44.6    31.6    22.8
7   WF,DEPREL                          36.87  68.4    42.8    29.9    21.1
8   WF,DEPREL,HPOS                     36.21  67.6    42.1    29.3    20.7
9   WF,LEMMA,POS,LING,DEPREL           36.97  68.2    42.9    30.0    21.3
10  WF,LEMMA,POS,LING,DEPREL,HLEMMA    29.57  60.8    34.9    23.0    15.7
11  WF,POS,EP                          38.74  69.8    44.6    31.6    22.9
12  WF,POS,LING,EP                     38.76  69.8    44.6    31.7    22.9
13  WF,EP,EOV                          38.74  69.8    44.6    31.6    22.9
14  WF,POS,EP,EOV                      38.74  69.8    44.6    31.6    22.9
15  WF,LING,EP,EOV                     38.76  69.8    44.6    31.7    22.9
16  WF,POS,LING,EP,EOV                 38.76  69.8    44.6    31.7    22.9
17  EP,EOV                             37.22  68.5    42.9    30.2    21.6
18  EP,EOV,LING                        38.38  69.3    44.2    31.3    22.7

Table 3: Results of the factor-based model (Bulgarian-English, SETIMES 150,000)
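As a sanity check on Table 3, BLEU can be recovered (up to rounding) as the geometric mean of the four n-gram precisions, assuming the brevity penalty is close to 1 (which the paper does not state explicitly); for Model 1:

```python
import math

# Recompute BLEU for Model 1 in Table 3 from its n-gram precisions,
# assuming a brevity penalty of ~1 (hypotheses about as long as references).
precisions = [0.699, 0.446, 0.315, 0.227]  # 1- to 4-gram scores for Model 1
bleu = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
print(round(100 * bleu, 2))  # ~38.6, close to the reported 38.61
```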

The results are shown in the following two tables, Table 4 and Table 5, respectively. The current results from the manual validation are based on 150 sentence pairs. The numbers shown in the tables are the numbers of sentences receiving the corresponding scores; the ‘Total’ column sums up the scores of all the output sentences for each model.

The results show that linguistic and semantic analyses definitely improve the quality of the translation. Exploiting the linguistic processing on the word level — LEMMA, POS and LING — produces the best result. However, the model with only EP and EOV features also delivers very good results, which indicates the effectiveness of the MRS features from the deep hand-crafted grammars. Including more factors (especially the information from the dependency parsing) drops the results because of the sparseness effect over the dataset, which is consistent with the automatic BLEU evaluation. The last two rows are shown for reference. ‘GOOGLE’ shows the results of using the online translation service provided by http://translate.google.com/; its high score (very close to the reference translation) may be because our test data are not excluded from its training data. In future we plan to do the same evaluation with a larger dataset.

The problem with the untranslated Cyrillic tokens could, in our view, be solved in most cases by providing additional lexical information from a Bulgarian-English lexicon. Thus, we also evaluated the possible impact of such a lexicon, had it been available. In order to do this, we substituted each copied Cyrillic token with its translation when there was only one possible translation. We did such substitutions for 189 sentence pairs and then evaluated the result by classifying the translations as acceptable or unacceptable. The number of acceptable translations is 140 in this case.

The manual evaluation of the translation models on a bigger scale is in progress; the current results are promising. Statistical evaluation metrics can give us a brief overview of system performance, but the actual translation quality is much more interesting to us, as in many cases different surface translations can convey exactly the same meaning in context.

ID  Model                              1    2   3   4   5   Total
1   WF                                 20   47  5   32  46  487
2   WF,POS                             20   48  5   37  40  479
3   WF,LEMMA,POS,LING                  20   47  6   34  43  483
4   LEMMA                              15   34  11  46  44  520
5   LEMMA,POS                          15   38  12  51  34  501
6   LEMMA,POS,LING                     20   48  5   34  43  482
7   WF,DEPREL                          32   48  3   29  38  443
8   WF,DEPREL,HPOS                     45   41  7   23  34  410
9   WF,LEMMA,POS,LING,DEPREL           34   47  5   30  34  433
10  WF,LEMMA,POS,LING,DEPREL,HLEMMA    101  32  0   8   9   242
11  WF,POS,EP                          19   49  4   34  44  485
12  WF,POS,LING,EP                     19   49  3   39  40  482
13  WF,EP,EOV                          20   49  2   41  38  478
14  WF,POS,EP,EOV                      19   50  3   31  47  487
15  WF,LING,EP,EOV                     19   48  5   37  41  483
16  WF,POS,LING,EP,EOV                 19   49  5   37  40  480
17  EP,EOV                             15   41  10  44  40  503
18  EP,EOV,LING                        20   49  7   38  36  471
19  GOOGLE                             0    2   20  52  76  652
20  REFERENCE                          0    0   5   51  94  689

Table 4: Manual evaluation of the grammaticality

ID  Model                              1    2   3   4   5    Total
1   WF                                 20   46  5   23  56   499
2   WF,POS                             20   48  5   24  53   492
3   WF,LEMMA,POS,LING                  20   47  1   24  58   503
4   LEMMA                              15   32  5   33  65   551
5   LEMMA,POS                          15   35  9   32  59   535
6   LEMMA,POS,LING                     20   48  5   22  55   494
7   WF,DEPREL                          32   49  4   14  51   453
8   WF,DEPREL,HPOS                     45   41  2   21  41   422
9   WF,LEMMA,POS,LING,DEPREL           34   48  3   20  45   444
10  WF,LEMMA,POS,LING,DEPREL,HLEMMA    101  32  0   6   11   244
11  WF,POS,EP                          19   49  3   20  59   501
12  WF,POS,LING,EP                     19   50  2   20  59   500
13  WF,EP,EOV                          19   50  4   16  61   500
14  WF,POS,EP,EOV                      19   50  2   23  56   497
15  WF,LING,EP,EOV                     19   48  4   18  61   504
16  WF,POS,LING,EP,EOV                 19   50  3   24  54   494
17  EP,EOV                             14   38  7   31  60   535
18  EP,EOV,LING                        19   49  7   20  55   493
19  GOOGLE                             1    0   9   42  98   686
20  REFERENCE                          1    0   5   37  107  699
Table 5: Manual evaluation of the content
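The ‘Total’ columns in Tables 4 and 5 are the score-weighted sums of the sentence counts; as a check, for model WF in Table 4:

```python
# Verify the 'Total' column of Table 4 for the WF model:
# counts of sentences receiving scores 1..5, summed as score * count.
counts = {1: 20, 2: 47, 3: 5, 4: 32, 5: 46}  # row 1 (WF) of Table 4
total = sum(score * n for score, n in counts.items())
print(total)  # 487, as reported
```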

7 Related Work

Our work is also enlightened by another line of research, transfer-based MT models, which are seemingly different but actually very close. In this section, before mentioning some previous work in this direction, we first introduce the background of the development of the deep HPSG grammars.

The MRSes are usually delivered together with the HPSG analyses of the text. There already exist quite extensively implemented formal HPSG grammars for English (Copestake and Flickinger, 2000), German (Müller and Kasper, 2000), and Japanese (Siegel, 2000; Siegel and Bender, 2002). HPSG is the underlying theory of the international initiative LinGO Grammar Matrix (Bender et al., 2002). At the moment, precise and linguistically motivated grammars, customized on the basis of the Grammar Matrix, have been or are being developed for Norwegian, French, Korean, Italian, Modern Greek, Spanish, Portuguese, Chinese, etc. There also exists a first version of the Bulgarian Resource Grammar, BURGER. In the research reported here, we use the linguistically modeled knowledge from the existing English and Bulgarian grammars. Since the Bulgarian grammar has limited coverage on news data, dependency parsing has been performed instead; then, mapping rules have been defined for the construction of RMRSes.

However, the MRS representation is still quite close to the syntactic level, and thus not fully language independent. This requires a transfer at the MRS level if we want to translate from the source language to the target language. The transfer is usually implemented in the form of rewriting rules. For instance, in the Norwegian LOGON project (Oepen et al., 2004), the transfer rules were hand-written (Bond et al., 2005; Oepen et al., 2007), which involved a large amount of manual work. Graham and van Genabith (2008) and Graham et al. (2009) explored the automatic rule induction approach in a transfer-based MT setting involving two lexical functional grammars (LFGs), which was still restricted by the performance of both the parser and the generator. Lack of robustness in target-side generation is one of the main issues when various ill-formed or fragmented structures come out after transfer. Oepen et al. (2007) use their generator to generate text fragments instead of full sentences, in order to increase robustness. We want to make use of the grammar resources while keeping the robustness; therefore, we experiment with another way of transfer, involving information derived from the grammars.

In our approach, we take an SMT system as our ‘backbone’, which robustly delivers some translation for any given input; then we augment the SMT system with deep linguistic knowledge. In general, what we are doing is still along the lines of previous work utilizing deep grammars, but we build a more light-weight transfer model.

8 Conclusion and Future Work

In this paper, we report our work on building a linguistically-augmented statistical machine translation model from Bulgarian to English. Based on our observations of the previous approaches to transfer-based MT models, we decided to build a hybrid system by combining an SMT system with deep linguistic resources. We performed a preliminary evaluation on several configurations of the system (with different linguistic knowledge). The high BLEU scores show the high quality of the translations delivered by the SMT baseline, and manual analysis confirms the consistency of the system.

There are various aspects in which we can improve the ongoing project: 1) the MRSes are not fully explored yet, since we have only considered the EP and EOV features; 2) we would like to add factors on the target language side (English) as well; 3) the guideline for the manual evaluation needs further refinement to account for missing lexicon entries, as well as for how much of the content is truly conveyed (Farrús et al., 2011); 4) we also need more experiments to evaluate the robustness of our approach in out-of-domain tests.

Acknowledgements

This work was supported by the EuroMatrixPlus project (IST-231720) funded by the European Community under the Seventh Framework Programme for Research and Technological Development. The authors would like to thank Tania Avgustinova for fruitful discussions and her helpful linguistic analysis, and Laska Laskova, Stanislava Kancheva and Ivaylo Radev for doing the human evaluation of the data.

References

Srinivas Bangalore and Aravind K. Joshi. 1999. Supertagging: an approach to almost parsing. Computational Linguistics, 25(2), June.

Emily M. Bender, Dan Flickinger, and Stephan Oepen. 2002. The Grammar Matrix: An open-source starter-kit for the rapid development of cross-linguistically consistent broad-coverage precision grammars. In Proceedings of the Workshop on Grammar Engineering and Evaluation at the 19th International Conference on Computational Linguistics, Taipei, Taiwan.

Alexandra Birch, Miles Osborne, and Philipp Koehn. 2007. CCG supertags in factored statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 9–16, Prague, Czech Republic, June.

Francis Bond, Stephan Oepen, Melanie Siegel, Ann Copestake, and Dan Flickinger. 2005. Open source machine translation with DELPH-IN. In Proceedings of the Open-Source Machine Translation Workshop at the 10th Machine Translation Summit, pages 15–22, Phuket, Thailand, September.

Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar F. Zaidan. 2011. Findings of the 2011 workshop on statistical machine translation. In Proceedings of the 6th Workshop on SMT.

Yu Chen, M. Jellinghaus, A. Eisele, Yi Zhang, S. Hunsicker, S. Theison, Ch. Federmann, and H. Uszkoreit. 2009. Combining multi-engine translations with Moses. In Proceedings of the 4th Workshop on SMT.

Ann Copestake and Dan Flickinger. 2000. An open source grammar development environment and broad-coverage English grammar using HPSG. In Proceedings of the 2nd International Conference on Language Resources and Evaluation, Athens, Greece.

Ann Copestake, Dan Flickinger, Carl Pollard, and Ivan Sag. 2005. Minimal recursion semantics: An introduction. Research on Language & Computation, 3(4):281–332.

Ann Copestake. 2003. Robust minimal recursion semantics (working paper).

Ann Copestake. 2007. Applying robust semantics. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING), pages 1–12.

Mireia Farrús, Marta R. Costa-jussà, and Maja Popović. 2011. Study and correlation analysis of linguistic, perceptual and automatic machine translation evaluations. Journal of the American Society for Information Sciences and Technology, 63(1):174–184, October.

Georgi Georgiev, Valentin Zhikov, Petya Osenova, Kiril Simov, and Preslav Nakov. 2012. Feature-rich part-of-speech tagging for morphologically complex languages: Application to Bulgarian. In Proceedings of EACL 2012.

Jesús Giménez and Lluís Màrquez. 2004. SVMTool: A general POS tagger generator based on support vector machines. In Proceedings of the 4th LREC.

Yvette Graham and Josef van Genabith. 2008. Packed rules for automatic transfer-rule induction. In Proceedings of the European Association of Machine Translation Conference (EAMT 2008), pages 57–65, Hamburg, Germany, September.

Yvette Graham, Anton Bryl, and Josef van Genabith. 2009. F-structure transfer-based statistical machine translation. In Proceedings of the Lexical Functional Grammar Conference, pages 317–328, Cambridge, UK. CSLI Publications, Stanford University, USA.

Hany Hassan, Khalil Sima'an, and Andy Way. 2007. Supertagged phrase-based statistical machine translation. In Proceedings of ACL, Prague, Czech Republic, June.

Max Jakob, Markéta Lopatková, and Valia Kordoni. 2010. Mapping between dependency structures and compositional semantic representations. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010), pages 2491–2497.

Philipp Koehn and Hieu Hoang. 2007. Factored translation models. In Proceedings of EMNLP.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of ACL (demo session).

Philipp Koehn, Alexandra Birch, and Ralf Steinberger. 2009. 462 machine translation systems for Europe. In Proceedings of MT Summit XII.

Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, January.

Stefan Müller and Walter Kasper. 2000. HPSG analysis of German. In Wolfgang Wahlster, editor, Verbmobil: Foundations of Speech-to-Speech Translation, pages 238–253. Springer, Berlin, Germany, artificial intelligence edition.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1).

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of ACL.

Stephan Oepen, Helge Dyvik, Jan Tore Lønning, Erik Velldal, Dorothee Beermann, John Carroll, Dan Flickinger, Lars Hellan, Janne Bondi Johannessen, Paul Meurer, Torbjørn Nordgård, and Victoria Rosén. 2004. Som å kapp-ete med trollet? Towards MRS-based Norwegian to English machine translation. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation, Baltimore, MD.

Stephan Oepen, Erik Velldal, Jan Tore Lønning, Paul Meurer, Victoria Rosén, and Dan Flickinger. 2007. Towards hybrid quality-oriented machine translation: on linguistics and probabilities in MT. In Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation (TMI-07), Skövde, Sweden.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL.

Melanie Siegel and Emily M. Bender. 2002. Efficient deep processing of Japanese. In Proceedings of the 19th International Conference on Computational Linguistics, Taipei, Taiwan.

Melanie Siegel. 2000. HPSG analysis of Japanese. In Wolfgang Wahlster, editor, Verbmobil: Foundations of Speech-to-Speech Translation, pages 265–280. Springer, Berlin, Germany, artificial intelligence edition.

Thoudam Doren Singh and Sivaji Bandyopadhyay. 2010. Manipuri-English bidirectional statistical machine translation systems using morphology and dependency relations. In Proceedings of the Fourth Workshop on Syntax and Structure in Statistical Translation, pages 83–91, Beijing, China, August.

Kathrin Spreyer and Anette Frank. 2005. Projecting RMRS from TIGER dependencies. In Proceedings of the HPSG 2005 Conference, pages 354–363, Lisbon, Portugal.

Andreas Stolcke. 2002. SRILM – an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing, volume 2.

Gregor Thurmair. 2005. Hybrid architectures for machine translation systems. Language Resources and Evaluation, 39(1).

Gregor Thurmair. 2009. Comparing different architectures of hybrid machine translation systems. In Proceedings of MT Summit XII.

Using Sense-labeled Discourse Connectives for Statistical Machine Translation

Thomas Meyer and Andrei Popescu-Belis
Idiap Research Institute
Rue Marconi 19, 1920 Martigny, Switzerland
{thomas.meyer, andrei.popescu-belis}@idiap.ch

Abstract

This article shows how the automatic disambiguation of discourse connectives can improve Statistical Machine Translation (SMT) from English to French. Connectives are firstly disambiguated in terms of the discourse relation they signal between segments. Several classifiers trained using syntactic and semantic features reach state-of-the-art performance, with F1 scores of 0.6 to 0.8 over thirteen ambiguous English connectives. Labeled connectives are then used in SMT systems either by modifying their phrase table or by training them on labeled corpora. The best modified SMT systems improve the translation of connectives without degrading BLEU scores. A threshold-based SMT system using only high-confidence labels improves BLEU scores by 0.2–0.4 points.

1 Introduction

Current approaches to Statistical Machine Translation (SMT) have difficulties in modeling long-range dependencies between words, including those that are due to discourse-level phenomena. Among these, discourse connectives are words that signal rhetorical relations between clauses or sentences. Their translation often depends on the exact relation signaled in context, a feature that current SMT systems were not designed to capture, hence their frequent mistranslations of connectives (see Section 2 below).

In this paper, we present a series of experiments that aim to use, in SMT systems, data with automatically labeled discourse connectives. Section 3 first presents the data sets used in our experiments. We designed classifiers that attempt to assign sense labels to ambiguous discourse connectives, and their scores compare favorably with the state-of-the-art for this task, as shown in Section 4. In particular, we consider WordNet relations and temporal expressions, as well as candidate translations of connectives, as additional features (Section 4.2).

However, our main goal is not the disambiguation of connectives per se, but the use of the labels assigned to connectives as additional input to an SMT system. To the best of our knowledge, our experiments are the first attempts to combine connective disambiguation and SMT. Three solutions to this combination are compared in Section 5: modifying phrase tables, and training on data labeled, manually or automatically, with senses of connectives. We further show that a modified SMT system is best used when the confidence for a given label is high (Section 6). The paper concludes with a comparison to related work (Section 7) and an outline of future work (Section 8).

2 Discourse Connectives in Translation

Discourse connectives such as although, however, since or while form a functional category of lexical items that are frequently used to mark coherence or discourse relations such as explanation, synchrony or contrast between units of text or discourse. For example, in the Europarl corpus from years 199x (Koehn, 2005), the following nine lexical items, which are often (though not always) discourse connectives, are among the 400 most frequent tokens out of a total of 12,846,003 (in parentheses, rank and number of occurrences): after (244th/6485), although (375th/4062), however (110th/12,857), indeed (334th/4486), rather (316th/4688),

since (190th/8263), still (168th/9195), while (390th/3938), yet (331st/4532); see also (Cartoni et al., 2011). Discourse connectives can be difficult to translate because many of them can signal different relations between clauses in different contexts. Moreover, if a wrong connective is used in translation, the text becomes incoherent, as in the two examples below, taken from Europarl and translated (EN/FR) with Moses (Koehn et al., 2007) trained on the entire corpus:

1. EN: This tax, though [contrast], does not come without its problems.
FR-SMT: *Cette taxe, même si [concession], ne se présente pas sans ses problèmes.

2. EN: Finally, and in conclusion, Mr President, with the expiry of the ECSC Treaty, the regulations will have to be reviewed since [causal] I think that the aid system will have to continue beyond 2002 . . .
FR-SMT: *Enfin, et en conclusion, Monsieur le président, à l'expiration du traité CECA, la réglementation devra être revu depuis que [temporal] je pense que le système d'aides devront continuer au-delà de 2002 . . .

In the first example, the connective generated by SMT (même si, literally “even if”) signals a concession and not a contrast, for which the connective mais should have been used (as in the reference). In the second example, the connective depuis que (literally “from the time”) generated by SMT expresses a temporal relation and not a causal one, which should have been conveyed e.g. by the French car.

Such examples suggest that the disambiguation of connectives prior to translation could help SMT systems to generate a correct connective in the target language. Of course, depending on the language pair, some ambiguities can be carried over from the source to the target language, so they need not be solved. Still, improving the overall translation of discourse connectives should increase the overall coherence of MT output, with a potentially large impact on perceived quality.

3 Data Used in Our Experiments

For both tasks, the disambiguation of connectives and SMT, different training and testing data sets are available. This section shows how we made use of these resources and how we augmented them by manual and automated annotation of the senses of discourse connectives.

3.1 Data for the Disambiguation of Discourse Connectives

One of the most important resources for discourse connectives in English is the Penn Discourse Treebank (Prasad et al., 2008). The PDTB provides a discourse-layer annotation over the Wall Street Journal Corpus (WSJ) and the Penn Treebank syntactic annotation. The discourse annotation consists of manually annotated senses for about 100 types of explicit connectives, for implicit ones, and for their clause spans. For the entire WSJ corpus of about 1,000,000 tokens there are 18,459 instances of annotated explicit connectives. The senses that discourse connectives can signal are organized in a hierarchy with 4 top-level senses, followed by 16 subtypes on the second level and 23 detailed subsenses on the third level. Studies using the PDTB to build classifiers usually split the WSJ corpus into Sections 02–21 for training and Section 23 for testing (as we did for our disambiguation experiments, see Section 4).

From the PDTB, we extracted the 13 most frequent and most ambiguous connectives: after, although, however, indeed, meanwhile, nevertheless, nonetheless, rather, since, still, then, while, and yet. This set shows in particular that connectives signaling contrastive or temporal senses are the most ambiguous ones; hence they are also potentially difficult to translate, as this ambiguity is often not preserved across languages (Danlos and Roze, 2011). We used the senses from the second PDTB hierarchy level (as the third level is too fine-grained for EN/FR translation) and generated the training and testing sets listed with statistics in Table 1 (Section 4).

In principle, classifiers trained on PDTB data can be applied directly to label connectives on the English side of the Europarl corpus (Koehn, 2005) used for training and testing SMT. However, to control for the difference in register from newswire texts to formal political speech, and to allow for future studies of other languages, we also performed manual annotation (Cartoni et al., 2011) of five connectives over the Europarl corpus (although, even though, since, though and while).

The manual annotation was performed on subsets of Europarl v5 (years 199x), for the first few hundred occurrences of each connective. Instead of a potentially difficult and costly annotation of senses, as in the PDTB, we performed translation spotting, asking annotators to highlight the translation of each of the five connectives in the French side of the corpus. From the list of all observed translations one can then cluster the necessary sense labels, as some target-language connectives clearly signal only one sense; in cases where ambiguity is preserved, one can group the equally ambiguous connectives under one composite label. For example, while is sometimes translated into the French discourse connectives tandis que or alors que, which both preserve the ambiguity of while signaling a temporal or contrastive sense. With this method we built the data sets listed with statistics in Table 2 below (Section 4).

3.2 Data for Statistical Machine Translation

The translation data for our SMT experiments has often been used in other MT research work and is freely distributed for the shared tasks of the Workshop on Machine Translation (WMT, statmt.org/wmt10/translation-task.html).

For training our SMT systems, the EN/FR Europarl corpus v5 was used in three ways to integrate data with labeled discourse connectives into SMT: no changes (for MT phrase table modifications), integration of manually annotated data, and integration of automatically labeled data. These methods are described below in Section 5; here, we gather descriptions of the corresponding data.

a: Modification of the phrase table: Europarl (346,803 sentences), labeling the translation model after training.

b: Integration of manual annotation: Europarl (346,803 sentences), minus all 8,901 sentences containing one of the above 5 connective types, plus 1,147 sentences with manually sense-labeled connectives.

c: Integration of automated annotation: Europarl, years 199x (58,673 sentences), with all occurrences of the 13 PDTB subset connective types labeled by classifiers (in 6,961 sentences).

For Minimum Error Rate tuning (MERT) (Och, 2003) of the SMT systems, we used the 2009 News Commentary (NC) EN/FR development set with the following modifications:

d: Phrase table: NC 2009 (2,051 sentences), no modifications.

e: Manual annotation: NC 2009 (2,051 sentences), minus all 123 sentences containing one of the above 5 connective types, plus 102 sentences with manually sense-labeled connectives.

f: Automated annotation: NC 2009 (2,051 sentences), with all occurrences of the 13 PDTB subset connective types labeled by classifiers (in 340 sentences).

For testing our modified SMT systems, three test sets were extracted in the following way:

g: 35 sentences from NC 2007, with 7 occurrences of each of the 5 connective types above, manually labeled.

h: 62 sentences from NC 2007 and 2006, with occurrences of the 13 PDTB connective types, automatically labeled with the classifiers.

i: 10,311 sentences from the EN/FR UN corpus, with all occurrences of the five Europarl connective types automatically labeled with the classifiers.

These test sets might appear small compared to the amount of data normally used for SMT system testing. In our system evaluation, however, apart from automated scoring, we also had to perform manual counts of improved translations, which is why we could not evaluate more than about a hundred sentences (Section 5). For the manual counts, test set (i) was downsampled to the same amounts of 35 and 62 sentences as for sets (g) and (h), by extracting the first occurrences of each connective.

In all experiments, we use the Moses phrase-based SMT decoder (Koehn et al., 2007) and a 5-gram language model built over the entire French part of the Europarl corpus v5.

4 Automatically Disambiguating Discourse Connectives

4.1 Classifier PT: Trained on PDTB Data

A first classifier (‘PT’) for ambiguous discourse connectives and their senses was built by using the PDTB subset of 13 ambiguous connectives as training material.
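A maximum-entropy sense classifier of this kind can be approximated with scikit-learn's multinomial logistic regression (the Stanford package used in the paper is a Java toolkit); the feature names below loosely mimic Section 4.1, but the data is invented for illustration:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy instances for the connective 'since': contextual features -> PDTB sense.
train = [
    ({"conn": "since", "first_w_c2": "then", "pos_first_c2": "RB"}, "Asynchronous"),
    ({"conn": "since", "first_w_c2": "it", "pos_first_c2": "PRP"}, "Cause"),
    ({"conn": "Since", "first_w_c2": "1987", "pos_first_c2": "CD"}, "Asynchronous"),
]

vec = DictVectorizer()
X = vec.fit_transform([feats for feats, _ in train])
clf = LogisticRegression(max_iter=1000).fit(X, [sense for _, sense in train])

# Unseen feature values are simply ignored by the vectorizer at test time.
test = {"conn": "since", "first_w_c2": "1990", "pos_first_c2": "CD"}
print(clf.predict(vec.transform([test]))[0])  # -> one of the sense labels
```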

Connective     Training set: total and per sense                Test set: total and per sense      PT     PT+
after          507: 456 As, 51 As/Ca                            25: 22 As, 3 As/Ca                 0.66   1.00
although       267: 135 Cs, 118 Ct, 14 Cp                       16: 9 Ct, 7 Cs                     0.60   0.66
however        176: 121 Ct, 32 Cs, 23 Cp                        14: 13 Ct, 1 Cs                    0.33   1.00
indeed         69: 37 Cd, 24 R, 3 Ca, 3 E, 2 I                  *2: 2 R                            *0.50  *0.50
meanwhile      117: 66 Cj/S, 16 Cd, 16 S, 14 Ct/S, 5 Ct         10: 5 S, 5 Ct/S                    0.32   0.53
nevertheless   26: 15 Ct, 11 Cs                                 6: 4 Cs, 2 Ct                      0.44   0.66
nonetheless    12: 7 Cs, 3 Ct, 2 Cp                             *1: 1 Cs                           *1.00  *1.00
rather         10: 6 R, 2 Al, 1 Ca, 1 Ct                        *1: 1 Al                           *0.00  *0.00
since          166: 75 As, 83 Ca, 8 As/Ca                       9: 4 As, 3 Ca, 2 As/Ca             0.78   0.78
still          114: 56 Cs, 51 Ct, 7 Cp                          13: 9 Ct, 4 Cs                     0.60   0.66
then           145: 136 As, 6 Cd, 3 As/Ca                       6: 5 As, 1 Cd                      0.83   1.00
while          631: 317 Ct, 140 S, 79 Cs, 41 Ct/S, 36 Cd, 18 Cp 37: 19 Ct, 10 S, 4 Cs, 4 Ct/S      0.93   0.96
yet            80: 46 Ct, 25 Cs, 9 Cp                           *2: 2 Ct                           *0.5   *1.00
Total          2,320                                            142                                0.57   0.75

Table 1: Performance of MaxEnt connective sense classifiers: Classifier PT (initial feature set) and Classifier PT+ (with candidate translation features) for 13 temporal and contrastive connectives in the PDTB. The sense labels are coded as follows. Al: alternative, As: asynchronous, Ca: cause, Cd: condition, Cj: conjunction, Cp: comparison, Cs: concession, Ct: contrast, E: expansion, I: instantiation, R: restatement, S: synchrony. In some cases, marked with ‘*’, the test sets are too small to provide meaningful scores.

For each connective we built a specialized classifier, using the Stanford Maximum Entropy classifier package (Manning and Klein, 2003). Maximum Entropy is known to handle discrete features well and has been applied successfully to connective disambiguation before (see Section 7).

An initial set of features can be obtained directly from the PDTB (and must hence be considered as oracle features): the (capitalized) connective token, its POS tag, the first word of clause 1, the last word of clause 1, the first word of clause 2 (the one containing the explicit connective), the last word of clause 2, the POS tag of the first word of clause 2, the type of the first word of clause 2, the parent syntactic categories of the connective, and the punctuation pattern of the sentences. Apart from these standard features in discourse connective disambiguation, we used WordNet (Miller, 1995) to compute lexical similarity scores with the lesk metric (Banerjee and Pedersen, 2002) for all the possible combinations of nouns, verbs and adjectives in the two clauses, as well as antonyms found for these word groups. In addition, we used features that are likely to help detect temporal relations, obtained from the Tarsqi Toolkit (Verhagen and Pustejovsky, 2008), which automatically annotates English sentences with the TimeML annotation language for temporal expressions. For example, in the sentence “The crimes may appear small, but the prices can be huge” (PDTB Section 2, WSJ file 0290), our features would indicate the antonyms small vs. huge that signal the contrast, along with a temporal ordering of the event appear before the event can.

We report the classifier performances as F1 scores for each connective (weighting precision and recall equally) in Table 1, testing on Section 23 of the PDTB. This sense classifier will be referred to as Classifier PT in the rest of the paper, in particular when used for the SMT experiments.

4.2 Classifier PT+: With Candidate Translations as Features

In an attempt to improve Classifier PT, we added a new type of feature, resulting in Classifier PT+. Namely, we used candidate translations of discourse connectives from a baseline SMT system (not adapted to connectives). To find these values, a Moses baseline decoder was used to translate the PDTB data, which was then word-aligned (English source with target French) using GIZA++ (Och and Ney, 2003).
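For the WordNet-based features, a rough sketch using NLTK follows; this is a substitute for whatever toolkit the authors actually used, and it only covers the direct-antonym lookup, not the lesk similarity metric:

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def antonym_pairs(words1, words2):
    """Flag word pairs across the two clauses that WordNet lists as direct
    antonyms, a cue for a contrastive relation between the clauses."""
    pairs = set()
    targets = set(words2)
    for w1 in words1:
        for syn in wn.synsets(w1):
            for lemma in syn.lemmas():
                for ant in lemma.antonyms():
                    if ant.name() in targets:
                        pairs.add((w1, ant.name()))
    return pairs

print(antonym_pairs(["small"], ["large", "huge"]))  # e.g. {('small', 'large')}
```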

Connective  | Training set: total (senses)                  | Test set: total (senses)             | F1 Score
------------|-----------------------------------------------|--------------------------------------|---------
although    | 173 (155 Cs, 18 Ct)                           | 10 (5 Cs, 5 Ct)                      | 0.67
even though | 179 (165 Cs, 14 Ct)                           | 10 (5 Cs, 5 Ct)                      | 1.00
since       | 413 (274 S, 131 Ca, 8 S/Ca)                   | 10 (5 Ca, 3 S, 2 S/Ca)               | 0.80
though      | 150 (80 Cs, 70 Ct)                            | 10 (5 Cs, 5 Ct)                      | 1.00
while       | 280 (130 Cs, 41 Ct, 89 S/Ct, 13 S/Ca, 7 S)    | 14 (4 Cs, 2 Ct, 2 S/Ct, 2 S/Ca, 4 S) | 0.64
Total       | 1,195                                         | 54                                   | 0.82

Table 2: Performance of a MaxEnt connective sense classifier (Classifier EU) for 5 connectives in the Europarl corpus. The sense labels are coded as follows. Cs: concession, Ct: contrast, S: synchrony, Ca: cause.

4.2 Classifier PT+: With Candidate Translations as Features

In an attempt to improve Classifier PT, we added a new type of feature, resulting in Classifier PT+. Namely, we used candidate translations of discourse connectives from a baseline SMT system (not adapted to connectives). To find these values, a Moses baseline decoder was used to translate the PDTB data, which was then word-aligned (English source with target French) using GIZA++ (Och and Ney, 2003). In this alignment, we searched for the translation equivalents of the 13 PDTB connectives using a hand-crafted dictionary of possible French translations. When the translation candidate is not ambiguous – e.g. bien que as a translation of while clearly signals a concession – its specific sense label was added as the value of an additional feature. In some cases, however, the values of the features are not determined (and are set to NONE): either when the SMT system or GIZA++ failed to translate or align a connective, or when the target connective was just as ambiguous as the source one (e.g. while translated as tandis que, which can be labeled as either temporal or contrastive). Overall, this procedure led to an accuracy gain of Classifier PT+ over Classifier PT of about 0.1 to 0.6 F1 for some of the connectives, as can be seen in the last column of Table 1.

4.3 Classifier EU: Trained on Europarl Data

As explained in Section 3.1, we also performed manual annotation of connective senses in Europarl, to provide labeled instances directly in the data used for SMT training and to account for the register change. For the Europarl data sets, we built a new MaxEnt classifier (called Classifier EU) using the same feature set as Classifier PT. However, all features were this time extracted automatically (no oracle features). In particular, we used Charniak and Johnson's (2005) parser to extract the syntactic features. In Table 2, we report the results of Classifier EU, again in terms of F1 scores. For all three classifiers, PT, PT+ and EU, the F1 scores are in the range of 0.6 to 0.8, thus comparing favorably to the state of the art for discourse connective disambiguation with detailed senses (Section 7). Classifier EU also compares favorably to PT and PT+, as seen for instance for since (0.80 vs. 0.78) or although (0.67 vs. 0.60–0.66).

5 Use of Labeled Connectives for SMT

In this section, we report on experiments that study the effect of discourse connective labeling on SMT. The experiments differ with respect to the method used for taking advantage of the labels, but also with respect to the data sets and the sense classifiers that are used.

5.1 Evaluation Metrics for MT

The variation in MT quality can be estimated in several ways. On the one hand, we use the BLEU metric (Papineni et al., 2002) with one reference translation, as is most often done in current SMT research.² To improve confidence in the BLEU scores, especially when test sets are small, we also compute BLEU scores using bootstrapping of the data sets (Zhang and Vogel, 2010): the test sets are re-sampled a thousand times and the average BLEU score is computed from the individual sample scores. The BLEU approach is not likely, however, to be sensitive enough to the small differences due to the correction of discourse connectives (less than one word per sentence). We therefore additionally resort to a manual evaluation metric, referred to as ∆Connectives, which counts the occurrences of connectives that are better translated by our modified systems compared to the baseline ones.

² The scores are generated by the NIST MTeval script version 11b, available from www.itl.nist.gov/iad/mig/tools/.
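As an illustration of the bootstrap procedure of Zhang and Vogel (2010) described above, the sketch below re-samples a test set with replacement and averages the per-sample BLEU scores. The paper's scores come from the NIST MTeval script; sacrebleu is used here only as a convenient stand-in, and the function name is illustrative.

```python
# Sketch of bootstrap BLEU (Zhang and Vogel, 2010): re-sample the test set
# with replacement a thousand times and average the per-sample scores.
# sacrebleu stands in here for the NIST MTeval script used in the paper.
import random
import sacrebleu

def bootstrap_bleu(hypotheses, references, n_samples=1000, seed=1):
    rng = random.Random(seed)
    pairs = list(zip(hypotheses, references))
    scores = []
    for _ in range(n_samples):
        sample = [rng.choice(pairs) for _ in pairs]  # re-sample with replacement
        hyps, refs = zip(*sample)
        scores.append(sacrebleu.corpus_bleu(list(hyps), [list(refs)]).score)
    return sum(scores) / len(scores)
```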

MT system                                  | Exp. | Occ.   | Types | Labeling | ∆Conn. (%) + / = / – | BLEU std. | BLEU bootstr.
-------------------------------------------|------|--------|-------|----------|----------------------|-----------|--------------
Modified phrase table                      | 1    | 35     | 5     | manual   | 29 / 51 / 20         | 39.92     | 40.54
Modified phrase table                      | 2    | 10,311 | 5     | Cl. EU   | 34 / 46 / 20         | 22.13     | 23.63
Trained on manual annotations              | 3    | 35     | 5     | manual   | 32 / 57 / 11         | 41.58     | 42.38
Trained on manual annotations              | 4    | 10,311 | 5     | Cl. EU   | 26 / 66 / 8          | 22.43     | 24.00
Trained on automatic annotations (Cl. PT)  | 5    | 62     | 13    | Cl. PT   | 16 / 60 / 24         | 14.88     | 15.96
Trained on automatic annotations (Cl. PT)  | 6    | 10,311 | 5     | Cl. EU   | 16 / 66 / 18         | 19.78     | 21.17
Trained on automatic annotations (Cl. PT+) | 7    | 62     | 13    | Cl. PT+  | 11 / 70 / 19         | 15.67     | 16.73
Trained on automatic annotations (Cl. PT+) | 8    | 10,311 | 5     | Cl. EU   | 18 / 68 / 14         | 20.14     | 21.55

Table 3: MT systems dealing with manually and automatically (PT, PT+, EU) sense-labeled connectives: BLEU scores (including bootstrapped ones) and variation in the translation of individual connectives (∆Connectives, as a percentage). The description of each condition and the baseline BLEU scores are in the text of the article.

5.2 Phrase Table Modification

A first way of using labeled connectives is to modify the phrase table of an SMT system previously trained/tuned on data sets (a)/(d) from Section 3.2, in order to force it to translate each specific sense of a discourse connective (as indicated by its label) with an acceptable equivalent selected among those learned from the training data. Of course, this only handles cases where connectives are translated by explicit lexical items (typically, target connectives) and not by more complex grammatical constructs.

The phrase table modification is done as follows (a sketch is given at the end of this subsection). Based on a small dictionary of the five connective types of Table 2, their acceptable French equivalents and the possible senses, the initial phrase table is searched for phrases containing a connective, and each occurrence is inspected to find out which sense is reflected in the translation. If the sense is unambiguous, the table entry is modified to include the label, and the probability score is set to 1 in order to maximize the chance that the respective translation is found during decoding. For instance, every phrase table entry where while is translated as alors que corresponds to a contrastive use, so while is changed into while CONTRAST; in the entries where while is translated as bien que, the lexical entry is changed into while CONCESSION. However, when the source entry is as ambiguous as the target one, no modification is made. This means that during decoding (testing) with labeled sentences, these entries will never be used.

The results of the SMT system are shown in experiments 1 and 2 in Table 3, testing respectively over data set (g) (7 manually annotated sentences for each of the 5 connectives) and over set (i), in which the 5 connectives were automatically labeled with Classifier EU. In the first test, the translations of 29% of the connectives are improved by the modified system, while 20% are degraded and 51% remain unchanged – thus reflecting an overall 10% improvement in the translations of connectives (∆Connectives). However, for this test set, the BLEU score is about 3 points below that of the baseline SMT system that used the same phrase table without modification of labels and scores (not shown in Table 3). In experiment 2, however, the BLEU score of the modified system is in the same range as the baseline one (22.13 vs. 22.76). As for ∆Connectives, since it was not possible to score all 10,311 connectives manually, we sampled 35 sentences and found that 34% of the connectives are improved, 20% are degraded and 46% remain unchanged, again reflecting an improvement in the translation of connectives. This shows that piping automatic labeling and SMT with a modified phrase table does not degrade the overall BLEU score, while increasing ∆Connectives.
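To make the phrase-table modification step concrete, here is a minimal sketch under simplifying assumptions: Moses-style text entries with " ||| "-separated fields, an underscore-joined label form (e.g. while_CONTRAST), and a toy excerpt of the sense dictionary. The dictionary contents shown are illustrative, not the authors' full hand-crafted resource.

```python
# Sketch of the phrase-table modification. Only unambiguous connective-
# translation pairs appear in the dictionary; ambiguous ones (e.g.
# while / tandis que) are deliberately absent, so their entries are
# left unmodified and are never used when decoding labeled input.
SENSE_DICT = {
    "while": {"alors que": "CONTRAST", "bien que": "CONCESSION"},
}

def relabel_phrase_table(in_path, out_path):
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fields = line.rstrip("\n").split(" ||| ")
            src, tgt = fields[0], fields[1]
            for conn, translations in SENSE_DICT.items():
                for fr_phrase, sense in translations.items():
                    if conn in src.split() and fr_phrase in tgt:
                        # Append the sense label and force the probabilities
                        # to 1 so the labeled entry wins during decoding.
                        fields[0] = src.replace(conn, conn + "_" + sense)
                        fields[2] = " ".join("1" for _ in fields[2].split())
            fout.write(" ||| ".join(fields) + "\n")
```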

5.3 Training on Tagged Corpora

We explored a more principled way to integrate external labels into SMT: using labeled data (manual or automatic) for training, so that the system directly learns a modified phrase table which allows the translation of (automatically) labeled data at test time.

5.3.1 Manual Gold Annotation

We first report two experiments using the manual gold annotation of the five connective types over Europarl excerpts, used for training. When this annotation is also used for testing (experiment 3 in Table 3), this can be seen as an oracle experiment, measuring the translation improvement when connective sense labeling is perfect. In experiment 4, by contrast, the SMT system uses the output of an automatic labeler. For training/tuning we used data sets (b)/(e) from Section 3.2.

In experiment 3, for test set (g), 32% of the connectives were translated better by the modified system, 57% remained the same, and 11% were degraded. In experiment 4, over a 35-sentence sample of the bigger test set (i), 26% were improved, 66% remained the same, and only 8% were degraded. The baseline SMT system (not shown in Table 3) was built with the same amounts of unlabeled training and tuning data. Overall, the BLEU scores of our modified systems are similar to the baseline ones, though still lower – 41.58 vs. 42.77 for experiment 3, and 22.43 vs. 22.76 for experiment 4 – which is also confirmed by the bootstrapped scores.

Another comparison shows that the system trained on manual annotations (exp. 4) outperforms the system using a modified phrase table (exp. 2) in terms of BLEU scores (22.43 vs. 22.13) and bootstrapped ones (24.00 vs. 23.63).

5.3.2 Automated Annotation

We evaluated an SMT system trained on data that was automatically labeled using the classifiers from Section 4 (a sketch of the tagging step follows this subsection). This method provides a large amount of imperfect training data and uses no manual annotations at all, except for the initial training of the classifiers. For these experiments (5 and 6 in Table 3), the BLEU scores as well as the manual counts of improved connectives are lower than in the preceding experiments because, overall, less training/tuning data was used – about 15% of Europarl, data sets (c) and (f) in Section 3.2. The baseline system was built over the same amount of data, with no labels.

Testing here was performed over the slightly bigger test set (h) with 62 sentences (13 connective types). The occurrences were tagged with Classifier PT prior to translation (exp. 5). Compared to the baseline system, the translations of 16% of the connectives were improved, while 60% remained the same and 24% were degraded. In experiment 6, the 10,311 UN occurrences of 5 connective types were first tagged with Classifier EU. Evaluated on a sample of 62 sentences, 16% of the connectives were improved, while 66% remained the same and 18% were degraded. Despite less training data, in terms of BLEU, the difference to the respective baseline system (scores not shown in Table 3) is similar in both experimental settings: 19.78 vs. 20.11 for experiment 6 (automated annotation), compared to 22.43 vs. 22.76 for experiment 4 (manual annotation).

Finally, we carried out two experiments (7 and 8) with Classifier PT+, which uses the translation candidates as additional features and has a higher accuracy than PT (Section 4.2). As a result, the translation of connectives (∆Connectives) is indeed improved compared (respectively) to experiments 5 and 6, as shown in lines 7–8 of Table 3. The BLEU scores of the corresponding SMT systems are also increased in experiments 7 vs. 5 and 8 vs. 6, and are now equal to the baseline ones (for experiment 8: 20.14 vs. 20.11, or, bootstrapped, 21.55 vs. 21.55).

The results of experiments 7/8 vs. 5/6 indicate that improved classifiers for connectives also improve SMT output as measured by ∆Connectives, with BLEU remaining fairly constant, and are therefore worth investigating in more depth in the future. When comparing manual (experiments 3/4) vs. automated annotation (experiments 5/6/7/8) and their use in SMT, the differences in the scores (BLEU and ∆Connectives) highlight a trade-off: manually annotated training data leads to better scores, but larger and noisier automatically annotated training data is an acceptable solution when manual annotations are not available.
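A minimal sketch of the corpus tagging step referred to above: before SMT training, each connective occurrence on the English side is replaced by the sense-labeled form predicted by a classifier. The `featurize(tokens, i)` adapter is a hypothetical helper building the feature dictionary the classifier expects; whitespace tokenization and single-token connectives are simplifications (multiword connectives such as even though would need extra handling).

```python
# Sketch of sense-tagging a training corpus before SMT training:
# "while" -> "while_CONTRAST", etc. The classifier is assumed to expose
# predict() over feature dicts, as in the earlier classifier sketch.
CONNECTIVES = {"although", "though", "since", "while"}

def tag_sentence(tokens, classifier, featurize):
    tagged = []
    for i, tok in enumerate(tokens):
        if tok.lower() in CONNECTIVES:
            sense = classifier.predict([featurize(tokens, i)])[0]
            tagged.append(tok + "_" + sense)
        else:
            tagged.append(tok)
    return tagged

def tag_corpus(in_path, out_path, classifier, featurize):
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            tokens = tag_sentence(line.split(), classifier, featurize)
            fout.write(" ".join(tokens) + "\n")
```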


Figure 1: Use of a combined system (COMB) that directs the input sentences either to a system trained on a sense-labeled corpus (TTC) or to a baseline one (BASE), depending on the confidence of the connective classifier. The x-axis shows the threshold above which TTC is used – BASE being used below it – and the y-axis shows the BLEU scores of COMB with respect to TTC and BASE. Panel (a) is for although and panel (b) for since.

6 Classifier Confidence Scores

As shown by the above experiments, the accuracy of the connective classifiers influences SMT quality. We therefore hypothesize that an SMT system dealing with labeled connectives is best used when the confidence of the classifier is high, while a generic SMT system could be used for lower confidence values.

We experimented with the confidence scores of Classifier EU, which assigns a score between 0 and 1 to each of its decisions on the connectives' labels. (All processing is automatic in these experiments, and the evaluation is done solely in terms of BLEU.) We defined a threshold-based procedure to combine SMT systems: if the confidence for a sense label is above a certain threshold, the sentence is translated by an SMT system trained on labeled data from experiment 4 (or "tagged corpus", hence noted TTC); if it is below the threshold, it is sent to a baseline system (noted BASE). The resulting BLEU scores of the combined system (COMB), obtained for various threshold values, are shown in Figure 1 for two connectives.

Firstly, we considered all 1,572 sentences from the UN corpus which contained the connective although, labeled either as contrast or concession. We show the BLEU scores of the COMB system for several thresholds in the interval of observed confidence scores, along with the scores of BASE and TTC, in Figure 1(a). The results show that the scores of COMB increase with the value of the threshold, and that for at least one value of the threshold (0.95) COMB outperforms both TTC and BASE, by 0.20 BLEU points.

To confirm this finding with another connective, we took the first 1,572 sentences containing the connective since from the UN corpus. The BLEU scores for COMB are shown for the range of observed confidence values (0.4–1.0) in Figure 1(b). For several values of the threshold, COMB outperforms both BASE and TTC, in particular for 0.85, with a difference of 0.39 BLEU points.

The significance of the observed improvement was tested as follows. For each of the two connectives, we split the test sets of 1,572 sentences each into five folds, and compared for each fold the scores of COMB for the best-performing threshold (0.95 or 0.85) with the higher of BASE or TTC (i.e. BASE for although and TTC for since). We performed a paired t-test to compute the significance of the difference, and found p = 0.12 for although; this value lies slightly above the conventional 0.1 boundary, so the five pairs of scores suggest a consistent improvement for although, though one that does not reach conventional significance. For since, the difference in scores is significant at the 0.01 level (p = 0.005). Of course, COMB is always significantly better than the lower of BASE or TTC (p < 0.05). In the future, the system combination will be tested for all connectives, and the respective values of the thresholds will be set on tuning data, not on test data.
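The threshold-based combination and the five-fold significance check might look as follows; `classify_connective`, `translate_ttc` and `translate_base` are hypothetical wrappers around Classifier EU and the two Moses systems, so this is a sketch of the routing logic rather than the authors' implementation.

```python
# Sketch of COMB: route each sentence to TTC or BASE depending on the
# classifier's confidence in its sense label.
from scipy.stats import ttest_rel

def combine(sentences, classify_connective, translate_ttc, translate_base,
            threshold=0.95):
    outputs = []
    for sent in sentences:
        label, confidence = classify_connective(sent)  # sense, score in [0, 1]
        if confidence >= threshold:
            # High confidence: insert the label and use the system trained
            # on the tagged corpus (TTC).
            outputs.append(translate_ttc(sent, label))
        else:
            outputs.append(translate_base(sent))
    return outputs

def fold_significance(comb_fold_scores, best_single_fold_scores):
    """Paired t-test over per-fold BLEU scores, as in the five-fold check."""
    return ttest_rel(comb_fold_scores, best_single_fold_scores).pvalue
```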

7 Related Work

Discourse parsing (Marcu, 2000) has proven to be a difficult task, even when complex models (CRFs, SVMs) are used (Wellner, 2009; Hernault et al., 2010). The performance of discourse parsers is in the range of 0.4 to 0.6 F1.

With the release of the PDTB, recent research has focused on the disambiguation of discourse connectives as a task in its own right. For the disambiguation of explicit connectives, state-of-the-art performance for labeling all types of connectives in English is quite high. In the PDTB data, the disambiguation of discourse vs. non-discourse uses of connectives reaches 97% accuracy (Lin et al., 2010). The labeling of the four main senses from the PDTB sense hierarchy (temporal, contingency, comparison, expansion) reaches 94% accuracy (Pitler and Nenkova, 2009) – although the baseline accuracy is already around 85% when using only the connective token as a feature. Various methods for classification and feature analysis have been proposed (Wellner et al., 2006; Elwell and Baldridge, 2008). Other studies have focused on the analysis of highly ambiguous discourse connectives only. Miltsakaki et al. (2005) report classification results for the connectives since, while and when: using a Maximum Entropy classifier, they reach 75.5% accuracy for since, 71.8% for while and 61.6% for when. As the PDTB was not completed at that time, their data sets and labels are not exactly identical to the ones we used above (see Section 4).

The disambiguation of senses signaled by discourse connectives can be seen as a word sense disambiguation (WSD) problem for function words (as opposed to WSD for content words, which is more frequently studied). The integration of WSD into SMT has especially been studied by Carpuat and Wu (2007), who used the translation candidates output by a baseline SMT system as word sense labels – similar to our use of translation candidates as an additional classification feature in Section 4.2. The output of several classifiers based on linguistic features was then weighed against the translation candidates output by the baseline SMT system. With this procedure, their WSD+SMT system improved BLEU scores by 0.4–0.5 for the English/Chinese pair.

Chang et al. (2009) use a log-linear classifier with linguistic features to disambiguate the Chinese particle 'DE', which has five different context-dependent uses (modifier, preposition, relative clause, etc.). When the classifier is used to annotate the particle prior to SMT, the output of the translation system improves by up to 1.49 BLEU points for phrase-based Chinese-to-English translation. Ma et al. (2011) use a Maximum Entropy model to POS-tag English collocational particles (e.g. come down/by, turn against, inform of) more specifically than a usual POS tagger does (which gives only one label to all particles). The authors claim such a particle tagger is useful for English/Chinese translation, but do not show its actual integration into an MT system.

These approaches, as well as ours, show that integrating discourse information into SMT is promising and deserves further examination. The disambiguation of word senses, including function words, can improve SMT output when the senses are annotated in a pre-processing step that uses classifiers based on linguistic features at the semantic and discourse levels, which are not available to state-of-the-art SMT systems.

8 Conclusion and Future Work

This paper has presented methods and results for the disambiguation of temporal and contrastive discourse connectives in English texts, using MaxEnt classifiers with syntactic and semantic features, in terms of senses intended to help SMT. These classifiers have been used to perform experiments with connective-annotated data applied to EN/FR SMT systems. The results have shown an improvement in the translation of connectives for fully automatic systems trained on either hand-labeled or automatically labeled data. Moreover, BLEU scores were significantly improved, by 0.2–0.4, when such systems were used only for connectives that had been disambiguated with high confidence.

In future work we plan to improve the sense classifiers using additional features, to improve their integration with SMT, and to unify our data sets through additional manual annotations over Europarl. The applicability of the method to other languages will also be demonstrated experimentally.

Acknowledgments

We are grateful to the Swiss National Science Foundation (SNSF) for funding this work under the COMTIS Sinergia project n. CRSI22 127510 (see www.idiap.ch/comtis/).

References

Satanjeev Banerjee and Ted Pedersen. 2002. An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, LNCS 2276, pages 117–171. Springer, Berlin/Heidelberg.

Marine Carpuat and Dekai Wu. 2007. Improving Statistical Machine Translation using Word Sense Disambiguation. Proc. of EMNLP-CoNLL, pages 61–72, Prague.

Bruno Cartoni, Sandrine Zufferey, Thomas Meyer, and Andrei Popescu-Belis. 2011. How Comparable are Parallel Corpora? Measuring the Distribution of General Vocabulary and Connectives. Proc. of the 4th Workshop on Building and Using Comparable Corpora (BUCC), pages 78–86, Portland, OR.

Pi-Chuan Chang, Dan Jurafsky, and Christopher D. Manning. 2009. Disambiguating 'DE' for Chinese-English Machine Translation. Proc. of the Fourth Workshop on Statistical Machine Translation at EACL-2009, Athens.

Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best Parsing and MaxEnt Discriminative Reranking. Proc. of the 43rd Annual Meeting of the ACL, pages 173–180, Ann Arbor, MI.

Laurence Danlos and Charlotte Roze. 2011. Traduction (Automatique) des Connecteurs de Discours. Actes de la 18e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), Montpellier.

Robert Elwell and Jason Baldridge. 2008. Discourse Connective Argument Identification with Connective Specific Rankers. Proc. of the 2nd IEEE International Conference on Semantic Computing (ICSC), pages 198–205, Santa Clara, CA.

Hugo Hernault, Helmut Prendinger, David A. duVerle, and Mitsuru Ishizuka. 2010. HILDA: A Discourse Parser using Support Vector Machine Classification. Dialogue and Discourse, 3(1):1–33.

Philipp Koehn et al. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. Proc. of the 45th Annual Meeting of the ACL, Demonstration Session, pages 177–180, Prague.

Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. Proc. of MT Summit X, pages 79–86, Phuket.

Ziheng Lin, Hwee Tou Ng, and Min-Yen Kan. 2010. A PDTB-styled End-to-end Discourse Parser. Technical Report TRB8/10, School of Computing, National University of Singapore.

Jianjun Ma, Degen Huang, Haixia Liu, and Wenfeng Sheng. 2011. POS Tagging of English Particles for Machine Translation. Proc. of MT Summit XIII, pages 57–63, Xiamen.

Christopher Manning and Dan Klein. 2003. Optimization, MaxEnt Models, and Conditional Estimation without Magic. Tutorial at the HLT-NAACL and 41st ACL conferences, Edmonton, AB and Sapporo.

Daniel Marcu. 2000. The Theory and Practice of Discourse Parsing and Summarization. A Bradford Book. The MIT Press, Cambridge, MA.

George A. Miller. 1995. WordNet: A Lexical Database for English. Communications of the ACM, 38(11):39–41.

Eleni Miltsakaki, Nikhil Dinesh, Rashmi Prasad, Aravind Joshi, and Bonnie Webber. 2005. Experiments on Sense Annotations and Sense Disambiguation of Discourse Connectives. Proc. of the 4th Workshop on Treebanks and Linguistic Theories (TLT), Barcelona.

Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–51.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. Proc. of the 41st Annual Meeting of the ACL, pages 160–167, Sapporo.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. Proc. of the 40th Annual Meeting of the ACL, pages 311–318, Philadelphia, PA.

Emily Pitler and Ani Nenkova. 2009. Using Syntax to Disambiguate Explicit Discourse Connectives in Text. Proc. of the 47th Annual Meeting of the ACL and the 4th International Joint Conference of the AFNLP (ACL-IJCNLP), Short Papers, pages 13–16, Singapore.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008. The Penn Discourse Treebank 2.0. Proc. of the 6th International Conference on Language Resources and Evaluation (LREC), pages 2961–2968, Marrakech.

Marc Verhagen and James Pustejovsky. 2008. Temporal Processing with the TARSQI Toolkit. Proc. of the 22nd International Conference on Computational Linguistics (COLING), Demonstrations, pages 189–192, Manchester, UK.

Ben Wellner, James Pustejovsky, Catherine Havasi, Roser Sauri, and Anna Rumshisky. 2006. Classification of Discourse Coherence Relations: An Exploratory Study using Multiple Knowledge Sources. Proc. of the 7th SIGdial Meeting on Discourse and Dialog, pages 117–125, Sydney.

Ben Wellner. 2009. Sequence Models and Ranking Methods for Discourse Parsing. PhD thesis, Brandeis University, Waltham, MA.

Ying Zhang and Stefan Vogel. 2010. Significance Tests of Automatic Machine Translation Evaluation Metrics. Machine Translation, 24(1):51–65.

Author Index

Babych, Bogdan, 10, 101
Banchs, Rafael, 30
Bandyopadhyay, Sivaji, 93
Ceausu, Alexandru, 69
Chng, Eng Siong, 30
Chong, Tze Yuang, 30
Comas, Pere R., 20
Dandapat, Sandipan, 48
Eberle, Kurt, 101
España-Bonet, Cristina, 20
Federmann, Christian, 113
Fišer, Darja, 87
Geiß, Johanna, 101
Ginestí-Rosell, Mireia, 101
Gotoh, Yoshihiko, 38
Harriehausen-Mühlbauer, Bettina, 1
Hartley, Anthony, 101
Heuss, Timm, 1
Khan, Muhammad Usman Ghani, 38
Kirkedal, Andreas Søeborg, 77
Koeva, Svetla, 72
Meyer, Thomas, 129
Morrissey, Sara, 48
Nawab, Rao Muhammad Adeel, 38
Osenova, Petya, 119
Pal, Santanu, 93
Popescu-Belis, Andrei, 129
Rapp, Reinhard, 101
Sharoff, Serge, 101
Simov, Kiril, 119
Sofianopoulos, Sokratis, 65
Su, Fangzhong, 10
Tambouratzis, George, 65
Thomas, Martin, 101
Tinsley, John, 69
van Genabith, Josef, 48
Vassiliou, Marina, 65
Vertan, Cristina, 59
Vintar, Špela, 87
Vrščaj, Aljoša, 87
Wang, Rui, 119
Way, Andy, 48
Zhang, Jian, 69
