Workshop on Hybrid Approaches to Translation: Overview and Developments

Marta R. Costa-jussà and Rafael E. Banchs, Institute for Infocomm Research (1)
Reinhard Rapp, Aix-Marseille Université, LIF (2)
Patrik Lambert, Barcelona Media (3)
Kurt Eberle, Lingenio GmbH (4)
Bogdan Babych, University of Leeds (5)

1{vismrc,rembanchs}@i2r.a-star.edu.sg, [email protected], [email protected], [email protected], [email protected]

Abstract

An increasingly common trend in machine translation is to combine data-driven and rule-based techniques. Such combinations typically involve the hybridization of different paradigms such as, for instance, the introduction of linguistic knowledge into statistical paradigms, the incorporation of data-driven components into rule-based paradigms, or the pre- and post-processing of either sort of translation system outputs. The HyTra initiative was born with the aim of bringing together researchers and practitioners from the different multidisciplinary areas working in these directions, and of creating a brainstorming and discussion venue for Hybrid Translation approaches. This paper gives an overview of the Second Workshop on Hybrid Approaches to Translation (HyTra 2013) concerning its motivation, contents and outcomes.

1 Introduction

Machine translation (MT) has continuously been evolving from different perspectives. Early systems were basically dictionary-based. These approaches were further developed into more complex systems based on analysis, transfer and generation. The objective was to climb up (and down) the well-known Vauquois pyramid (see Figure 1) to facilitate the transfer phase or even to minimize the transfer by using an interlingua system. Then corpus-based approaches emerged, marking a turning point in the field by putting aside the analysis, generation and transfer phases.

Figure 1: Vauquois pyramid (image from Wikipedia).

Although there had been such a tendency right from the beginning (Wilks, 1994), in the last years the corpus-based approaches have reached a point where many researchers assume that relying exclusively on data might have serious limitations. Therefore, research has focused either on syntactical/hierarchical-based methods or on trying to augment the popular phrase-based systems by incorporating linguistic knowledge. In addition, and given the fact that research on rule-based MT has never stopped, there have been several proposals of hybrid architectures combining both rule-based and data-driven approaches.

In summary, there is currently a clear trend towards hybridization, with researchers adding morphological, syntactic and semantic knowledge to statistical systems, as well as combining data-driven methods with existing rule-based systems.

In this paper we provide a general overview of current approaches to hybrid MT within the context of the Second Workshop on Hybrid Approaches to Translation (HyTra 2013). In our overview, we classify hybrid MT approaches according to the linguistic levels that they address. We then briefly summarize the contributions presented and collected in this volume.

Proceedings of the Second Workshop on Hybrid Approaches to Translation, pages 1–6, Sofia, Bulgaria, August 8, 2013. © 2013 Association for Computational Linguistics

The paper is organized as follows. First, we motivate and summarize the main aspects of the HyTra initiative. Then, we present a general overview of the accepted papers and discuss them within the context of other state-of-the-art research in the area. Finally, we present our conclusions and discuss our proposed view of future directions for Hybrid MT research.

2 Overview of the HyTra Initiative

The HyTra initiative started in response to the increasing interest in hybrid approaches to machine translation, which is reflected in the substantial amount of work conducted on this topic. Another important motivation was the observation that, up to now, no single paradigm has been able to solve to a satisfactory extent all of the many challenges that the problem of machine translation poses.

The first HyTra workshop took place in conjunction with the EACL 2012 conference (Costa-jussà et al., 2012). The Second HyTra Workshop, which was co-organized by the authors of this paper, has been co-located with the ACL 2013 conference (Costa-jussà et al., 2013). The workshop has been supported by an extensive programme committee comprising members from over 30 organizations and representing more than 20 countries. As the outcome of a comprehensive peer reviewing process, and based on the recommendations of the programme committee, 15 papers were finally selected for either oral or poster presentation at the workshop.

The workshop also had the privilege of hosting two exceptional keynote speeches:

• Controlled Ascent: Imbuing Statistical MT with Linguistic Knowledge by Will Lewis and Chris Quirk (2013), Microsoft Research. The intersection of rule-based and statistical approaches in MT is explored, with a particular focus on past and current work done at Microsoft Research. One of their motivations for a hybrid approach is the observation that the times are over when huge improvements in translation quality were possible by simply adding more data to statistical systems. The reason is that most of the readily available parallel data has already been found.

• How much hybridity do we have? by Hermann Ney, RWTH Aachen. It is pointed out that after about 25 years the statistical approach to MT has been widely accepted as an alternative to the classical approach with manually designed rules. But in practice most statistical MT systems make use of manually designed rules, at least for preprocessing, in order to improve MT quality. This is exemplified by looking at the RWTH MT systems.

3 Hybrid Approaches Organized by Linguistic Levels

'Hybridization' of MT can be understood as the combination of several MT systems (possibly of very different architecture) where the single systems translate in parallel and compete for the best result (which is chosen by the integrating meta system). The workshop and the papers do not focus on this 'coarse-grained' hybridization (Eisele et al., 2008), but on a more 'fine-grained' one where the systems mix information from different levels of linguistic representation (see Figure 2). In the past, and mostly in the framework of rule-based machine translation (RBMT), experiments have been conducted with information from nearly every level, including phonetics and phonology for speech recognition and synthesis in speech-to-speech systems (Wahlster, 2000), pragmatics for dialog translation (Batliner et al., 2000a; Batliner et al., 2000b) and text coherence phenomena (Le Nagard and Koehn, 2010). In work with emphasis on statistical machine translation (SMT) and derivations of it, mainly those information levels have been used that address text in the sense of sets of sentences. As most of the workshop papers relate to this perspective, i.e. to hybridization that is defined using SMT as backbone, in this introduction it suffices to distinguish between approaches focused on morphology, syntax, and semantics. There are of course approaches which deal with more than one of these levels in an integrated manner, which are commonly referred to as multilevel approaches. As the case of treating syntax and morphology concurrently is especially common, we also consider morpho-syntax as a separate multilevel approach.

Figure 2: Major linguistic levels (image from Wikipedia).

3.1 Morphological approaches

The main approaches of statistical MT that exploit morphology can be classified into segmentation, generation, and enriching approaches. The first one attempts to minimize the vocabulary of highly inflected languages in order to symmetrize the (lexical granularity of the) source and the target language. The second one assumes that, due to data sparseness, not all morphological forms can be learned from parallel corpora and, therefore, proposes techniques to learn new morphological forms. The last one tries to enrich poorly inflected languages to compensate for their lack of morphology. In HyTra 2013, approaches treating morphology were addressed by the following contributions:

• Toral (2013) explores the selection of data to train domain-specific language models (LM) from non-domain-specific corpora by means of simplified morphology forms (such as lemmas). The benefit of this technique is tested using automatic metrics in the English-to-Spanish task. Results show a perplexity reduction of up to 8.17% over the baseline system.

• Rios Gonzales and Goehring (2013) propose machine learning techniques to decide on the correct form of a verb depending on the context. Basically they use treebanks to train the classifiers. Results show that they are able to disambiguate up to 89% of the Quechua verbs.

3.2 Syntactic approaches

Syntax had been addressed originally in SMT in the form of so-called phrase-based SMT without any reference to linguistic structures; during the last decade (or more) the approach evolved into, or was complemented by, work on syntax-based models in the linguistic sense of the word. Most such approaches can be classified into three different types of architecture that are defined by the type of syntactic analysis used for the source language and the type of generation aimed at for the target language: tree-to-tree, tree-to-string and string-to-tree. Additionally, there are also the so-called hierarchical systems, which combine the phrase-based and syntax-based approaches by using phrases as translation units and automatically generated context-free grammars as rules. Approaches dealing with the syntactic approach in HyTra 2013 include the following papers:

• Green and Žabokrtský (2013) study three different ways to ensemble parsing techniques and provide results in MT. They compute correlations between parsing quality and translation quality, showing that NIST correlates more strongly with parsing quality than BLEU does.

• Han et al. (2013) provide a framework for pre-reordering to make Chinese word order more similar to Japanese. To this purpose, they use unlabelled dependency structures of sentences and POS tags to identify verbal blocks and move them from after-the-object positions (SVO) to before-the-object positions (SOV).

• Nath Patel et al. (2013) also propose a pre-reordering technique, which uses a limited set of rules based on parse-tree modification rules and manual revision. The paper lists the rules in detail.

• Saers et al. (2013) report an unsupervised learning model that induces phrasal ITGs by breaking rules into smaller ones using minimum description length. The resulting translation model provides a basis for generalization to more abstract transduction grammars with informative non-terminals.

3.3 Morphosyntactical approaches

In linguistic theories, morphology and syntax are often considered and represented simultaneously (not only in unification-based approaches) and the same is true for MT systems.
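To make the pre-reordering idea that recurs in Sections 3.2 and 3.3 more concrete, the following Python sketch moves a root verb behind its right-hand dependents, turning an SVO-like order into an SOV-like one. It is only a toy illustration under simplifying assumptions and is not the method of Han et al. (2013) or of any other workshop paper: the Token class, the coarse POS labels, the function names and the English example sentence are all hypothetical, and only a single root verb is handled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    word: str
    pos: str   # coarse POS tag (hypothetical tag set), e.g. "NOUN", "VERB", "PUNCT"
    head: int  # index of the head token in an unlabelled dependency tree, -1 for the root


def descends_from(tokens: List[Token], i: int, ancestor: int) -> bool:
    """Return True if token i is `ancestor` or lies in its dependency subtree."""
    while i != -1:
        if i == ancestor:
            return True
        i = tokens[i].head
    return False


def reorder_svo_to_sov(tokens: List[Token]) -> List[str]:
    """Toy pre-reordering: place the root verb after its right-hand dependents."""
    root = next(
        (i for i, t in enumerate(tokens) if t.head == -1 and t.pos == "VERB"),
        None,
    )
    if root is None:
        # No root verb found: leave the sentence unchanged.
        return [t.word for t in tokens]

    # Find the last non-punctuation token to the right of the verb
    # that belongs to the verb's subtree (e.g. the end of the object).
    last = root
    for i in range(root + 1, len(tokens)):
        if tokens[i].pos != "PUNCT" and descends_from(tokens, i, root):
            last = i

    order = (
        list(range(root))                      # everything before the verb
        + list(range(root + 1, last + 1))      # right-hand dependents of the verb
        + [root]                               # the verb itself, now clause-final
        + list(range(last + 1, len(tokens)))   # trailing material such as punctuation
    )
    return [tokens[i].word for i in order]


if __name__ == "__main__":
    sentence = [
        Token("he", "PRON", 1),
        Token("reads", "VERB", -1),
        Token("the", "DET", 3),
        Token("book", "NOUN", 1),
        Token(".", "PUNCT", 1),
    ]
    print(reorder_svo_to_sov(sentence))  # ['he', 'the', 'book', 'reads', '.']
```

In a real system the reordered source side would be used both for training and for decoding, so that word alignment and phrase extraction see source and target in a more similar order. The sketch ignores verbal blocks with auxiliaries, embedded clauses and labelled relations, all of which a practical rule set would need to handle.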

• Laki et al. (2013) combine pre-reordering rules with morphological and factored models for English-to-Hungarian.

• Li et al. (2013) propose pre-reordering rules to be used for alignment-based reordering, and corresponding POS-based restructuring of the input. Basically, they focus on taking advantage of the fact that Korean has compound words, which, for the purpose of alignment, are split and reordered similarly to Chinese.

• Turki Khemakhem et al. (2013) present work on an English-Arabic SMT system that uses morphological decomposition and morpho-syntactic annotation of the target language and incorporates the corresponding information in a statistical feature model. Essentially, the statistical feature language model replaces words by feature arrays.

3.4 Semantic approaches

Semantics has been introduced into statistical MT to solve word sense disambiguation challenges, covering the area of lexical semantics. More recently, there have been different techniques using semantic roles, covering shallow semantics, as well as the use of distributional semantics for improving translation unit selection. Approaches treating the incorporation of semantics into MT in HyTra 2013 include the following research work:

• Rudnick and Gasser (2013) present a combination of Maximum Entropy Markov Models and HMM to perform lexical selection in the sense of cross-lingual word sense disambiguation (i.e. by choice from the set of translation alternatives). The system is meant to be integrated into an RBMT system.

• Boujelbane et al. (2013) propose to build a bilingual lexicon for the Tunisian dialect using Modern Standard Arabic (MSA). The methodology is based on leveraging the large available annotated MSA resources by exploiting MSA-dialect similarities and addressing the known differences. The authors study morphological, syntactic and lexical differences by exploiting the Penn Arabic Treebank, and use the differences to develop rules and to build dialectal concepts.

• Bouillon et al. (2013) present two methodologies to correct homophone confusions. The first one is based on hand-coded rules and the second one is based on weighted graphs derived from a pronunciation resource.

3.5 Other multilevel approaches

In a number of linguistic theories, information from the morphological, syntactic and semantic levels is considered conjointly and merged in corresponding representations (an RBMT example is LFG (Lexical Functional Grammars) analysis and the corresponding XLE translation architecture). In HyTra 2013 there are three approaches dealing with multilevel information:

• Pal et al. (2013) propose a combination of aligners (GIZA++, Berkeley and rule-based) for English-Bengali.

• Hsieh et al. (2013) use comparable corpora extracted from Wikipedia to extract parallel fragments for the purpose of extending an English-Bengali training corpus.

• Tambouratzis et al. (2013) describe a hybrid MT architecture that uses a very small bilingual corpus and a large monolingual one. The linguistic information is extracted using pattern recognition techniques.

Table 1 summarizes the papers that have been presented in the Second HyTra Workshop. The papers are arranged in the table according to the linguistic level they address.

Morphological
    (Toral, 2013): Hybrid Selection of LM Training Data Using Linguistic Information and Perplexity
    (Gonzales and Goehring, 2013): Machine Learning disambiguation of Quechua verb morphology
Syntax
    (Green and Žabokrtský, 2013): Improvements to SBMT using Ensemble Dependency Parser
    (Han et al., 2013): Using unlabeled dependency parsing for pre-reordering for Chinese-to-Japanese SMT
    (Patel et al., 2013): Reordering rules for English-Hindi SMT
    (Saers et al., 2013): Unsupervised transduction grammar induction via MDL
Morpho-syntactic
    (Laki et al., 2013): English to Hungarian morpheme-based SMT system with reordering rules
    (Li et al., 2013): Experiments with POS-based restructuring and alignment based reordering for SMT
    (Khemakhem et al., 2013): Integrating morpho-syntactic feature for English Arabic SMT
Semantic
    (Rudnick and Gasser, 2013): Lexical Selection for Hybrid MT with Sequence Labeling
    (Boujelbane et al., 2013): Building bilingual lexicon to create dialect Tunisian corpora and adapt LM
    (Bouillon et al., 2013): Two approaches to correcting homophone confusions in a hybrid SMT based system
Multilevels
    (Pal et al., 2013): A hybrid Word alignment model for PBSMT
    (Hsieh et al., 2013): Uses of monolingual in-domain corpora for cross-domain adaptation with hybrid MT approaches
    (Tambouratzis et al., 2013): Overview of a language-independent hybrid MT methodology

Table 1: HyTra 2013 paper overview.

4 Conclusions and further work

The success of the Second HyTra Workshop confirms that research in hybrid approaches to MT systems is a very active and promising area. The MT community seems to agree that pure data-driven or rule-based paradigms have strong limitations and that hybrid systems are a promising direction to overcome most of these limitations. Considerable progress has been made in this area recently, as demonstrated by consistent improvements for different language pairs and translation tasks.

The research community is working hard, with strong collaborations and with more resources at hand than ever before. However, it is not clear whether technological breakthroughs as in the past are still possible, or if MT will be turning into a research field with only incremental advances. The question is: have we reached the point at which only refinements to existing approaches are needed? Or, on the contrary, do we need a new turning point?

Our guess is that, similar to the inflection point giving rise to the statistical MT approach during the last decade of the twentieth century, once again there might occur a new discovery which will further revolutionize the research on MT. We cannot know whether hybrid approaches will be involved; but, in any case, this seems to be a good and smart direction as it is open to the full spectrum of ideas and, thus, it should help to push the field forward.

Acknowledgments

This workshop has been supported by the Seventh Framework Program of the European Commission through the Marie Curie actions HyghTra, IMTraP, AutoWordNet and CrossLingMind and the Spanish "Ministerio de Economía y Competitividad" and the European Regional Development Fund through SpeechTech4all. We would like to thank the funding institution and all people who contributed towards making the workshop a success. For a more comprehensive list of acknowledgments refer to the preface of this volume.

References

Anton Batliner, J. Buckow, Heinrich Niemann, Elmar Nöth, and Volker Warnke. 2000a. The Prosody Module, pages 106–121. New York, Berlin.

Anton Batliner, Richard Huber, Heinrich Niemann, Elmar Nöth, Jörg Spilker, and K. Fischer. 2000b. The Recognition of Emotion, pages 122–130. New York, Berlin.

Pierrette Bouillon, Johanna Gerlach, Ulrich Germann, Barry Haddow, and Manny Rayner. 2013. Two approaches to correcting homophone confusions in a hybrid machine translation system. In ACL Workshop on Hybrid Machine Approaches to Translation, Sofia.

Rahma Boujelbane, Mariem Ellouze Khemekhem, Siwar BenAyed, and Lamia HadrichBelguith. 2013. Building bilingual lexicon to create dialect Tunisian corpora and adapt language model. In ACL Workshop on Hybrid Machine Approaches to Translation, Sofia.

Marta R. Costa-jussà, Patrik Lambert, Rafael E. Banchs, Reinhard Rapp, and Bogdan Babych, editors. 2012. Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra). Association for Computational Linguistics, Avignon, France, April.

Marta R. Costa-jussà, Patrik Lambert, Rafael E. Banchs, Reinhard Rapp, Bogdan Babych, and Kurt Eberle, editors. 2013. Proceedings of the Second Workshop on Hybrid Approaches to Translation (HyTra). Association for Computational Linguistics, Sofia, Bulgaria, August.

Andreas Eisele, Christian Federmann, Hans Uszkoreit, Hervé Saint-Amand, Martin Kay, Michael Jellinghaus, Sabine Hunsicker, Teresa Herrmann, and Yu Chen. 2008. Hybrid machine translation architectures within and beyond the EuroMatrix project. In John Hutchins and Walther v. Hahn, editors, 12th annual conference of the European Association for Machine Translation (EAMT), pages 27–34, Hamburg, Germany.

Annette Rios Gonzales and Anne Goehring. 2013. Machine learning disambiguation of Quechua verb morphology. In ACL Workshop on Hybrid Machine Approaches to Translation, Sofia.

Nathan Green and Zdeněk Žabokrtský. 2013. Improvements to syntax-based machine translation using ensemble dependency parsers. In ACL Workshop on Hybrid Machine Approaches to Translation, Sofia.

Dan Han, Pascual Martinez-Gomez, Yusuke Miyao, Katsuhito Sudoh, and Masaaki Nagata. 2013. Using unlabeled dependency parsing for pre-reordering for Chinese-to-Japanese statistical machine translation. In ACL Workshop on Hybrid Machine Approaches to Translation, Sofia.

An-Chang Hsieh, Hen-Hsen Huang, and Hsin-Hsi Chen. 2013. Uses of monolingual in-domain corpora for cross-domain adaptation with hybrid MT approaches. In ACL Workshop on Hybrid Machine Approaches to Translation, Sofia.

Ines Turki Khemakhem, Salma Jamoussi, and Abdelmajid Ben Hamadou. 2013. Integrating morpho-syntactic feature in English-Arabic statistical machine translation. In ACL Workshop on Hybrid Machine Approaches to Translation, Sofia.

László Laki, Attila Novak, and Borbála Siklósi. 2013. English to Hungarian morpheme-based statistical machine translation system with reordering rules. In ACL Workshop on Hybrid Machine Approaches to Translation, Sofia.

Ronan Le Nagard and Philipp Koehn. 2010. Aiding pronoun translation with co-reference resolution. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 252–261, Uppsala, Sweden, July. Association for Computational Linguistics.

Will Lewis and Chris Quirk. 2013. Controlled ascent: Imbuing statistical MT with linguistic knowledge. In ACL Workshop on Hybrid Machine Approaches to Translation, Sofia.

Shuo Li, Derek F. Wong, and Lidia S. Chao. 2013. Experiments with POS-based restructuring and alignment-based reordering for statistical machine translation. In ACL Workshop on Hybrid Machine Approaches to Translation, Sofia.

Santanu Pal, Sudip Naskar, and Sivaji Bandyopadhyay. 2013. A hybrid word alignment model for phrase-based statistical machine translation. In ACL Workshop on Hybrid Machine Approaches to Translation, Sofia.

Raj Nath Patel, Rohit Gupta, Prakash B. Pimpale, and Sasikumar M. 2013. Reordering rules for English-Hindi SMT. In ACL Workshop on Hybrid Machine Approaches to Translation, Sofia.

Alex Rudnick and Michael Gasser. 2013. Lexical selection for hybrid MT with sequence labeling. In ACL Workshop on Hybrid Machine Approaches to Translation, Sofia.

Markus Saers, Karteek Addanki, and Dekai Wu. 2013. Unsupervised transduction grammar induction via minimum description length. In ACL Workshop on Hybrid Machine Approaches to Translation, Sofia.

George Tambouratzis, Sokratis Sofianopoulos, and Marina Vassiliou. 2013. Language-independent hybrid MT with PRESEMT. In ACL Workshop on Hybrid Machine Approaches to Translation, Sofia.

Antonio Toral. 2013. Hybrid selection of language model training data using linguistic information and perplexity. In ACL Workshop on Hybrid Machine Approaches to Translation, Sofia.

Wolfgang Wahlster, editor. 2000. Verbmobil: Foundations of Speech-to-Speech Translation. Springer, Berlin, Heidelberg, New York.

Yorick Wilks. 1994. Stone soup and the French room: The empiricist-rationalist debate about machine translation. Current Issues in Computational Linguistics: in honor of Don Walker, pages 585–594. Pisa, Italy: Giardini / Dordrecht, The Netherlands: Kluwer Academic.