2: Theory and Praxis of Decoding P Statistical Machine Translation Statistical Machine Translation P Lecture 2

Total Page:16

File Type:pdf, Size:1020Kb

2: Theory and Praxis of Decoding P Statistical Machine Translation Statistical Machine Translation P Lecture 2 Statistical Machine Translation — Lecture 2: Theory and Praxis of Decoding p Statistical Machine Translation Statistical Machine Translation p Lecture 2 ¯ Components: Translation model, language model, decoder Theory and Praxis of Decoding foreign/English English Philipp Koehn parallel text text [email protected] statistical analysis statistical analysis School of Informatics University of Edinburgh Translation Language Model Model Decoding Algorithm – p.1 – p.2 Philipp Koehn, University of Edinburgh 2 Statistical Machine Translation — Lecture 2: Theory and Praxis of Decoding p Statistical Machine Translation — Lecture 2: Theory and Praxis of Decoding p Phrase-Based Systems p Phrase-Based Translation p ¯ A number of research groups developed phrase-based Morgen fliege ich nach Kanada zur Konferenz systems ( RWTH Aachen, Univ. of Southern California/ISI, CMU, IBM, Johns Hopkins Univ., Cambridge Univ., Univ. of Catalunya, ITC-irst, Univ. Edinburgh, Univ. of Maryland...) Tomorrow I will fly to the conference in Canada ¯ Systems differ in – training methods ¯ Foreign input is segmented in phrases – model for phrase translation table – any sequence of words, not necessarily linguistically motivated – reordering models ¯ Each phrase is translated into English – additional feature functions ¯ Phrases are reordered ¯ Currently best method for SMT (MT?) – top systems in DARPA/NIST evaluation are phrase-based – best commercial system for Arabic-English is phrase-based – p.3 – p.4 Philipp Koehn, University of Edinburgh 3 Philipp Koehn, University of Edinburgh 4 Statistical Machine Translation — Lecture 2: Theory and Praxis of Decoding p Statistical Machine Translation — Lecture 2: Theory and Praxis of Decoding p Phrase Translation Table p Decoding Process p Maria no dio una bofetada a la bruja verde ¯ Phrase Translations for “den Vorschlag”: j j English (e f) English (e f) the proposal 0.6227 the suggestions 0.0114 ’s proposal 0.1068 the proposed 0.0114 a proposal 0.0341 the motion 0.0091 ¯ Build translation left to right the idea 0.0250 the idea of 0.0091 – select foreign words to be translated this proposal 0.0227 the proposal , 0.0068 proposal 0.0205 its proposal 0.0068 of the proposal 0.0159 it 0.0068 the proposals 0.0159 ... ... – p.5 – p.6 Philipp Koehn, University of Edinburgh 5 Philipp Koehn, University of Edinburgh 6 Statistical Machine Translation — Lecture 2: Theory and Praxis of Decoding p Statistical Machine Translation — Lecture 2: Theory and Praxis of Decoding p Decoding Process p Decoding Process p Maria no dio una bofetada a la bruja verde Maria no dio una bofetada a la bruja verde Mary Mary ¯ ¯ Build translation left to right Build translation left to right – select foreign words to be translated – select foreign words to be translated – find English phrase translation – find English phrase translation – add English phrase to end of partial translation – add English phrase to end of partial translation – mark foreign words as translated – p.7 – p.8 Philipp Koehn, University of Edinburgh 7 Philipp Koehn, University of Edinburgh 8 Statistical Machine Translation — Lecture 2: Theory and Praxis of Decoding p Statistical Machine Translation — Lecture 2: Theory and Praxis of Decoding p Decoding Process p Decoding Process p Maria no dio una bofetada a la bruja verde Maria no dio una bofetada a la bruja verde Mary did not Mary did not slap ¯ ¯ One to many translation Many to one translation – p.9 – p.10 Philipp Koehn, University of Edinburgh 9 Philipp Koehn, University of Edinburgh 10 Statistical Machine Translation — Lecture 2: Theory and Praxis of Decoding p Statistical Machine Translation — Lecture 2: Theory and Praxis of Decoding p Decoding Process p Decoding Process p Maria no dio una bofetada a la bruja verde Maria no dio una bofetada a la bruja verde Mary did not slap the Mary did not slap the green ¯ ¯ Many to one translation Reordering – p.11 – p.12 Philipp Koehn, University of Edinburgh 11 Philipp Koehn, University of Edinburgh 12 Statistical Machine Translation — Lecture 2: Theory and Praxis of Decoding p Statistical Machine Translation — Lecture 2: Theory and Praxis of Decoding p Decoding Process p Translation Options p Maria no dio una bofetada a la bruja verde Maria no dio una bofetada a labruja verde Mary not give a slap to the witch green did not a slap by green witch no slap to the did not give to the Mary did not slap the green witch slap the witch ¯ Look up possible phrase translations ¯ Translation finished – many different ways to segment words into phrases – many different ways to translate each phrase – p.13 – p.14 Philipp Koehn, University of Edinburgh 13 Philipp Koehn, University of Edinburgh 14 Statistical Machine Translation — Lecture 2: Theory and Praxis of Decoding p Statistical Machine Translation — Lecture 2: Theory and Praxis of Decoding p Hypothesis Expansion p Hypothesis Expansion p Maria no dio una bofetada a labruja verde Maria no dio una bofetada a labruja verde Mary not give a slap to the witch green Mary not give a slap to the witch green did not a slap by green witch did not a slap by green witch no slap to the no slap to the did not give to did not give to the the slap the witch slap the witch e: e: e: Mary f: --------- f: --------- f: *-------- p: 1 p: 1 p: .534 ¯ Start with empty hypothesis ¯ Pick translation option – e: no English words ¯ Create hypothesis – f: no foreign words covered – e: add English phrase Mary – p: probability 1 – f: first foreign word covered – p: probability 0.534 – p.15 – p.16 Philipp Koehn, University of Edinburgh 15 Philipp Koehn, University of Edinburgh 16 Statistical Machine Translation — Lecture 2: Theory and Praxis of Decoding p Statistical Machine Translation — Lecture 2: Theory and Praxis of Decoding p A Quick Word on Probabilities p Hypothesis Expansion p Maria no dio una bofetada a labruja verde ¯ Not going into detail here, but... Mary not give a slap to the witch green did not a slap by green witch ¯ Translation Model no slap to the did not give to – phrase translation probability p(MaryjMaria) the slap the witch – reordering costs e: witch f: -------*- – phrase/word count costs p: .182 e: e: Mary – ... f: --------- f: *-------- p: 1 p: .534 ¯ Language Model – uses trigrams: ¯ Add another hypothesis > j < > j < – p(Mary did not) = p(Maryj s ) * p(did Mary, s ) * p(not Mary did) – p.17 – p.18 Philipp Koehn, University of Edinburgh 17 Philipp Koehn, University of Edinburgh 18 Statistical Machine Translation — Lecture 2: Theory and Praxis of Decoding p Statistical Machine Translation — Lecture 2: Theory and Praxis of Decoding p Hypothesis Expansion p Hypothesis Expansion p Maria no dio una bofetada a labruja verde Maria no dio una bofetada a la bruja verde Mary not give a slap to the witch green Mary not give a slap to the witch green did not a slap by green witch did not a slap by green witch no slap to the no slap to the did not give to did not give to the the slap the witch slap the witch e: witch e: ... slap e: witch e: slap f: -------*- f: *-***---- f: -------*- f: *-***---- p: .182 p: .043 p: .182 p: .043 e: e: Mary e: e: Mary e: did not e: slap e: the e:green witch f: --------- f: *-------- f: --------- f: *-------- f: **------- f: *****---- f: *******-- f: ********* p: 1 p: .534 p: 1 p: .534 p: .154 p: .015 p: .004283 p: .000271 ¯ ... until all foreign words covered ¯ Further hypothesis expansion – find best hypothesis that covers all foreign words – backtrack to read off translation – p.19 – p.20 Philipp Koehn, University of Edinburgh 19 Philipp Koehn, University of Edinburgh 20 Statistical Machine Translation — Lecture 2: Theory and Praxis of Decoding p Statistical Machine Translation — Lecture 2: Theory and Praxis of Decoding p Hypothesis Expansion p Explosion of Search Space p Maria no dio una bofetada a labruja verde ¯ Number of hypotheses is exponential with respect to Mary not give a slap to the witch green did not a slap by green witch sentence length no slap to the did not give to the slap the witch µ Decoding is NP-complete [Knight, 1999] e: witch e: slap f: -------*- f: *-***---- µ Need to reduce search space p: .182 p: .043 e: e: Mary e: did not e: slap e: the e:green witch – risk free: hypothesis recombination f: --------- f: *-------- f: **------- f: *****---- f: *******-- f: ********* p: 1 p: .534 p: .154 p: .015 p: .004283 p: .000271 – risky: histogram/threshold pruning ¯ Adding more hypothesis µ Explosion of search space – p.21 – p.22 Philipp Koehn, University of Edinburgh 21 Philipp Koehn, University of Edinburgh 22 Statistical Machine Translation — Lecture 2: Theory and Praxis of Decoding p Statistical Machine Translation — Lecture 2: Theory and Praxis of Decoding p Hypothesis Recombination p Hypothesis Recombination p p=0.092 p=0.092 p=1 p=0.534 p=0.092 p=1 p=0.534 p=0.092 Mary did not give Mary did not give did not did not give give p=0.164 p=0.044 p=0.164 ¯ ¯ Different paths to the same partial translation Different paths to the same partial translation µ Combine paths – drop weaker hypothesis – keep pointer from worse path – p.23 – p.24 Philipp Koehn, University of Edinburgh 23 Philipp Koehn, University of Edinburgh 24 Statistical Machine Translation — Lecture 2: Theory and Praxis of Decoding p Statistical Machine Translation — Lecture 2: Theory and Praxis of Decoding p Hypothesis Recombination p Hypothesis Recombination p p=0.092 p=0.017 p=0.092 did not give did not give Joe Joe p=1 p=0.534 p=0.092 p=1 p=0.534 p=0.092 Mary did not give Mary did not give did not did not give give p=0.164 p=0.164 ¯ ¯ Recombined hypotheses
Recommended publications
  • (Meta-) Evaluation of Machine Translation
    (Meta-) Evaluation of Machine Translation Chris Callison-Burch Cameron Fordyce Philipp Koehn Johns Hopkins University CELCT University of Edinburgh ccb cs jhu edu fordyce celct it pkoehn inf ed ac uk Christof Monz Josh Schroeder Queen Mary, University of London University of Edinburgh christof dcs qmul ac uk j schroeder ed ac uk Abstract chine translation evaluation workshop. Here we ap- ply eleven different automatic evaluation metrics, This paper evaluates the translation quality and conduct three different types of manual evalu- of machine translation systems for 8 lan- ation. guage pairs: translating French, German, Beyond examining the quality of translations pro- Spanish, and Czech to English and back. duced by various systems, we were interested in ex- We carried out an extensive human evalua- amining the following questions about evaluation tion which allowed us not only to rank the methodologies: How consistent are people when different MT systems, but also to perform they judge translation quality? To what extent do higher-level analysis of the evaluation pro- they agree with other annotators? Can we im- cess. We measured timing and intra- and prove human evaluation? Which automatic evalu- inter-annotator agreement for three types of ation metrics correlate most strongly with human subjective evaluation. We measured the cor- judgments of translation quality? relation of automatic evaluation metrics with This paper is organized as follows: human judgments. This meta-evaluation re- veals surprising facts about the most com- • Section 2 gives an overview of the shared task. monly used methodologies. It describes the training and test data, reviews the baseline system, and lists the groups that 1 Introduction participated in the task.
    [Show full text]
  • Public Project Presentation and Updates
    7.2: Public Project Presentation and Updates Philipp Koehn Distribution: Public EuroMatrix Statistical and Hybrid Machine Translation Between All European Languages IST 034291 Deliverable 7.2 December 21, 2007 Project funded by the European Community under the Sixth Framework Programme for Research and Technological Development. Project ref no. IST-034291 Project acronym EuroMatrix Project full title Statistical and Hybrid Machine Translation Between All Eu- ropean Languages Instrument STREP Thematic Priority Information Society Technologies Start date / duration 01 September 2006 / 30 Months Distribution Public Contractual date of delivery September 1, 2007 Actual date of delivery September 13, 2007 Deliverable number 7.2 Deliverable title Public Project Presentation and Updates Type Report, contractual Status & version Finished Number of pages 19 Contributing WP(s) WP7 WP / Task responsible WP7 / 7.6 Other contributors none Author(s) Philipp Koehn EC project officer Xavier Gros Keywords The partners in EuroMatrix are: Saarland University (USAAR) University of Edinburgh (UEDIN) Charles University (CUNI-MFF) CELCT GROUP Technologies MorphoLogic For copies of reports, updates on project activities and other EuroMatrix-related in- formation, contact: The EuroMatrix Project Co-ordinator Prof. Hans Uszkoreit Universit¨at des Saarlandes, Computerlinguistik Postfach 15 11 50 66041 Saarbr¨ucken, Germany [email protected] Phone +49 (681) 302-4115- Fax +49 (681) 302-4700 Copies of reports and other material can also be accessed via the project’s homepage: http://www.euromatrix.net/ c 2007, The Individual Authors No part of this document may be reproduced or transmitted in any form, or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission from the copyright owner.
    [Show full text]
  • 8.14: Annual Public Report
    8.14: Annual Public Report S. Busemann, O. Bojar, C. Callison-Burch, M. Cettolo, R. Garabik, J. van Genabith, P. Koehn, H. Schwenk, K. Simov, P. Wolf Distribution: Public EuroMatrixPlus Bringing Machine Translation for European Languages to the User ICT 231720 Deliverable 8.14 Project funded by the European Community under the Seventh Framework Programme for Research and Technological Development. Project ref no. ICT-231720 Project acronym EuroMatrixPlus Project full title Bringing Machine Translation for European Languages to the User Instrument STREP Thematic Priority ICT-2007.2.2 Cognitive systems, interaction, robotics Start date / duration 01 March 2009 / 38 Months Distribution Public Contractual date of delivery November 15, 2011 Actual date of delivery December 05, 2011 Date of last update December 5, 2011 Deliverable number 8.14 Deliverable title Annual Public Report Type Report Status & version Final Number of pages 13 Contributing WP(s) all WP / Task responsible DFKI Other contributors All Participants Author(s) S. Busemann, O. Bojar, C. Callison-Burch, M. Cettolo, R. Garabik, J. van Genabith, P. Koehn, H. Schwenk, K. Simov, P. Wolf EC project officer Michel Brochard Keywords The partners in DFKI GmbH, Saarbr¨ucken (DFKI) EuroMatrixPlus University of Edinburgh (UEDIN) are: Charles University (CUNI-MFF) Johns Hopkins University (JHU) Fondazione Bruno Kessler (FBK) Universit´edu Maine, Le Mans (LeMans) Dublin City University (DCU) Lucy Software and Services GmbH (Lucy) Central and Eastern European Translation, Prague (CEET) Ludovit Stur Institute of Linguistics, Slovak Academy of Sciences (LSIL) Institute of Information and Communication Technologies, Bulgarian Academy of Sciences (IICT-BAS) For copies of reports, updates on project activities and other EuroMatrixPlus-related information, contact: The EuroMatrixPlus Project Co-ordinator Prof.
    [Show full text]
  • Statistical and Hybrid Machine Translation Between All European Languages
    Statistical and hybrid machine translation between all European languages Publishable executive summary 01 September 2006 – 30 November 2007 Acronym: EuroMatrix Contract N°: IST 034291 Duration: 30 months from 01/09/2006 Coordinator: Prof. Dr. Hans Uszkoreit Saarland University D-66123 Saarbrücken [email protected] Scientific coordinator: Dr. Philipp Koehn Edinburgh University [email protected] PROJECT FUNDED BY THE EUROPEAN COMMUNITY UNDER THE SIXTH FRAMEWORK PROGRAMME FOR RESEARCH AND TECHNOLOGICAL DEVELOPMENT. Contractors involved Saarland University Germany Computational Linguistics The University of Edinburgh United Kingdom School of Informatics Univerzita Karlova v Praze/ Charles University Prague Czech Republic Institute of Formal and Applied Linguistics CELCT Center for the Evaluation of Language and Italy Communication Technologies MorphoLogic Hungary Germany Group Technologies AG -------------------------------------------------------------------------------------------- Publishable Executive Summary ©The EuroMatrix consortium, 2008 Summary description of project objectives EuroMatrix aims at a major push in machine translation technology applying the most advanced MT technologies systematically to all pairs of EU languages. Special attention is being paid to the languages of the new and near-term prospective member states. As part of this application development, EuroMatrix designs and investigates novel combinations of statistical techniques and linguistic knowledge sources as well as hybrid MT architectures. EuroMatrix addresses urgent European economic and social needs by concentrating on European languages and on high-quality translation to be employed for the publication of technical, social, legal and political documents. EuroMatrix aims at enriching the statistical MT approach with novel learning paradigms and experiment with new combinations of methods and resources from statistical MT, rule-based MT, shallow language processing and computational lexicography/morphology.
    [Show full text]
  • Abstract the Circle of Meaning: from Translation
    ABSTRACT Title of dissertation: THE CIRCLE OF MEANING: FROM TRANSLATION TO PARAPHRASING AND BACK Nitin Madnani, Doctor of Philosophy, 2010 Dissertation directed by: Professor Bonnie Dorr Department of Computer Science The preservation of meaning between inputs and outputs is perhaps the most ambitious and, often, the most elusive goal of systems that attempt to process natural language. Nowhere is this goal of more obvious importance than for the tasks of machine translation and paraphrase generation. Preserving meaning between the input and the output is paramount for both, the monolingual vs bilingual distinction notwithstanding. In this thesis, I present a novel, symbiotic relationship between these two tasks that I term the “circle of meaning”. Today’s statistical machine translation (SMT) systems require high quality human translations for parameter tuning, in addition to large bi-texts for learning the translation units. This parameter tuning usually involves generating translations at different points in the parameter space and obtaining feedback against human- authored reference translations as to how good the translations. This feedback then dictates what point in the parameter space should be explored next. To measure this feedback, it is generally considered wise to have multiple (usually 4) reference translations to avoid unfair penalization of translation hypotheses which could easily happen given the large number of ways in which a sentence can be translated from one language to another. However, this reliance on multiple reference translations creates a problem since they are labor intensive and expensive to obtain. Therefore, most current MT datasets only contain a single reference. This leads to the problem of reference sparsity—the primary open problem that I address in this dissertation— one that has a serious effect on the SMT parameter tuning process.
    [Show full text]
  • Improving Statistical Machine Translation Efficiency by Triangulation
    Improving Statistical Machine Translation Efficiency by Triangulation Yu Chen, Andreas Eisele, Martin Kay Saarland University Saarbrucken,¨ Germany {yuchen, eisele, kay}@coli.uni-sb.de Abstract In current phrase-based Statistical Machine Translation systems, more training data is generally better than less. However, a larger data set eventually introduces a larger model that enlarges the search space for the decoder, and consequently requires more time and more resources to translate. This paper describes an attempt to reduce the model size by filtering out the less probable entries based on testing correlation using additional training data in an intermediate third language. The central idea behind the approach is triangulation, the process of incorporating multilingual knowledge in a single system, which eventually utilizes parallel corpora available in more than two languages. We conducted experiments using Europarl corpus to evaluate our approach. The reduction of the model size can be up to 70% while the translation quality is being preserved. 1. Introduction 2. Phrase-based SMT The dominant paradigm in SMT is referred to as phrase- Statistical machine translation (SMT) is now generally based machine translation. The term refers to a sequence of taken to be an enterprise in which machine learning tech- words characterized by its statistical rather then any gram- niques are applied to a bilingual corpus to produce a trans- matical properties. At the center of the process is a phrase lation system entirely automatically. Such a scheme has table, a list of phrases identified in a source sentence to- many potential advantages over earlier systems which re- gether with potential translations.
    [Show full text]
  • Factored Translation Models and Discriminative Training For
    Machine Translation at Edinburgh Factored Translation Models and Discriminative Training Philipp Koehn, University of Edinburgh 9 July 2007 Philipp Koehn, University of Edinburgh EuroMatrix 9 July 2007 1 Overview • Intro: Machine Translation at Edinburgh • Factored Translation Models • Discriminative Training Philipp Koehn, University of Edinburgh EuroMatrix 9 July 2007 2 The European Challenge Many languages • 11 official languages in EU-15 • 20 official languages in EU-25 • many more minority languages Challenge • European reports, meetings, laws, etc. • develop technology to enable use of local languages as much as possible Philipp Koehn, University of Edinburgh EuroMatrix 9 July 2007 3 Existing MT systems for EU languages [from Hutchins, 2005] Cze Dan Dut Eng Est Fin Fre Ger Gre Hun Ita Lat Lit Mal Pol Por Slo Slo Spa Swe Czech –..1..11..1.........4 Danish .–.....1............1 Dutch ..–6..21............9 English 2 . 6 – . 42 48 3 3 29 1 . 7 30 2 . 48 1 222 Estonian ....–...............0 Finnish ...2.–.1............3 French 1 . 2 38 . – 22 3 . 9 . 1 5 . 10 . 91 German 1 1 1 49 . 1 23 – . 1 8 . 4 3 2 . 8 1 103 Greek ...2..3.–...........5 Hungarian ...1...1.–..........2 Italian 1 . 25 . 9 8 . – . 1 3 . 7 . 54 Latvian ...1.......–........1 Lithuanian ............–.......0 Maltese .............–......0 Polish ...6..13..1...–2..1.14 Portuguese . 25 . 4 4 . 3 . 1 – . 6 . 43 Slovak ...1...1........–...2 Slovene .................–..0 Spanish 1 . 42 . 8 7 . 7 . 1 6 . – . 72 Swedish ...2...1...........–3 6 1 9 201 0 1 93 99 6 4 58 1 0 0 15 49 4
    [Show full text]
  • Convergence of Translation Memory and Statistical Machine Translation
    Convergence of Translation Memory and Statistical Machine Translation Philipp Koehn Jean Senellart University of Edinburgh Systran 10 Crichton Street La Grande Arche Edinburgh, EH8 9AB 1, Parvis de la Defense´ Scotland, United Kingdom 92044 Paris, France [email protected] [email protected] Abstract tems (Koehn, 2010) are built by fully automati- cally analyzing translated text and learning the rules. We present two methods that merge ideas SMT has been embraced by the academic and com- from statistical machine translation (SMT) mercial research communities as the new dominant and translation memories (TM). We use a TM paradigm in machine translation. Almost all re- to retrieve matches for source segments, and cently published papers on machine translation are replace the mismatched parts with instructions to an SMT system to fill in the gap. We published on new SMT techniques. The method- show that for fuzzy matches of over 70%, one ology has left the research labs and become the method outperforms both SMT and TM base- basis of successful companies such as Language lines. Weaver and the highly visible Google and Microsoft web translation services. Even traditional rule-based companies such as Systran have embraced statisti- 1 Introduction cal methods and integrated them into their systems Two technological advances in the field of au- (Dugast et al., 2007). tomated language translation, translation memory The two technologies have not touched much in (TM) and statistical machine translation (SMT), the past not only because of the different devel- have seen vast progress over the last decades, but opment communities (software suppliers to transla- they have been developed very much in isolation.
    [Show full text]
  • Statistical Machine Translation
    ESSLLI Summer School 2008 Statistical Machine Translation Chris Callison-Burch, Johns Hopkins University Philipp Koehn, University of Edinburgh Intro to Statistical MT EuroMatrix MT Marathon Chris Callison-Burch Various approaches • Word-for-word translation • Syntactic transfer • Interlingual approaches • Controlled language • Example-based translation • Statistical translation 3 Advantages of SMT • Data driven • Language independent • No need for staff of linguists of language experts • Can prototype a new system quickly and at a very low cost Statistical machine translation • Find most probable English sentence given a foreign language sentence • Automatically align words and phrases within sentence pairs in a parallel corpus • Probabilities are determined automatically by training a statistical model using the parallel corpus 4 Parallel corpus what is more , the relevant cost im übrigen ist die diesbezügliche dynamic is completely under control. kostenentwicklung völlig unter kontrolle . sooner or later we will have to be früher oder später müssen wir die sufficiently progressive in terms of own notwendige progressivität der eigenmittel als resources as a basis for this fair tax grundlage dieses gerechten steuersystems system . zur sprache bringen . we plan to submit the first accession wir planen , die erste beitrittspartnerschaft partnership in the autumn of this year . im herbst dieses jahres vorzulegen . it is a question of equality and solidarity hier geht es um gleichberechtigung und . solidarität . the recommendation for the year 1999 die empfehlung für das jahr 1999 wurde vor has been formulated at a time of dem hintergrund günstiger entwicklungen favourable developments and optimistic und einer für den kurs der europäischen prospects for the european economy . wirtschaft positiven perspektive abgegeben .
    [Show full text]
  • Edinburgh's Statistical Machine Translation Systems for WMT16
    Edinburgh’s Statistical Machine Translation Systems for WMT16 Philip Williams1, Rico Sennrich1, Maria Nadejde˘ 1, Matthias Huck2, Barry Haddow1, Ondrejˇ Bojar3 1School of Informatics, University of Edinburgh 2Center for Information and Language Processing, LMU Munich 3Institute of Formal and Applied Linguistics, Charles University in Prague [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] Abstract core syntax-based setup and experiments in Sec- tions 4 and 5. This paper describes the University of Ed- inburgh’s phrase-based and syntax-based 2 Phrase-based System Overview submissions to the shared translation tasks of the ACL 2016 First Conference on Ma- 2.1 Preprocessing chine Translation (WMT16). We sub- The training data was preprocessed us- mitted five phrase-based and five syntax- ing scripts from the Moses toolkit. based systems for the news task, plus one We first normalized the data using the phrase-based system for the biomedical normalize-punctuation.perl script, task. then performed tokenization (using the -a op- tion), and then truecasing. We did not perform 1 Introduction any corpus filtering other than the standard Moses method, which removes sentence pairs with Edinburgh’s submissions to the WMT 2016 news extreme length ratios, and sentences longer than translation task fall into two distinct groups: neu- 80 tokens. ral translation systems and statistical translation systems. In this paper, we describe the statisti- 2.2 Word Alignment cal systems, which includes a mix of phrase-based and syntax-based approaches.
    [Show full text]
  • Improved Learning for Machine Translation
    This document is part of the Research and Innovation Action “Quality Translation 21 (QT21)”. This project has received funding from the European Union’s Horizon 2020 program for ICT under grant agreement no. 645452. Deliverable D1.5 Improved Learning for Machine Translation Ondřej Bojar (CUNI), Jan-Thorsten Peter (RWTH), Weiyue Wang (RWTH), Tamer Alkhouli (RWTH), Yunsu Kim (RWTH), Miloš Stanojević (UvA), Khalil Sima’an (UvA), and Jon Dehdari (DFKI) Dissemination Level: Public 31st July, 2016 Quality Translation 21 D1.5: Improved Learning for Machine Translation Grant agreement no. 645452 Project acronym QT21 Project full title Quality Translation 21 Type of action Research and Innovation Action Coordinator Prof. Josef van Genabith (DFKI) Start date, duration 1st February, 2015, 36 months Dissemination level Public Contractual date of delivery 31st July, 2016 Actual date of delivery 31st July, 2016 Deliverable number D1.5 Deliverable title Improved Learning for Machine Translation Type Report Status and version Final (Version 1.0) Number of pages 78 Contributing partners CUNI, DFKI, RWTH, UVA WP leader CUNI Author(s) Ondřej Bojar (CUNI), Jan-Thorsten Peter (RWTH), Weiyue Wang (RWTH), Tamer Alkhouli (RWTH), Yunsu Kim (RWTH), Miloš Stanojević (UvA), Khalil Sima’an (UvA), and Jon Dehdari (DFKI) EC project officer Susan Fraser The partners in QT21 are: • Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Germany • Rheinisch-Westfälische Technische Hochschule Aachen (RWTH), Germany • Universiteit van Amsterdam (UvA), Netherlands • Dublin City University (DCU), Ireland • University of Edinburgh (UEDIN), United Kingdom • Karlsruher Institut für Technologie (KIT), Germany • Centre National de la Recherche Scientifique (CNRS), France • Univerzita Karlova v Praze (CUNI), Czech Republic • Fondazione Bruno Kessler (FBK), Italy • University of Sheffield (USFD), United Kingdom • TAUS b.v.
    [Show full text]
  • A Survey of Statistical Machine Translation
    LAMP-TR-135 April 2007 CS-TR-4831 UMIACS-TR-2006-47 A Survey of Statistical Machine Translation Adam Lopez Computational Linguistics and Information Processing Laboratory Institute for Advanced Computer Studies Department of Computer Science University of Maryland College Park, MD 20742 [email protected] Abstract Statistical machine translation (SMT) treats the translation of natural language as a machine learning problem. By examining many samples of human-produced translation, SMT algorithms automatically learn how to translate. SMT has made tremendous strides in less than two decades, and many popular techniques have only emerged within the last few years. This survey presents a tutorial overview of state-of-the-art SMT at the beginning of 2007. We begin with the context of the current research, and then move to a formal problem description and an overview of the four main subproblems: translational equivalence modeling, mathematical modeling, parameter estimation, and decoding. Along the way, we present a taxonomy of some different approaches within these areas. We conclude with an overview of evaluation and notes on future directions. This is a revised draft of a paper currently under review. The contents may change in later drafts. Please send any comments, questions, or corrections to [email protected]. Feel free to cite as University of Maryland technical report UMIACS-TR-2006-47. The support of this research by the GALE program of the Defense Advanced Research Projects Agency, Contract No. HR0011-06-2-0001, ONR MURI Contract FCPO.810548265, and Department of Defense contract RD-02-5700 is acknowledged. Contents 1 Introduction 1 1.1 Background and Context .
    [Show full text]