Viewgraphs for a Machine Translation Course. DSIC

BDVA Valencia Summit 2016
Big Data Value Association
29 November - 2 December, 2016
Universitat Politècnica de València

Statistical and Neural Techniques for Interactive and Fully Automatic Translation

Francisco Casacuberta
[email protected]
Pattern Recognition and Human Language Technology research center
Universitat Politècnica de València

Index
1 Machine translation
2 Statistical and neural machine translation
3 Human and machine translation
4 Conclusions

1 Machine translation

Translation and Machine Translation
Translation: the process of translating words or text from one language into another ¹.
Human translation industry: ≈ 666 million words / day [Pym et al., The status of the translation profession in the European Union. 2012]
Machine translation: translation carried out by a computer ¹.
MT industry: 100 billion words / day [Turovsky, Ten years of Google Translate. 2016]
¹ English Oxford Dictionaries

Machine Translation ¹
• Information society and production of multilingual content:
  → 7 billion people, 193 countries, over 150 official languages.
• Globalization and demand for translation services:
  → 1,000 global companies operating in at least 160 countries.
• Size of worldwide translation market:
  → 12.5 billion $ per year ≈ 34 million $ per day.
• Size of translation industry:
  → 3,000 translation companies and > 250,000 translators.
• MT can improve the productivity of human translators:
  → integration of MT with human translation (post-editing)
¹ M. Turchi. Introduction to machine translation. MT Marathon 2014

Approaches to MT: Technologies
• (Linguistic) knowledge-based systems (KBS). (Systran, Apertium, etc.)
• (Memorized) example-based systems
  – Translation memories.
    (Trados, OmegaT, etc.)
• Statistical models
  – Alignment models. (Moses, Google, Jane, etc.)
  – Neural networks. (Google, etc.)
  – Other models.
• Hybrid models

2 Statistical and neural machine translation

Statistical machine translation
• Every sentence y from a target language is considered as a possible translation of a given sentence x from a source language.
• Goal of statistical and neural machine translation: given a source sentence x, search for a target sentence ŷ such that (risk minimization):
  $$\hat{y} = \operatorname*{argmax}_{y} P_M(y \mid x)$$
• An SMT system:
  [Diagram: a source sentence x enters the statistical machine translation system, which outputs a translation y; the model M is estimated by off-line training from sentence pairs (x_1, y_1), (x_2, y_2), ...]

Translation models
• Statistical alignment models
  – Discrete word representation.
  – Based on bilingual phrases (bilingual sequences of words).
  – Log-linear combination of different translation, language and other models.
  – Extensions: syntax-based translation, hierarchical phrase-based translation, ...
  – Moses (& GIZA++) is the most popular software.
• Neural models
  – Continuous word representation (word embedding).
  – An encoder-decoder model composed of (bidirectional) recurrent neural networks and an attention model.
  – GPUs are required.
  – SYSTRAN, Google and WIPO have announced their neural machine translation engines. (Sennrich et al. Neural Machine Translation. AMTA. 2016)

Edinburgh's WMT results over the years ¹
¹ Sennrich et al. Advances in Neural Machine Translation. AMTA. 2016.

Data, data, data, ...
• Bilingual/monolingual
  – Parallel corpora - bitext (moderate size). For example: Europarl (1.5GB)
  – Comparable corpora (moderate-high size)
  – Monolingual corpora (high size). For example: Gigaword (26GB), Common Crawl (more than 100GB)
• Sources:
  – Administration.
  – Agencies.
  – Crowdsourcing.
  – Education.
  – News.
  – LDC, ELRA, TAUS, ...
  – ...

Data, data, data, ...
• Some problems that can appear in data sets:
  – Noise.
  – No alignment at sentence level.
  – Automatic induction of the morphology of inflectional languages.
  – Agglutinative languages.
  – Words not seen in the training data.
  – Extracting named entity translingual equivalences from bilingual parallel corpora.
  – ...
• Other data types and MT technology
  – Multimodality in MT: speech (e.g. Transtalk), gestures, handwritten text, graphical material, gaze tracking.
  – MT technology in other tasks with other data types: image description, video description, visual question answering.

3 Human and machine translation

Post-editing
• The current state of the art in MT is useful for many applications.
• But this technology is still very far from allowing fully automatic high-quality translations (HQT).
• However, these MT technologies are currently seen as promising approaches to help produce HQT cost-effectively.
• Post-editing (PE) is a first solution. Even when the number of errors and bad constructions is high, post-editing can make the result useful.
• PE produces a substantial time saving compared to human translation (CasMaCat, an FP7 project; first field trial, 2012).

Interactive machine translation (IMT)
• IMT, an alternative to PE: use an MT system to produce target text segments that can be accepted or amended by a human translator; the corrected segments are then used by the MT system as additional information to achieve further, hopefully improved suggestions.
  [Diagram: an adaptive-interactive MT system takes a source sentence x and the translator's feedback y' and produces a translation y; the model M is estimated by off-line training from sentence pairs (x_1, y_1), (x_2, y_2), ... and updated by on-line training from the user's feedback.]
• Human interaction in PE and IMT offers the opportunity to improve the IMT system's behavior by tuning the translation models using the translations corrected by the user (online learning).
• IMT becomes more productive than PE over time (CasMaCat, an FP7 project; third field trial, 2014).

4 Conclusions

Conclusions
• The current state-of-the-art (statistical or neural) MT technologies are good enough for many applications.
• Large parallel corpora are necessary to build good MT systems.
• Huge monolingual and comparable corpora exist, but their exploitation is limited.
• Human activity is necessary to produce high-quality translations.
• Interactive machine translation offers a unique opportunity to produce high-quality translations with less human effort.

Thank you!
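
Backup: a toy sketch of the decision rule and of one IMT step

The decision rule on the "Statistical machine translation" slide (choose the target sentence y that maximizes P_M(y | x)) and the accept-or-amend loop on the IMT slide can be illustrated with a very small Python sketch. Everything in it is an assumption made for illustration: the table TOY_MODEL, its probabilities, the example sentences and the function name translate are hypothetical and do not come from any real system. A real decoder searches a huge hypothesis space with a trained model; here the "search" is a maximum over three hand-written candidates, and the IMT step is reduced to keeping only the candidates that start with the prefix the translator has validated.

```python
import math

# Hypothetical toy model: log P_M(y | x) for a few candidate translations
# of a single source sentence. The numbers are invented for illustration.
TOY_MODEL = {
    "la casa verde": {
        "the green house": math.log(0.6),
        "the house green": math.log(0.3),
        "a green home": math.log(0.1),
    }
}


def translate(source, prefix=""):
    """Return argmax_y P_M(y | x) over candidates that start with `prefix`.

    With an empty prefix this is the fully automatic decision rule.
    With a non-empty prefix it mimics one IMT step: the translator has
    validated or typed `prefix`, and the system proposes a completion.
    Returns None when no candidate is compatible with the prefix.
    """
    candidates = TOY_MODEL.get(source, {})
    compatible = {y: lp for y, lp in candidates.items() if y.startswith(prefix)}
    if not compatible:
        return None
    return max(compatible, key=compatible.get)


automatic = translate("la casa verde")                 # no human feedback
interactive = translate("la casa verde", prefix="a ")  # translator typed "a "
print(automatic)
print(interactive)
```

The sketch leaves out the adaptive part of IMT: the corrected translation is not fed back into the model, so there is no online learning here.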
