
BDVA Valencia Summit 2016
Big Data Value Association
29 November - 2 December, 2016
Universitat Politècnica de València

Statistical and Neural Techniques for Interactive and Fully Automatic Translation

Francisco Casacuberta
Pattern Recognition and Human Language Technology research center
Universitat Politècnica de València

Index
1 Machine translation
2 Statistical and neural machine translation
3 Human and machine translation
4 Conclusions

Translation and Machine Translation

Translation: The process of translating words or text from one language into another.¹
Human translation industry: ≈ 666 million words / day [Pym et al., The status of the translation profession in the European Union. 2012]

Machine translation: Translation carried out by a computer.¹
MT industry: 100 billion words / day [Turovsky, Ten years of Google Translate. 2016]

¹ English Oxford Dictionaries

Machine Translation¹

• Information society and production of multilingual content:
  → 7 billion people, 193 countries, over 150 official languages.
• Globalization and demand for translation services:
  → 1,000 global companies operating in at least 160 countries.
• Size of worldwide translation market:
  → 12.5 billion $ per year ≈ 34 million $ per day.
• Size of translation industry:
  → 3,000 translation companies and > 250,000 translators.
• MT can improve productivity of human translators:
  → integration of MT with human translation (post-editing)

¹ M. Turchi. Introduction to machine translation. MT Marathon 2014

Approaches to MT: Technologies

• (Linguistic) knowledge-based systems (KBS). (Systran, Apertium, etc.)
• (Memorized) example-based systems
  – Translation memories.
    (Trados, OmegaT, etc.)
• Statistical models
  – Alignment models. (Moses, Google, Jane, etc.)
  – Neural networks. (Google, etc.)
  – Other models.
• Hybrid models

Statistical machine translation

• Every sentence y from a target language is considered as a possible translation of a given sentence x from a source language.
• Goal of statistical and neural machine translation: given a source sentence x, search for a target sentence \hat{y} such that (risk minimization):

  \hat{y} = \operatorname*{argmax}_{y} P_M(y \mid x)

• An SMT system:
  [Diagram: a source sentence x enters the statistical machine translation system, which outputs a translation y. The models M are obtained by off-line training on sentence pairs (x_1, y_1), (x_2, y_2), ...]

Translation models

• Statistical alignment models
  – Discrete word representation.
  – Based on bilingual phrases (bilingual sequences of words).
  – Log-linear combination of different translation, language and other models.
  – Extensions: syntax-based translation, hierarchical phrase-based translation, ...
  – Moses (& GIZA++) is the most popular software.
• Neural models
  – Continuous word representation (word embedding).
  – An encoder-decoder model composed of (bidirectional) recurrent neural networks and an attention model.
  – GPUs are required.
  – SYSTRAN, Google and WIPO have announced their neural machine translation engines. (Sennrich et al. Neural Machine Translation. AMTA. 2016)

Edinburgh's WMT results over the years¹
[Figure]

¹ Sennrich et al. Advances in Neural Machine Translation. AMTA. 2016.

Data, data, data, ...

• Bilingual/monolingual
  – Parallel corpora - bitext (moderate size). For example: Europarl (1.5GB)
  – Comparable corpora (moderate-high size)
  – Monolingual corpora (high size). For example: Gigaword (26GB), Common Crawl (more than 100GB)
• Sources:
  – Administration.
  – Agencies.
  – Crowd sourcing.
  – Education.
  – News.
  – LDC, ELRA, TAUS, ...
  – ...

Data, data, data, ...

• Some problems that can appear in data sets:
  – Noise.
  – No alignment at sentence level.
  – Automatic induction of the morphology of inflectional languages.
  – Agglutinative languages.
  – Words not seen in the training data.
  – Extracting translingual named-entity equivalences from bilingual parallel corpora.
  – ...
• Other data types and MT technology
  – Multimodality in MT: speech (e.g. Transtalk), gestures, handwritten text, graphical material, gaze tracking.
  – MT technology in other tasks with other data types: image description, video description, visual question answering.

Post-editing

• The current state of the art in MT is useful for many applications,
• but this technology is still very far from allowing fully automatic high-quality translation (HQT).
• However, these MT technologies are currently seen as promising approaches to help produce HQT cost-effectively.
• Post-editing (PE) is a first solution. While the number of errors and bad constructions is high, post-editing can make the result useful.
• PE produces a substantial time saving compared to human translation (CasMaCat, an FP7 project; first field trial, 2012).

Interactive machine translation (IMT)

• IMT, an alternative to PE: use an MT system to produce target text segments that can be accepted or amended by a human translator; the corrected segments are then used by the MT system as additional information to achieve further, hopefully improved suggestions.
  [Diagram: a source sentence x enters the adaptive-interactive MT system, which outputs a suggestion y; the translator's feedback y' is returned to the system. The models M are obtained by off-line training on sentence pairs (x_1, y_1), (x_2, y_2), ... and updated by on-line training with x, y and y'.]
• Human interaction in PE and IMT offers the opportunity to improve the IMT system's behavior by tuning the translation models using the translations corrected by the user (online learning).
• IMT becomes more productive than PE over time (CasMaCat, an FP7 project; third field trial, 2014).

Conclusions

• Current state-of-the-art (statistical or neural) MT technology is good enough for many applications.
• Large parallel corpora are necessary to build good MT systems.
• Huge monolingual and comparable corpora exist, but their exploitation is limited.
• Human activity is necessary to produce high-quality translations.
• Interactive machine translation offers a unique opportunity to produce high-quality translations with less human effort.

Thank you!
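
Illustrative sketch: the IMT loop

The IMT cycle described in the slides (the system suggests a translation, the translator accepts or amends it, and the corrected translation is used for on-line learning) can be sketched in a few lines of Python. This is only a toy illustration under strong assumptions, not the CasMaCat system or any real SMT/NMT engine: the names `ToyInteractiveMT` and `simulate_translator`, the relative-frequency "model", and the example sentences are all hypothetical. The `suggest` method applies the decision rule \hat{y} = argmax_y P_M(y | x), restricted to candidates that extend the prefix already validated by the translator.

```python
from collections import Counter


class ToyInteractiveMT:
    """Toy stand-in for an adaptive MT system (hypothetical, for illustration)."""

    def __init__(self, training_pairs):
        # Off-line training: count how often each target y was seen for each source x.
        self.counts = {}
        for x, y in training_pairs:
            self.learn(x, y)

    def learn(self, x, y):
        # On-line training reuses the same count update for validated translations.
        self.counts.setdefault(x, Counter())[y] += 1

    def suggest(self, x, prefix_words=()):
        # argmax over y of P_M(y | x), limited to targets that start with the
        # validated prefix. Counts are proportional to relative frequencies,
        # so maximizing the count maximizes the toy P_M(y | x).
        prefix = list(prefix_words)
        candidates = {
            y: c
            for y, c in self.counts.get(x, {}).items()
            if y.split()[: len(prefix)] == prefix
        }
        return max(candidates, key=candidates.get) if candidates else None


def simulate_translator(system, x, reference):
    """Simulate a translator who amends the first wrong word of each suggestion.

    Returns the number of corrections needed to reach `reference`.
    """
    ref_words = reference.split()
    validated = 0    # number of reference words validated so far
    corrections = 0
    while validated < len(ref_words):
        suggestion = system.suggest(x, ref_words[:validated])
        if suggestion == reference:
            break    # the suggestion is accepted as it is
        sug_words = suggestion.split() if suggestion else []
        # Accept the words of the suggestion that agree with the reference.
        while (validated < min(len(ref_words), len(sug_words))
               and sug_words[validated] == ref_words[validated]):
            validated += 1
        corrections += 1    # the translator amends (or types) the next word
        validated += 1
    system.learn(x, reference)    # on-line learning from the validated translation
    return corrections


system = ToyInteractiveMT([("la casa verde", "the green home")])
history = [simulate_translator(system, "la casa verde", "the green house")
           for _ in range(3)]
print(history)  # [1, 1, 0]
```

In this toy run the number of corrections drops to zero once the validated translation has been seen more often than the competing one, which mirrors, in a very simplified way, the claim that on-line learning makes IMT more productive over time.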