BDVA Valencia Summit 2016
Big Data Value Association
29 November - 2 December, 2016
Universitat Politècnica de València

Statistical and Neural Techniques for Interactive and Fully Automatic Translation
Francisco Casacuberta
Pattern Recognition and Human Language Technology research center
Universitat Politècnica de València
Index
1 Machine translation
2 Statistical and neural machine translation
3 Human and machine translation
4 Conclusions
Translation and Machine Translation
Translation: The process of translating words or text from one language into another.¹
Human translation industry: ≈ 666 million words / day
[Pym et al., The status of the translation profession in the European Union. 2012]
Machine translation: Translation carried out by a computer.¹
MT industry: 100 billion words / day
[Turovsky, Ten years of Google Translate. 2016]
¹ English Oxford Dictionaries
Machine Translation¹
• Information society and production of multilingual content: → 7 billion people, 193 countries, over 150 official languages.
• Globalization and demand for translation services: → 1,000 global companies operating in at least 160 countries.
• Size of worldwide translation market: → 12.5 billion $ per year ≈ 34 million $ per day.
• Size of translation industry: → 3,000 translation companies and > 250,000 translators.
• MT can improve the productivity of human translators: → integration of MT with human translation (post-editing).
¹ M. Turchi. Introduction to machine translation. MT Marathon 2014.
Approaches to MT: Technologies
• (Linguistic) knowledge-based systems (KBS). (Systran, Apertium, etc.)
• (Memorized) example-based systems – Translation memories. (Trados, OmegaT, etc.)
• Statistical models
  – Alignment models. (Moses, Google, Jane, etc.)
  – Neural networks. (Google, etc.)
  – Other models.
• Hybrid models
Statistical machine translation
• Every sentence y from a target language is considered as a possible translation of a given sentence x from a source language.
• Goal of statistical and neural machine translation: given a source sentence x, search for a target sentence $\hat{y}$ such that (risk minimization):
$$\hat{y} = \operatorname*{argmax}_{y} \, P_M(y \mid x)$$
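• An SMT system: a source sentence x is fed to the statistical machine translation system, which outputs a translation y. The model M used by the system is estimated by off-line training on sentence pairs (x_1, y_1), (x_2, y_2), ...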
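The decision rule on this slide can be illustrated with a minimal sketch. It assumes a toy model stored as a table of conditional probabilities and a small explicit candidate set; `TOY_MODEL`, `translate` and all sentences and probabilities are hypothetical stand-ins, since a real system searches a very large space of target sentences.

```python
from typing import Dict

# Hypothetical toy model M: P_M(y | x) stored as a nested table.
# The sentences and probabilities are made up for illustration only.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    "la casa verde": {
        "the green house": 0.7,
        "the house green": 0.2,
        "green the house": 0.1,
    },
}


def translate(x: str, model: Dict[str, Dict[str, float]]) -> str:
    """Return the candidate y that maximizes P_M(y | x)."""
    candidates = model[x]
    return max(candidates, key=lambda y: candidates[y])


best = translate("la casa verde", TOY_MODEL)
print(best)
```

Enumerating candidates is only feasible for a toy table; practical decoders approximate the argmax with heuristic search.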
Translation models
• Statistical alignment models
  – Discrete word representation.
  – Based on bilingual phrases (bilingual sequences of words).
  – Log-linear combination of translation, language and other models.
  – Extensions: syntax-based translation, hierarchical phrase-based translation, ...
  – Moses (with GIZA++) is the most popular software.
• Neural models
  – Continuous word representation (word embeddings).
  – An encoder-decoder model composed of (bidirectional) recurrent neural networks and an attention model.
  – GPUs are required.
  – SYSTRAN, Google and WIPO have announced their neural machine translation engines. (Sennrich et al. Neural Machine Translation. AMTA. 2016)
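The log-linear combination listed for the statistical alignment models can be sketched as follows. This is a minimal illustration, assuming two hypothetical feature functions (a translation-model and a language-model log-probability) with invented values and weights; it is not the Moses implementation.

```python
import math
from typing import Dict

# Hypothetical log-scores h_k(x, y) for each candidate translation y of one
# source sentence x. All numbers are invented for illustration.
FEATURES: Dict[str, Dict[str, float]] = {
    "the green house": {"tm": math.log(0.5), "lm": math.log(0.4)},
    "the house green": {"tm": math.log(0.6), "lm": math.log(0.05)},
}

# Hypothetical weights lambda_k; in practice they are tuned on held-out data.
WEIGHTS: Dict[str, float] = {"tm": 1.0, "lm": 1.0}


def loglinear_score(feats: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of feature values: sum_k lambda_k * h_k(x, y)."""
    return sum(weights[name] * value for name, value in feats.items())


def best_candidate(features: Dict[str, Dict[str, float]],
                   weights: Dict[str, float]) -> str:
    """Pick the candidate with the highest log-linear score."""
    return max(features, key=lambda y: loglinear_score(features[y], weights))


print(best_candidate(FEATURES, WEIGHTS))
print(best_candidate(FEATURES, {"tm": 1.0, "lm": 0.0}))
```

With both features active the fluent candidate wins; setting the language-model weight to zero lets the translation model alone decide, which shows why the weights matter.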
Edinburgh's WMT results over the years¹
¹ Sennrich et al. Advances in Neural Machine Translation. AMTA. 2016.
Data, data, data, ...
• Bilingual/monolingual
  – Parallel corpora (bitext), moderate size. For example: Europarl (1.5GB).
  – Comparable corpora (moderate to high size).
  – Monolingual corpora (large size). For example: Gigaword (26GB), Common Crawl (more than 100GB).
• Sources:
  – Administration.
  – Agencies.
  – Crowdsourcing.
  – Education.
  – News.
  – LDC, ELRA, TAUS, ...
  – ...
Data, data, data, ...
• Some problems that can appear in data sets:
  – Noise.
  – No alignment at sentence level.
  – Automatic induction of the morphology of inflectional languages.
  – Agglutinative languages.
  – Words not seen in the training data.
  – Extracting cross-lingual named-entity equivalences from bilingual parallel corpora.
  – ...
• Other data types and MT technology
  – Multimodality in MT: speech (e.g. Transtalk), gestures, handwritten text, graphical material, gaze tracking.
  – MT technology in other tasks with other data types: image description, video description, visual question answering.
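One of the problems listed above, words not seen in the training data, is easy to quantify. The following is a minimal sketch, assuming whitespace tokenization and tiny invented corpora; the function names `vocabulary` and `oov_rate` are hypothetical.

```python
from typing import Iterable, List, Set


def vocabulary(sentences: Iterable[str]) -> Set[str]:
    """Collect the set of whitespace-separated tokens seen in training."""
    return {token for sentence in sentences for token in sentence.split()}


def oov_rate(train: List[str], test: List[str]) -> float:
    """Fraction of running test tokens that never occur in the training data."""
    vocab = vocabulary(train)
    test_tokens = [token for sentence in test for token in sentence.split()]
    if not test_tokens:
        return 0.0
    unseen = sum(1 for token in test_tokens if token not in vocab)
    return unseen / len(test_tokens)


# Invented example data: "bicycle" does not appear in the training sentences.
train_sentences = ["the house is green", "the car is red"]
test_sentences = ["the bicycle is green"]
rate = oov_rate(train_sentences, test_sentences)
print(rate)
```

Real corpora would need proper tokenization, and the rate is typically much higher for inflectional and agglutinative languages.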
Post-editing
• The current state of the art in MT is useful for many applications.
• But this technology is still very far from allowing fully automatic high-quality translations (HQT).
• However, these MT technologies are currently seen as promising approaches to help produce HQT cost-effectively.
• Post-editing (PE) is a first solution. Even when the number of errors and bad constructions is high, post-editing can make the result useful.
• PE produces a substantial time saving compared to human translation (CasMaCat, an FP7 project; first field trial, 2012).
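The slides do not say how post-editing effort is measured. One possible proxy, given here purely as an assumption, is the word-level edit distance between the MT output and its post-edited version. A minimal sketch with invented sentences:

```python
from typing import List


def word_edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Levenshtein distance over words (insertions, deletions, substitutions)."""
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            substitution = previous[j - 1] + (hyp_word != ref_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# Invented example: the MT output has two words in the wrong order.
mt_output = "the house green is big".split()
post_edited = "the green house is big".split()
edits = word_edit_distance(mt_output, post_edited)
print(edits)
```

This counts word operations only; it says nothing about the time a translator actually spends, which is what the field trial measured.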
Interactive machine translation (IMT)
• IMT, an alternative to PE: use an MT system to produce target text segments that can be accepted or amended by a human translator; the corrected segments are then used by the MT system as additional information to achieve further, hopefully improved suggestions.
• Diagram: an adaptive-interactive MT system translates the source x into y; the translator's feedback y' is returned to the system. The model M is estimated by off-line training on sentence pairs (x_1, y_1), (x_2, y_2), ... and updated by on-line training from the user feedback.
• Human interaction in PE and IMT offers the opportunity to improve the IMT system's behavior by tuning the translation models using the translations corrected by the user (online learning).
• IMT becomes more productive than PE over time (CasMaCat, an FP7 project; third field trial, 2014).
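The accept-or-amend loop described on this slide can be simulated with a short sketch. It assumes a prefix-based protocol (the translator validates the correct prefix of a suggestion and types the next word) and a fixed, invented candidate list in place of a real MT model; `CANDIDATES`, `complete` and `interactive_session` are hypothetical names, and online learning is not modeled.

```python
from typing import List, Tuple

# Hypothetical candidate list for one source sentence, sorted by decreasing
# model score. A real system would search its model instead of a fixed list.
CANDIDATES: List[List[str]] = [
    "the house green is big".split(),
    "the green house is large".split(),
    "the green house is big".split(),
]


def complete(prefix: List[str], candidates: List[List[str]]) -> List[str]:
    """Return the best-scoring candidate that starts with the validated prefix."""
    for candidate in candidates:
        if candidate[: len(prefix)] == prefix:
            return candidate
    return list(prefix)  # no compatible candidate: keep only the prefix


def interactive_session(reference: List[str],
                        candidates: List[List[str]]) -> Tuple[List[str], int]:
    """Simulate a translator who fixes the first wrong word of each suggestion."""
    prefix: List[str] = []
    corrections = 0
    hypothesis = complete(prefix, candidates)
    while hypothesis != reference:
        # Length of the longest common prefix of hypothesis and reference.
        agree = 0
        while (agree < len(hypothesis) and agree < len(reference)
               and hypothesis[agree] == reference[agree]):
            agree += 1
        corrections += 1
        if agree == len(reference):
            # The suggestion only has extra trailing words: the user cuts them.
            hypothesis = list(reference)
            break
        # The user validates the correct prefix and types the next word.
        prefix = reference[: agree + 1]
        hypothesis = complete(prefix, candidates)
    return hypothesis, corrections


target = "the green house is big".split()
final, n_corrections = interactive_session(target, CANDIDATES)
print(" ".join(final), n_corrections)
```

Each correction constrains the next suggestion, so the simulated user reaches the intended translation without retyping the words the system already got right.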
Conclusions
• Current state-of-the-art (statistical or neural) MT technology is good enough for many applications.
• A lot of parallel corpora are necessary to build good MT systems.
• Huge monolingual and comparable corpora are available, but their exploitation is still limited.
• Human activity is necessary to produce high-quality translations.
• Interactive machine translation offers a unique opportunity to produce high-quality translations with less human effort.
Thank you!