BDVA Valencia Summit 2016
Big Data Value Association
29 November - 2 December, 2016
Universitat Politècnica de València

Statistical and Neural Techniques for Interactive and Fully Automatic

Francisco Casacuberta
Pattern Recognition and Human Language Technology research center
Universitat Politècnica de València

Index

1 Machine translation

2 Statistical and neural machine translation

3 Human and machine translation

4 Conclusions

Translation and Machine Translation

Translation: The process of translating words or text from one language into another.¹

Human translation industry: ≈ 666 million words / day

[Pym et al., The status of the translation profession in the European Union. 2012]

Machine translation: Translation carried out by a computer.¹

MT industry: 100 billion words / day

[Turovsky, Ten years of Google Translate. 2016]

¹ English Oxford Dictionaries

Machine Translation¹

• Information society and production of multilingual content: → 7 billion people, 193 countries, over 150 official languages.

• Globalization and demand for translation services: → 1,000 global companies operating in at least 160 countries.

• Size of worldwide translation market: → 12.5 billion $ per year ≈ 34 million $ per day.

• Size of translation industry: → 3,000 translation companies and > 250,000 translators.

• MT can improve the productivity of human translators: → integration of MT with human translation (post-editing).

¹ M. Turchi. Introduction to machine translation. MT Marathon 2014.

Approaches to MT: Technologies

• (Linguistic) knowledge-based systems (KBS). (Systran, Apertium, etc.)

• (Memorized) example-based systems – Translation memories. (Trados, OmegaT, etc.)

• Statistical models
  – Alignment models. (Moses, Google, Jane, etc.)
  – Neural networks. (Google, etc.)
  – Other models.

• Hybrid models

Statistical machine translation

• Every sentence y from a target language is considered as a possible translation of a given sentence x from a source language.

• Goal of statistical and neural machine translation: given a source sentence $x$, search for a target sentence $\hat{y}$ such that (risk minimization):

$$\hat{y} = \operatorname*{argmax}_{y} P_M(y \mid x)$$

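• An SMT system:

[Diagram: a source sentence x enters the statistical machine translation system, which outputs a target sentence y. The model M used by the system is obtained by off-line training on sentence pairs (x_1, y_1), (x_2, y_2), ...]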
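
The decision rule above can be illustrated with a minimal sketch. It assumes a toy model M estimated by relative frequency from a handful of sentence pairs, and a search restricted to target sentences seen in training. The function names (`train`, `translate`) and the data are hypothetical; real systems decompose P_M(y | x) into word- or phrase-level models rather than memorizing whole sentences.

```python
from collections import Counter, defaultdict


def train(pairs):
    """Off-line training: estimate P_M(y | x) by relative frequency (toy assumption)."""
    counts = defaultdict(Counter)
    for x, y in pairs:
        counts[x][y] += 1
    model = {}
    for x, ys in counts.items():
        total = sum(ys.values())
        model[x] = {y: c / total for y, c in ys.items()}
    return model


def translate(model, x):
    """Search: return the y that maximizes P_M(y | x), or None if x was never seen."""
    candidates = model.get(x)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Hypothetical training pairs (x_1, y_1), (x_2, y_2), ...
pairs = [
    ("la casa verde", "the green house"),
    ("la casa verde", "the green house"),
    ("la casa verde", "the green home"),
    ("gracias", "thank you"),
]
model = train(pairs)
best = translate(model, "la casa verde")
print(best)
```

The sketch returns `None` for unseen source sentences, which is exactly the limitation that motivates models built from smaller units.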

Translation models

• Statistical alignment models
  – Discrete word representation.
  – Based on bilingual phrases (bilingual sequences of words).
  – Log-linear combination of different translation, language and other models.
  – Extensions: syntax-based translation, hierarchical phrase-based translation, ...
  – Moses (& GIZA++) is the most popular software.

• Neural models
  – Continuous word representation (word embedding).
  – An encoder-decoder model composed of (bidirectional) recurrent neural networks and an attention model.
  – GPUs are required.
  – SYSTRAN, Google and WIPO have announced their neural machine translation engines. (Sennrich et al. Neural Machine Translation. AMTA. 2016)
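
The log-linear combination mentioned for statistical alignment models can be sketched as follows. This is an illustration under stated assumptions, not the scoring code of Moses or any other toolkit: the feature names (`tm`, `lm`, `length`), their values and the weights are all hypothetical, and the normalization runs over a tiny hand-written candidate list instead of a real search space.

```python
import math


def log_linear_score(features, weights):
    """Weighted sum of the feature values h_k(x, y); higher is better."""
    return sum(weights[name] * value for name, value in features.items())


def posterior(candidates, weights):
    """Normalize exp(score) over the candidate list to obtain P(y | x)."""
    scores = {y: log_linear_score(f, weights) for y, f in candidates.items()}
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}


# Hypothetical feature values: log-probabilities from a translation model (tm)
# and a language model (lm), plus a word-count feature (length).
candidates = {
    "the green house": {"tm": -1.2, "lm": -2.0, "length": 3},
    "the house green": {"tm": -1.0, "lm": -5.5, "length": 3},
    "green house": {"tm": -2.5, "lm": -2.4, "length": 2},
}
weights = {"tm": 1.0, "lm": 0.8, "length": 0.1}  # illustrative; normally tuned on held-out data

probs = posterior(candidates, weights)
best = max(probs, key=probs.get)
print(best)
```

In this toy example the language model feature outweighs the slightly better translation model score of the word-order error, which is the kind of trade-off the weights are meant to control.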

Edinburgh’s WMT results over the years¹

¹ Sennrich et al. Advances in Neural Machine Translation. AMTA. 2016.

Data, data, data, ...

• Bilingual/monolingual
  – Parallel corpora, or bitext (moderate size). For example: Europarl (1.5 GB).
  – Comparable corpora (moderate to large size).
  – Monolingual corpora (large size). For example: Gigaword (26 GB), Common Crawl (more than 100 GB).

• Sources:
  – Administration.
  – Agencies.
  – Crowdsourcing.
  – Education.
  – News.
  – LDC, ELRA, TAUS, ...
  – ...

Data, data, data, ...

• Some problems that can appear in data sets:
  – Noise.
  – No alignment at sentence level.
  – Automatic induction of the morphology of inflectional languages.
  – Agglutinative languages.
  – Words not seen in the training data.
  – Extracting translingual named-entity equivalences from bilingual parallel corpora.
  – ...

• Other data types and MT technology
  – Multimodality in MT: speech (e.g. Transtalk), gestures, handwritten text, graphical material, gaze tracking.
  – MT technology in other tasks with other data types: image description, video description, visual question answering.
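
One of the problems listed above, words not seen in the training data, is easy to quantify. The following is a minimal sketch that assumes lowercased whitespace tokenization (a simplification; real pipelines use proper tokenizers and often subword units). The function names and the sentences are hypothetical.

```python
def vocabulary(sentences):
    """Set of lowercased, whitespace-separated tokens in a list of sentences."""
    return {word for sentence in sentences for word in sentence.lower().split()}


def oov_rate(train_sentences, test_sentences):
    """Fraction of test tokens that never occur in the training data, plus the tokens."""
    vocab = vocabulary(train_sentences)
    tokens = [w for s in test_sentences for w in s.lower().split()]
    if not tokens:
        return 0.0, []
    unseen = [w for w in tokens if w not in vocab]
    return len(unseen) / len(tokens), unseen


# Hypothetical training and test data.
train_sentences = ["the house is green", "the car is red"]
test_sentences = ["the house is red", "the bicycle is blue"]

rate, unseen = oov_rate(train_sentences, test_sentences)
print(rate, unseen)
```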

Post-editing

• The current state of the art in MT is useful for many applications.

• But this technology is still very far from allowing fully automatic high-quality translation (HQT).

• However, these MT technologies are currently seen as promising approaches to help produce HQT cost-effectively.

• Post-editing (PE) is a first solution. While the number of errors and bad constructions is high, “post-editing” can make the result useful.

• PE produces a substantial time saving compared to human translation (CasMaCat, an FP7 project; first field trial, 2012).
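
The slides do not say how post-editing effort was measured in the field trial. As an assumption for illustration only, the sketch below uses the word-level edit distance between the MT output and its post-edited version as a rough proxy for the amount of correction. The function name and the sentences are hypothetical.

```python
def word_edit_distance(hypothesis, reference):
    """Minimum number of word insertions, deletions and substitutions (Levenshtein)."""
    hyp, ref = hypothesis.split(), reference.split()
    # prev[j] holds the distance between the hypothesis words processed so far and ref[:j].
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h
                            curr[j - 1] + 1,      # insert r
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


# Hypothetical MT output and its post-edited version.
mt_output = "the house green is very big"
post_edited = "the green house is very large"

edits = word_edit_distance(mt_output, post_edited)
effort = edits / len(post_edited.split())  # edits per word of the final translation
print(edits, effort)
```

A lower ratio means that less of the MT output had to be changed, which is when post-editing is most likely to save time over translating from scratch.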

Interactive machine translation (IMT)

• IMT, an alternative to PE: use an MT system to produce target text segments that can be accepted or amended by a human translator; the corrected segments are then used by the MT system as additional information to achieve further, hopefully improved suggestions.

[Diagram: an adaptive-interactive MT system receives a source sentence x and feedback from the human translator, and produces suggestions y'. Its model M is obtained by off-line training on sentence pairs (x_1, y_1), (x_2, y_2), ... and is updated by on-line training.]

• Human interaction in PE and IMT offers the opportunity to improve the IMT system’s behavior by tuning the translation models with the translations corrected by the user (online learning).

• IMT becomes more productive than PE over time (CasMaCat, an FP7 project; third field trial, 2014).
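
The interaction loop described above can be sketched with a toy simulation. Several assumptions are made here that the slides do not specify: the system is reduced to a scored list of candidate translations, the translator validates the longest correct prefix of each suggestion and types the next word, and "on-line training" is imitated by simply adding the validated translation to the list with the top score. All names and data are hypothetical.

```python
def suggest(nbest, prefix):
    """Return the best-scoring hypothesis that starts with the validated prefix."""
    compatible = [(score, hyp) for hyp, score in nbest.items()
                  if hyp.split()[:len(prefix)] == prefix]
    if not compatible:
        return list(prefix)  # nothing to offer beyond what the user validated
    return max(compatible)[1].split()


def interactive_session(nbest, reference):
    """Count the corrections made by a simulated translator who wants `reference`."""
    ref = reference.split()
    prefix, corrections = [], 0
    while True:
        hyp = suggest(nbest, prefix)
        if hyp == ref:
            return corrections
        # Length of the longest prefix of the suggestion that is already correct.
        k = 0
        while k < min(len(hyp), len(ref)) and hyp[k] == ref[k]:
            k += 1
        corrections += 1
        if k == len(ref):  # suggestion is too long: the user cuts the tail
            return corrections
        prefix = ref[:k + 1]  # validated prefix plus the word typed by the user


def online_update(nbest, validated):
    """Toy stand-in for on-line training: promote the validated translation."""
    nbest[validated] = max(nbest.values()) + 1.0


# Hypothetical scored candidates for one source sentence.
nbest = {
    "the house green is big": -1.0,
    "the green house is big": -1.5,
    "the green house is large": -2.0,
}
reference = "the green house is large"

first = interactive_session(nbest, reference)
online_update(nbest, reference)
second = interactive_session(nbest, reference)
print(first, second)
```

After the update, the same sentence needs no corrections in this toy setting, which mirrors the idea that the system's behavior improves as it learns from the translator's corrections.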

Conclusions

• Current state-of-the-art (statistical or neural) MT technology is good enough for many applications.

• A lot of parallel corpora are necessary to build good MT systems.

• Huge amounts of monolingual data and comparable corpora exist, but their exploitation is limited.

• Human activity is necessary to produce high-quality translations.

• Interactive machine translation offers a unique opportunity to produce high-quality translations with less human effort.

Thank you!
