DEPARTMENT OF INFORMATION ENGINEERING AND COMPUTER SCIENCE
ICT International Doctoral School

Automatic Post-Editing for Machine Translation

Rajen Chatterjee

Advisor: Dr. Marco Turchi, Fondazione Bruno Kessler
Co-Advisor: Matteo Negri, Fondazione Bruno Kessler

arXiv:1910.08592v1 [cs.CL] 18 Oct 2019

October 2019

Abstract

Automatic Post-Editing (APE) aims to correct systematic errors in a machine-translated text. This is primarily useful when the machine translation (MT) system is not accessible for improvement, leaving APE as a viable option to improve translation quality as a downstream task, which is the focus of this thesis. This field has received less attention than MT for several reasons, including: the limited availability of data to perform sound research, contrasting views reported by different researchers about the effectiveness of APE, and limited attention from industry to using APE in current production pipelines.

In this thesis, we perform a thorough investigation of APE as a downstream task in order to: i) understand its potential to improve translation quality; ii) advance the core technology, starting from classical methods and moving to recent deep learning-based solutions; iii) cope with limited and sparse data; iv) better leverage multiple input sources; v) mitigate the task-specific problem of over-correction; vi) enhance neural decoding to leverage external knowledge; and vii) establish an online learning framework to handle data diversity in real time. All the above contributions are discussed across several chapters, and most of them are evaluated in the APE shared task organized each year at the Conference on Machine Translation. Our efforts in improving the technology resulted in the best system at the 2017 APE shared task, and our work on online learning received a distinguished paper award at the Italian Conference on Computational Linguistics. Overall, the outcomes and findings of our work have boosted interest among researchers and attracted industry to examine this technology for solving real-world problems.

Keywords: Automatic Post-Editing, Machine Translation, Deep Learning, Natural Language Processing, Neural Network

Acknowledgement

First and foremost, I want to thank my advisor, Dr. Marco Turchi. It has been an honor to be his PhD student. I truly appreciate his guidance and moral support throughout my PhD pursuit. It has always been fun to discuss technical topics and brainstorm research ideas with him. He was always there to celebrate my victories and share my failures. His guidance, suggestions, and feedback helped me become a better person in both my professional and personal life. Without his support, this PhD journey would have been incomplete.

I also want to thank my co-advisor, Matteo Negri, for being my inspiration for creativity. His guidance helped me greatly improve my skills in writing research papers, preparing presentation slides, and speaking about my work in front of large audiences. He has always been a source of motivation to achieve my goals.

I would like to thank all the members of the machine translation group at Fondazione Bruno Kessler for their valuable time discussing research during meetings, presentations, and lunch breaks. Thanks to all the fellow interns, Master's students, and PhD students for making my journey joyful. A special thanks to all the people involved in providing infrastructure support and resolving any issues at the earliest.
I also want to thank all the staff of the University of Trento and the Welcome Office for helping me with all the bureaucratic work.

Lastly, I want to thank my parents and all my family members for their love, encouragement, and sacrifices. A special thanks to my sister Moushmi Banerjee for guiding me through each step of my life and helping me pursue the right career, which allowed me to begin this exciting PhD journey. Most of all, thanks to my loving, caring, and encouraging wife Kaustri Bhattacharyya, whose support and patience during the final stages of my PhD are much appreciated.

Contents

1 Introduction
  1.1 Motivation
  1.2 Problems
    1.2.1 Data
    1.2.2 Technology
    1.2.3 Application
  1.3 Outline
  1.4 Contribution

2 MT Background
  2.1 Translation Market
  2.2 Machine Translation
    2.2.1 Phrase-based Statistical MT (PBSMT)
    2.2.2 Neural Machine Translation (NMT)
  2.3 MT Evaluation
    2.3.1 Automatic Evaluation
    2.3.2 Human Evaluation
  2.4 Summary

3 Automatic Post-Editing
  3.1 Introduction
  3.2 Motivation
  3.3 Literature Review
    3.3.1 APE Paradigm: Rule-based, Phrase-based, or Neural
    3.3.2 MT Engine Accessibility: Glass-box vs. Black-box
    3.3.3 Type of Post-Editing Data: Simulated vs. Real
    3.3.4 Domain of the Data: Generic vs. Specific
    3.3.5 Learning Mode: Offline vs. Online
  3.4 Findings from the APE Shared Task at WMT
    3.4.1 The Pilot Round (2015)
    3.4.2 The Second Round (2016)
    3.4.3 The Third Round (2017)
  3.5 Summary

4 Phrase-based APE
  4.1 Introduction
  4.2 Monolingual APE
  4.3 Context-aware APE
  4.4 Effectiveness of APE for Domain-specific Data
    4.4.1 Reimplementing the Two APE Variants
    4.4.2 Experimental Setting
    4.4.3 Results
  4.5 Effectiveness of APE for Generic Data
    4.5.1 Experimental Settings
    4.5.2 APE Pipeline
    4.5.3 Full-Fledged Systems
  4.6 Summary

5 Hybrid APE
  5.1 Introduction
  5.2 Factored Phrase-based MT
  5.3 Factored Phrase-based APE
    5.3.1 Factor Creation
    5.3.2 Neural Language Model
  5.4 Experimental Setting
  5.5 Results
  5.6 Summary

6 End-to-End Neural APE
  6.1 Introduction
  6.2 Multi-source Neural APE
    6.2.1 Experiments
    6.2.2 Results on Test Data
  6.3 Summary

7 Combining APE and QE for MT
  7.1 Introduction
  7.2 Quality Estimation
  7.3 Integrating External Knowledge in NMT Decoding
    7.3.1 Related Work
    7.3.2 Guided NMT Decoding
    7.3.3 Experimental Settings
    7.3.4 Results
  7.4 QE and APE Integration Strategies
    7.4.1 QE as APE Activator
    7.4.2 QE as APE Guidance
    7.4.3 QE as MT/APE Selector
    7.4.4 Experimental Settings
    7.4.5 Results and Discussion
  7.5 Summary

8 Online Phrase-based APE
  8.1 Introduction
  8.2 Related Work
  8.3 Online APE System
    8.3.1 Instance Selection
    8.3.2 Dynamic Knowledge Base
    8.3.3 Negative Feedback In-the-loop
  8.4 Experimental Setting
  8.5 Results
  8.6 Analysis
  8.7 Summary

9 Conclusion
  9.1 Data
  9.2 Technology
  9.3 Applications
  9.4 Open Problems

Bibliography

List of Tables

1.1 Examples illustrating the problem of over-correction in the APE task.
3.1 Official results for the WMT15 Automatic Post-editing task.
3.2 Official results for the WMT16 Automatic Post-editing task.
3.3 Results for the WMT17 APE EN-DE task.
3.4 Results for the WMT17 APE DE-EN task.
4.1 An example of the joint representation used in context-aware translation.
4.2 Data statistics for each language.
4.3 Performance of the MT baseline and the APE methods for each language pair. Results for APE-2 marked with the “∗” symbol are statistically significant compared to APE-1.
4.4 Performance (TER score) of the APE systems using various LMs on the dev set.
4.5 Performance (TER score) of APE-1-LM1 after pruning at various threshold values on the dev set.
4.6 Performance (TER score) of APE-2-LM1 after pruning at various threshold values on the dev set.
4.7 Performance (TER score) of APE-1-LM1-Prun0.4 for different features.
4.8 Performance (TER score) of APE-2-LM1-Prun0.2 for different features on the dev set.
4.9 APE shared task evaluation score (TER).
5.1 Example of a parallel corpus with factored representation.
5.2 Performance of the APE systems on the development set (“†” indicates statistically significant differences wrt. the Baseline with p < 0.05).
5.3 Performance of the Factored APE-2 for various LMs (the statistical word-based LM is present in all the experiments by default).
5.4 Performance of the combined factored model for various tuning configurations.
5.5 Performance of the APE system with quality estimation for various thresholds.
5.6 Results of our submission to the 2016 WMT APE shared task (“†” indicates statistically significant differences wrt. the Baseline (MT) with p < 0.05).
6.1 Performance of the APE systems on dev. 2016 (EN-DE) (“†” indicates statistically significant differences wrt. MT_Baseline with p < 0.05).
6.2 Performance of the APE systems on dev. 2017 (DE-EN) (“†” indicates statistically significant differences wrt. MT_Baseline with p < 0.05).
6.3 Performance of the APE systems on the 2016 WMT test set (EN-DE) (“†” indicates statistically significant differences wrt. MT_Baseline with p < 0.05).