Automatic Error Detection and Correction in Neural Machine Translation

A comparative study of Swedish to English and Greek to English

Anthi Papadopoulou

Uppsala universitet
Department of Linguistics and Philology
Master’s Programme in Language Technology
Master’s Thesis in Language Technology
June 10, 2019

Supervisors: Eva Pettersson, Uppsala University; Anna Sågvall Hein, Convertus AB; Sebastian Schleussner, Convertus AB

Abstract

Automatic detection and automatic correction of machine translation output are important steps to ensure an optimal quality of the final output. In this work, we compared the output of neural machine translation for two different language pairs, Swedish to English and Greek to English. This comparison was made using common machine translation metrics (BLEU, METEOR, TER) and syntax-related ones (POSBLEU, WPF, WER on POS classes). It was found that neither the common metrics nor the purely syntax-related ones captured the quality of the machine translation output accurately, but the decomposition of WER over POS classes was the most informative one. A sample of each language pair was taken to allow a comparison between manual and automatic error categorization over five error categories, namely reordering errors, inflectional errors, missing words, extra words, and incorrect lexical choices. Both Spearman’s ρ and Pearson’s r showed a good correlation with human judgment, with values above 0.9. Finally, based on the results of this error categorization, automatic post-editing rules were implemented and applied, and their performance was checked against the sample and the rest of the data set, showing varying results. The impact on the sample was greater, showing improvement in all metrics, while the impact on the rest of the data set was negative. An investigation of this, alongside the fact that correction was not possible for Greek due to extremely free reference translations and the lack of error patterns in spoken language, reinforced the belief that automatic post-editing is tightly connected to consistency in the reference translation, and suggested that, when handling machine translation output, more than one reference translation may be needed to ensure better results.

Contents

Acknowledgments ...... 5

1. Introduction ...... 6
   1.1. Purpose ...... 6
   1.2. Outline ...... 7

2. Background ...... 8
   2.1. Machine Translation ...... 8
      2.1.1. Rule-Based Machine Translation ...... 8
      2.1.2. Statistical Machine Translation ...... 8
      2.1.3. Neural Networks and Neural Machine Translation ...... 9
      2.1.4. Google Translate ...... 10
      2.1.5. Comparison of NMT to Other MT Methods ...... 10
   2.2. NLP-Cube ...... 11
   2.3. Evaluation of MT Output ...... 11
      2.3.1. Human Evaluation ...... 11
      2.3.2. Error Categorization ...... 12
      2.3.3. Automatic Evaluation ...... 13
         2.3.3.1. Common Metrics ...... 13
         2.3.3.2. Syntax-Oriented Evaluation Metrics ...... 16
   2.4. Automatic Post-Editing ...... 18
   2.5. Convertus ...... 19

3. Data ...... 20
   3.1. Swedish to English (StE) ...... 20
   3.2. Greek to English (GtE) ...... 20

4. Method 21

5. Evaluation ...... 23
   5.1. Common Metrics ...... 23
      5.1.1. BLEU ...... 23
      5.1.2. METEOR ...... 23
      5.1.3. TER ...... 23
      5.1.4. Results and Short Discussion ...... 23
   5.2. Syntax-Oriented Metrics ...... 25
      5.2.1. POSBLEU ...... 25
      5.2.2. Decomposition of WER on POS classes ...... 26
      5.2.3. WPF ...... 27
      5.2.4. Results and Short Discussion ...... 28
   5.3. Error Categorization ...... 31
      5.3.1. Manual Error Categorization ...... 32

      5.3.2. Automatic Error Categorization ...... 32
      5.3.3. Results and Short Discussion ...... 34
      5.3.4. Spearman’s ρ and Pearson’s r rank correlation coefficients ...... 38
      5.3.5. Pipeline and HTML Visualization ...... 39

6. Automatic Post-Editing ...... 40
   6.1. GtE sample data ...... 40
   6.2. StE sample data ...... 40
   6.3. Results and Short Discussion ...... 41
   6.4. APE on the Rest of the Data ...... 42

7. Conclusion and Future Work 45

A. Pipeline 48

B. HTML 49

Acknowledgments

I am grateful to my academic supervisor Eva Pettersson, for sharing her knowledge with me, answering all my questions, and tirelessly correcting all of my typos. Special gratitude to my supervisors at Convertus, Anna Sågvall Hein and Sebastian Schleussner, for taking me on this thesis project, and providing me with valuable data, tools, and most importantly expertise. I would also like to thank my partner, family, and friends for being helpful, patient, and understanding with me throughout this period.

1. Introduction

Neural networks (NN) have been studied and used extensively in the field of computational linguistics, and the field has evolved considerably as a result. In machine translation (MT), one comes across various techniques, from rule-based (RB) and statistical (SMT) approaches, to example-based ones, and finally to approaches using the currently popular neural models for neural machine translation (NMT). No matter how advanced the procedures used are, though, there are always ungrammatical structures in the output, such as wrong verb inflections, or missing source information, such as dropped negations. Evaluating the output of this procedure and detecting errors (also known as quality estimation, QE) is undoubtedly a tiring and time-consuming activity for human evaluators. This is why one can observe a shift from manual human evaluation to more automatic approaches that allow re-use and faster results. The same effort is being put into the process of automatic post-editing (PE), which is considered vital in MT. Manual post-editing is considered tiring since MT systems tend to repeat the same mistakes over and over again. It is therefore desirable to develop automatic post-editing mechanisms that are informed by the errors detected and thus result in good output quality, while also minimizing the human factor in the procedure. The thesis project was carried out at Convertus AB, a machine translation services company in Uppsala, Sweden, which was created as a spin-off from Uppsala University. Convertus AB offers complete machine translation services for quality output, including automatic language checking and machine translation, as well as automatic and manual post-editing of the translated text.

1.1. Purpose

The purpose of this thesis is to evaluate and detect errors in the output of neural machine translation from Swedish and Greek to English, using different kinds of metrics, and to formulate and apply automatic post-editing rules, so as to check their impact on output quality. The tool NLP-Cube1 will be used to aid in the procedure of detecting ungrammatical structures, such as wrong verb inflections, or lack of source information, such as negations. Common quality metrics are to be calculated for the output, and then compared to syntax-related metrics based on the part-of-speech (POS) information derived by the tool. After the errors are categorized, they are passed to post-editing mechanisms, which will be formulated to achieve a higher quality output. There are several research questions that this thesis wishes to address throughout these tasks:

1http://opensource.adobe.com/NLP-Cube/index.html

• How much do common evaluation metrics actually represent the quality of the MT output?

• How can syntax aid in the process of evaluating MT output quality?

• What types of errors exist in neural machine translation output? How can they be categorized manually? What about automatically? How do they differ for the Swedish-to-English and Greek-to-English pairs?

• What types of rules could be formulated for the automatic post-editing step? Can they be common or do they have to be language specific?

• What is the effect of these rules on the quality of the output?

1.2. Outline

The outline of the thesis is the following. In section 2, the background of the study is given, which discusses neural machine translation and its comparison to statistical machine translation, followed by an introduction to NLP-Cube, the tool used in this study for grammatical analysis. After that, studies proposing methods for error analysis based on POS-tags will be presented, and commonly used automatic evaluation metrics will be discussed. Following that, automatic PE will be presented. In section 3, the data used will be presented, and section 4 describes the method. The evaluation results and an in-depth discussion are given in section 5, followed by the automatic PE efforts and their results in section 6. Finally, in section 7, some suggestions for improvement and some guidance for future work will be given.

2. Background

2.1. Machine Translation

Machine translation (MT) is the process of translating text from one language to another using a computer, and it has long been a growing area in the field of computational linguistics, dating back to the 1950s. In its most basic form it substitutes words from the source language with words from the target language. As one can easily understand, this cannot guarantee good translations, since a language does not only consist of words put one after the other. For this reason a number of approaches to MT have been studied, and they are explained in the sections to follow.

2.1.1. Rule-Based Machine Translation

Rule-based machine translation (RBMT) is a method based on grammar and linguistic rules that also requires a bilingual or multilingual lexicon for the translation to be carried out (Sreelekha et al., 2018). RBMT first analyses the source sentence, then transfers the syntactic structure of the sentence using morphological and syntactic rules, and finally generates the target sentence, trying to preserve the meaning. If, for example, a Greek sentence were to be translated into an English one, a dictionary from Greek to English would be needed, as well as grammatical and syntactic rules for the two languages. This approach requires a huge amount of manual work and domain knowledge for the rules to be gathered, prepared, and implemented. This also means that it is time-consuming and expensive to develop, as well as limited in its capacity to translate.

2.1.2. Statistical Machine Translation

Statistical machine translation (SMT) works by calculating the conditional probability that a specific sentence or word of the target language is the translation of a sentence or word of the source language. This means that words with a higher probability will result in a better translation (Sreelekha et al., 2018). The procedure followed in an SMT model is (the standard formulation underlying this idea is sketched after the list below):

• parallel corpus preparation (sentence and word alignment, phrase extrac- tion)

• bilingual translation models and monolingual language models training (a model learns and builds statistical tables)

• decoding (the target language sentences are decoded using the extracted phrases, trained translation model and language model)

• testing (of the model on unseen data)
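The decoding step above is usually presented through the noisy-channel formulation. The equation below is the standard textbook statement of that idea, given here for orientation rather than taken from this thesis:

\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\, P(e)

where f is the source sentence, e is a candidate target sentence, P(f | e) is the translation model learned from the parallel corpus, and P(e) is the monolingual language model.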

Even though it is often highly accurate and easy to train an SMT model, parallel corpora do not exist for all languages, especially low-resource ones, and creating such corpora from scratch is expensive in both time and money. SMT models also usually fail to correctly translate casual style, slang words or idioms if they are not defined in the training data.

2.1.3. Neural Networks and Neural Machine Translation

A neural network (NN), as its name suggests, is a network of (artificial) neurons, inspired by the way a biological neural network processes information. Neural networks have been integrated into many applications such as speech recognition, information retrieval, image analysis, classification, and data processing. They are typically organized into layers, which are made up of a number of interconnected nodes with an activation function. The input layer is responsible for presenting the patterns to the network, which are then passed to a number of hidden layers that process them through weighted connections. The weights are adjusted during training, controlled by a learning rate, according to the input and the patterns detected. Finally, the output layer presents the final product of the process (a small numerical sketch of such a forward pass is given after the list of network types below). An example of a NN is given in Figure 2.1.

Figure 2.1.: A NN with two hidden layers

The arrows that connect the layers represent the weighted connections between the neurons, through which the model reaches its prediction in the output layer. There are many types of NN, with some known examples being:

• recurrent neural networks (RNN) whose connections between the nodes form a directed graph, and they are suitable to use in speech recognition

• long-short term memory (LSTM) units which are composed of input, output and forget states to deal with the information in the cell and

• deep convolutional networks (DCN) which use perceptrons and require little pre-processing, heavily used in image recognition
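To make the layer-and-weights description above concrete, here is a minimal, illustrative forward pass through a small feed-forward network. The layer sizes, sigmoid activation and random weights are choices made here for the sketch, not taken from the thesis or from Figure 2.1:

import numpy as np

def sigmoid(x):
    # Activation function applied at every node
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# A toy network: 3 inputs, two hidden layers of 4 nodes, 1 output node
weights = [rng.normal(size=(3, 4)), rng.normal(size=(4, 4)), rng.normal(size=(4, 1))]
biases = [np.zeros(4), np.zeros(4), np.zeros(1)]

def forward(x):
    # The input layer presents the pattern; each hidden layer transforms it
    # through weighted connections and an activation function.
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(a @ W + b)
    return a  # the output layer presents the final prediction

print(forward(np.array([0.2, -1.0, 0.5])))

During training, the weights would additionally be updated (e.g. by gradient descent with a learning rate), which is the adjustment step described in the paragraph above.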

Neural approaches to natural language processing (NLP) tasks were first introduced during the 1990s, with NMT starting to get attention around 2004 and

being adopted in various deployments. For many, neural models are considered the new norm, since they outperform previous MT model benchmarks and address their shortcomings, requiring less memory for training and promising better results. What an NMT system (very) generally does is take inputs and predict outputs. More specifically, the input is first analyzed (encoding) and transformed into vectors, and then these vectors are decoded into the target output. This is the simple encoder-decoder method, which can be enriched in many ways, such as by adding linguistic information, more languages, or an attention mechanism for long sentences, all to ensure better performance (Koehn, 2010; 2009). One of the shortcomings of NMT is the large amount of training data needed for good performance, which is not always easy to find.
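To make the encoder-decoder idea concrete, here is a minimal sketch in PyTorch. It is illustrative only: the class name, dimensions, GRU choice and the absence of attention are assumptions made for this sketch, and it does not represent Google's system or any model used in this thesis.

import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    """Minimal encoder-decoder: source tokens are encoded into vectors,
    which then condition the decoding of target tokens (no attention)."""
    def __init__(self, src_vocab, tgt_vocab, dim=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encoding: the source sentence is turned into a hidden state vector
        _, hidden = self.encoder(self.src_emb(src_ids))
        # Decoding: the hidden state conditions generation of the target sentence
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), hidden)
        return self.out(dec_out)  # scores over the target vocabulary

model = TinySeq2Seq(src_vocab=1000, tgt_vocab=1000)
logits = model(torch.randint(0, 1000, (2, 7)), torch.randint(0, 1000, (2, 5)))
print(logits.shape)  # (2, 5, 1000)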

2.1.4. Google Translate

As an example, one can think of the best-known MT service used these days, Google's translation service, which was switched from a phrase-based machine translation model to an NMT one. As Wu et al. (2016) show in their paper, the quality of the output outperforms all other baselines. Apart from the common evaluation metrics (e.g. BLEU), they use a human gold standard, which corresponds to the performance of an average bilingual human translator. The advantage of NMT is that it is able to learn the mapping from the source input to the target output directly. Its main weaknesses, though, are its slow training, slow inference speed, difficulty when rare words are encountered, and inability to translate all words of the source input (Wu et al., 2016). This is why many studies have compared NMT to previous models to see whether it is indeed a state-of-the-art method to be used.

2.1.5. Comparison of NMT to Other MT Methods

Castilho et al. (2017) investigate the use of NMT for different domains and different language sets, using automatic evaluation and human evaluation. The results varied, with automatic evaluation favoring NMT, while human evaluation was more favorable towards the output of statistical machine translation. Bentivogli et al. (2016), on the other hand, compare the quality of NMT output to SMT, and more specifically to phrase-based machine translation (PBMT). Their results generally showed that NMT outperforms PBMT in the post-editing effort required, for all sentence lengths and on lexically rich texts, and produces fewer morphological errors, lexical errors, and word order errors. Finally, Burchardt et al. (2017) performed a linguistic evaluation of three MT engines, arguing that testing the quality of an engine based on error rates leads to more useful insights that can help improve the system. Their most important results, apart from the fact that PBMT and NMT performed best in different categories, were that Google's previous PBMT system was outperformed by the new NMT system, while also noting that for some categories the output of NMT was similar to that of PBMT.

2.2. NLP-Cube

NLP-Cube1 is an open-source, state-of-the-art NLP tool written in Python that can perform sentence splitting, tokenization, compound word expansion, lemmatization, tagging and parsing using neural networks, with support for the languages included in the UD Treebanks2. POS-tagging is done using a two-layer bidirectional LSTM, with a regularization step after the first layer. The results obtained for POS-tagging were all above the state-of-the-art ones for the task. It is stated by the researchers that NLP-Cube will help machine translation, and there is also the hope that machine translation tasks will be integrated into the tool

(Boros et al., 2018). An example output of the tool is shown in Figure 2.2.

Figure 2.2.: Example NLP-Cube output

As we can see, the format is the CoNLL-U3 format. The following information is given (a minimal parsing sketch follows the list):

• first column: word index for each new sentence,

• second column: word form or punctuation symbol,

• third column: lemmatized or stemmed word,

• fourth column: universal POS-tag (UPOS),

• fifth column: language specific POS tag,

• sixth column: morphological features (FEATS),

• seventh column: the head of the word in the sentence,

• eighth column: universal dependency relation to the head of the sentence.
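A minimal sketch of reading such a token line in Python, assuming tab-separated columns in the order listed above; the dictionary keys and the example line are chosen here for illustration only:

def parse_conllu_line(line):
    # Split one CoNLL-U token line on tabs and name the columns described
    # above (only the first eight columns are used in this sketch).
    cols = line.rstrip("\n").split("\t")
    keys = ["index", "form", "lemma", "upos", "xpos", "feats", "head", "deprel"]
    return dict(zip(keys, cols))

example = "2\tcats\tcat\tNOUN\tNNS\tNumber=Plur\t3\tnsubj"
print(parse_conllu_line(example)["upos"])  # NOUN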

2.3. Evaluation of MT Output

2.3.1. Human Evaluation

Generally, and as Sun (2010) also claims, in the task of translating there can never be one truly ’correct’ translation, due to differences in translator style, vocabulary range, and domain specialization. Despite that, there are a number of ways to evaluate (N)MT output. One way to

1http://opensource.adobe.com/NLP-Cube/index.html 2https://universaldependencies.org 3https://universaldependencies.org/format.html

achieve evaluation is through human (native speaker) evaluation, which is mainly based on adequacy and fluency, with the former dealing with how much content and meaning from the original text is also present in the output, while the latter is concerned with whether the output is fluent with regard to the target language, mainly dealing with grammatical correctness and idiomatic expressions. That is why Sun (2010) examines how four translators rank various translations of a document from 1 to 4, using these results to correlate the rankings to three automatic metrics of MT evaluation, namely BLEU, TER and GTM (General Text Matcher, which is an implementation of precision and recall). Bremin et al. (2010) in their paper compare four evaluation methods on output derived from three different MT systems. The main comparison is made between human error analysis and automatic metrics, which are common ways to perform evaluation, and reading comprehension and eye-tracking, which are less discussed. It was shown that for large quality differences between the systems, the latter two approaches give results similar to the more common ones. The use of eye-tracking as an evaluation method was also investigated by Guzmán et al. (2015). Their comparison criteria were the background of the people evaluating, as well as the information given to them during evaluation in different parts of the screen, which included the hypothesis sentence, the source sentence, the context of the source sentence, the reference translation, and the context of the reference translation. Their results showed that, between monolingual and bilingual people, monolingual people took longer to evaluate translations, except when information about the target sentence was provided, but were at the same time more consistent in their evaluation. It was also shown that the more information given to the translators, the more time the task needs in order to be completed.

2.3.2. Error Categorization

Apart from fluency and adequacy, what can be considered more important in an evaluation is not how many mistakes there are, but what types of mistakes. For this, a (weighted) categorization of the errors can be made, to be able to determine how much they affect the quality of the output. Popović (2017) in her paper presents an error categorization for German-to-English errors in SMT and NMT. NMT output is claimed to be better at handling issues related to word order, morphology and fluency, such as verb order, articles and phrase structure, etc. It is also noted that the method's main weaknesses are prepositions, ambiguous word translation from English to German, and the English continuous tense. In another paper, by Stymne (2013), an error typology based on a Swedish grammar checker is defined, to be used for error categorization of MT output. It included fluency and adequacy errors, and it showed how all errors were classified into categories, and how verb form errors needed additional categories in order to be classified correctly. The comparison was made against adult and child evaluators, with the method of the paper achieving the highest precision, and a recall lower than the adults' but the same as the children's. It is believed that this method will be useful for annotating MT text, and the idea of automating the procedure is also introduced.

BLAST (Stymne, 2011) is an example of a tool for automatic error analysis of MT output. It is written in Java and can help MT evaluation projects through an easy-to-use graphical user interface, through which one can add, edit, or search existing annotations of the text, or even highlight the similarities of the output to a reference translation. Its purpose is to be flexible, so as not to be tied to specific systems, languages or error typologies.

2.3.3. Automatic Evaluation

2.3.3.1. Common Metrics

Even though human evaluation methods show good results at times, it is generally believed that an automatic evaluation method with a high correlation to human judgment is preferable, since it is less time consuming, less expensive, and not language bound. Using such methods ensures less human labour and the ability to re-use the method with minimal cost. The most known and commonly used metrics for MT evaluation are BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2011) and TER (Snover et al., 2006), discussed in this section. Papineni et al. (2002) were the first to formulate such a metric, called BLEU (ranging from 0 to 1), based on the idea that for such a metric to be useful it would need to be numerically close to a human translation (or translations), in the spirit of the word error rate (WER) metric. It can also be used with many translations as references, which is considered optimal since, as mentioned before, more than one translation can be considered correct for one source sentence. The metric is based on unigram precision, which for a hypothesis translation X and a reference translation Y means:

Precision = |X ∩ Y| / |X|

As it is easy to guess, this means that when a word appears several times in the hypothesis, each occurrence is counted as a match as long as the word appears somewhere in the references, which can result in a high precision that does not denote a good translation. As an example, one can think of the following (Papineni et al., 2002):

• ref 1: The cat is on the mat

• ref 2: There is a cat on the mat

• hyp: The the the the the the the

Basic unigram precision for this hypothesis sentence would be 7/7 = 1, since all the words of the hypothesis appear in the reference sentences. This is why the researchers propose the calculation of the modified n-gram precision, which essentially counts, for each word of the candidate, the maximum number of times it occurs in any single reference translation. The count of each candidate word is then clipped to that maximum, the clipped counts are summed over all the words of the candidate, and the sum is divided by the total number of n-grams in the hypothesis. This can be done for all n-grams, and not only unigrams, combined by means of the geometric mean. Taking the before-mentioned example again, the modified

unigram precision would now be 2/7, which is a more suitable way of scoring the translation. Another length-related problem tackled by Papineni et al. (2002) is that modified n-gram precision already penalizes candidate sentences that are longer than their reference sentences, but not ones that are too short. For this reason they introduced the brevity penalty, which takes word choice and not just word count into consideration, so as to not allow short candidates to obtain a high score. This is done for the whole corpus, by first finding, for each hypothesis sentence, the length of its best-matching reference, summing these lengths, and then comparing this sum to the length of the whole hypothesis translation:

BP = 1 if c > r, and BP = e^(1 − r/c) if c ≤ r

where BP is the brevity penalty, c is the length of the hypothesis, and r the length of the reference(s). Then, for the BLEU score to be derived, the geometric average is employed, with n-grams up to 4 and uniform weights summing to one:

BLEU = BP · exp( Σ_{n=1..N} w_n · log p_n )

where BP is again the brevity penalty, w_n are the weights, and p_n is the modified n-gram precision (the exponential of the weighted sum of logarithms computes their geometric average). One drawback of the metric is that at sentence level it has a low correlation with human judgment due to the calculation of the geometric mean precision, so if a "higher order n-gram precision of a sentence is 0, then the BLEU score of the entire sentence is 0, no matter how many 1-grams or 2-grams are matched" (Chen and Cherry, 2014). For this reason, seven smoothing techniques were investigated, to enable BLEU to perform better at sentence level. All of the proposed smoothing techniques had a positive effect on sentence-level BLEU calculation, with the three novel ones proposed by Chen and Cherry (2014) resulting in higher correlation than the four already existing ones. METEOR (Denkowski and Lavie, 2011) was then created to address some of the shortcomings of the before-mentioned metric, and it was shown to outperform it regarding correlation to human judgment. It is based on a parameterized harmonic mean of precision and recall, typically giving a higher weight to recall. It computes a score based on "explicit word-to-word matches between the translation and a given reference translation" (Denkowski and Lavie, 2011), and if there exists more than one reference, each score is calculated and the best one is used. It also supports stemmed word matching and synonymous word matching, as well as parameter tuning, so as to enable better results for different languages. The formulas for calculating weighted precision, recall and the parameterized harmonic mean, respectively, are the following:

P = Σ_i w_i · (δ · m_i(h_c) + (1 − δ) · m_i(h_f)) / (δ · |h_c| + (1 − δ) · |h_f|)

R = Σ_i w_i · (δ · m_i(r_c) + (1 − δ) · m_i(r_f)) / (δ · |r_c| + (1 − δ) · |r_f|)

F_mean = P · R / (α · P + (1 − α) · R)

Content and function words are represented by h_c and h_f respectively for the hypothesis, and by r_c and r_f for the reference. m_i is each of the matchers, so m_i(h_c), m_i(h_f), m_i(r_c), and m_i(r_f) are the numbers of such words that appear in the matches. w_1 ... w_n are the weights of the matchers and δ is the relative weight given to content words. The fragmentation penalty assigned by METEOR is calculated as follows:

Pen = γ · (ch / m)^β

In the above formula, ch is the number of chunks and m the total number of matches. This penalty takes into account differences in word order. Finally, the METEOR score is calculated as:

Score = (1 − Pen) · F_mean

α, β, and γ are parameters that can be tuned to achieve a higher correlation with human judgment. As with BLEU, a higher METEOR score indicates a better translation quality for the MT output. Finally, the translation error rate, TER (Snover et al., 2006), was created as a way to measure the quality of an output based on meaning, something that is closer to human judgment. The implementation is written in Java, and the metric achieves this by calculating the amount of post-editing required with regard to a reference translation, as follows:

TER = (number of edits) / (average number of reference words)

The edits counted in the formula are substitutions, deletions, insertions and word shifts, all with the same cost (1). The first three edit types are calculated using dynamic programming, while shifts are found using a greedy search: shifts are applied first, chosen so as to minimize the number of insertions, deletions and substitutions needed, until no more beneficial shifts remain. An example of TER calculation is the following (Snover et al., 2006):

Applying TER to the example hypothesis and reference from Snover et al. (2006), we identify 4 edits, namely 1 shift ("this week"), 2 substitutions ("Saudi Arabia" -> "the Saudis"), and 1 insertion ("American"), which gives a score of 31%. Since this metric is based on errors, it differs from the other two in that a small TER value indicates better quality. The problem with all of these kinds of metrics still remains: a sentence can be translated correctly in many different ways, and thus comparing MT output to a few human-translated references will have shortcomings.

2.3.3.2. Syntax-Oriented Evaluation Metrics

Automatic metrics like the ones described in 2.3.3.1 do not necessarily point to the quality of the MT system, in the sense that they do not carry information regarding the nature and types of errors, and a correspondence between the scores and the errors is difficult to find. This is why many researchers have put effort into evaluating MT output with the use of different linguistic features, either extending already existing metrics to include these features, or following a different route altogether. Tezcan et al. (2016a) in their paper describe a method of detecting grammatical errors in Dutch-to-English MT output using dependency parsing and treebank querying. A dependency tree is a directed acyclic graph which represents all the relations of the words in a sentence. As a first approach, when no parse covers all the input, this is marked as an indicator of an error. As a second approach, the subtrees of a sentence are queried against a gold standard by "using dependency relation and syntactic category information on phrase and lexical level" (Tezcan et al., 2016a), with matching constructions indicating errors. Both approaches were tested at the word and sentence level, and it was shown that they perform highly accurately at the sentence level, especially when combined. In Popović et al. (2006) a new framework is proposed for automatic error analysis based on morpho-syntactic information, between Spanish and English. They tried to tackle noun- and adjective-related syntactic differences, as well as verb, adjective and noun inflection errors between the languages. Their results showed that this automatic method performed as well as human evaluation. Following this paper, Popović and Ney (2006) worked specifically on error analysis for verb inflections in Spanish. For example, it can be the case that the same form of an English verb corresponds to two different Spanish forms. The metrics used were the position-independent error rate (PER) and word error rate (WER), and mainly precision, recall and f-measure based on PER, for each verb inflection. It was noted, though, that since there is no one-to-one correspondence in the translation task, these metrics are not an exact equivalent of precision, recall and f-measure. PER-based recall (perR) basically indicates how many verb inflections in the reference are found, and is calculated as follows:

• all verb forms of the verb type are extracted from the reference

• each verb, whose base form occurs in the hypothesis sentence that corresponds to it, is extracted from the hypothesis

• PER is calculated and subtracted from 1, giving the perR

PER-based precision (perP) indicates how many verb inflections in the hypothesis are translated correctly and is calculated as follows:

• all verb forms of the verb type are extracted from the reference

• each verb, whose base form occurs in the hypothesis sentence that corresponds to it, is extracted from the reference

• PER is calculated and subtracted from 1, giving the perP

and finally PER-based F-measure (perF) is the standard harmonic mean between the two before mentioned ones, indicating how difficult the translation of the inflection is, and is calculated as follows:

perF = 2 · perP · perR / (perP + perR)

In the results, the PER-based f-measure shows what types of inflections are hard to translate, while the PER-based precision and recall graphs indicate which inflections tend to be wrongly translated. Popović and Ney (2007), in another paper, deal with the use of POS-tags and the decomposition of WER and PER, and additionally with how these could be used in automatic error analysis to estimate the number of inflectional errors and to find the distribution of missing words with regard to POS classes. It was shown that the results correspond to results obtained by human analysis. Finally, Popović and Ney (2009) propose some syntax-oriented evaluation metrics, extending common ones like BLEU, METEOR, and TER to use syntactic information so as to strengthen them. The metrics proposed are calculated on POS n-grams rather than words, and are

1. POSBLEU (BLEU score calculated on POS tags)

2. POSP (number of POS n-grams in the hypothesis with a corresponding POS-tag in the reference)

3. POSR (number of POS-tags in the reference that are also present in the hypothesis)

4. POSF (all POS-tags that have a corresponding one both in reference and hypothesis) and

5. WPF (an F-measure that takes into account both words and POS-tags)

It was tested and proven that these metrics correlate well with human judgment, regarding both adequacy and fluency, with POSBLEU and POSF scores being the better performing ones, and with WPF not performing as well, but nevertheless having the advantage of using both words and POS-tags.

2.4. Automatic Post-Editing

Post-editing (PE) is a necessary step in MT to enable the output to have the highest possible quality. This step is usually done manually by human ’post-editors’, but it is a labour-intensive task, so automatic ways to carry it out are more desirable. Many approaches have been, and are being, used for automatic PE, including regular expressions, statistical methods, and even neural methods. Kjellin (2012) in his thesis presents a method for generating post-editing candidate rules from a parallel corpus of MT output and its manually post-edited version. This rule set is then filtered by applying the rules to another corpus and, using BLEU, TER and METEOR, keeping the rules that are able to increase the scores. The results showed that the proposed method is able to increase the scores for translations from the same domain as the corpus used for filtering. The impact of PE rules was also investigated by Mostofian (2017) in her thesis. After using BLEU and TER to estimate the quality of the translation and analyzing the results, categories of errors were identified so as to aid in writing the automatic PE rules. For the largest data set of the study, the results of the experiments showed an improvement of the BLEU score. Quality estimation (QE) and PE were also combined by Chatterjee et al. (2018). Three different strategies are used in the paper, regarding the roles of QE and PE. QE can either be: 1. an activator (when the quality is below a certain threshold, PE is activated), 2. a guide (QE labels are used to guide PE rules on which tokens need change), or 3. a selector (QE predictions are used to select the best solution between the raw MT output and its automatically corrected version). These strategies were tested, and the results showed that the second and third strategies resulted in an improvement in PBMT output. Rafael Guzmán in his work (Guzmán, 2007; Guzmán, 2008) also investigated automatic ways of PE using regular expressions and linguistic patterns between English and Spanish. These linguistic patterns were used to identify errors that occurred systematically in the MT output; using regular expressions he proposes to match these errors and subsequently replace them with the correct token. The main errors he identified were misspellings, punctuation, articles, prepositions, grammatical agreement, word order, reflexive pronouns, style, and redundancies. It is stressed that these patterns should first be grouped not only by frequency but also by relevance, so as to ensure good performance. Guzmán (2008) further analyses the use of regular expressions, but this time mainly focused on the task of disambiguation in translation. This is done by enriching the regular expressions with context information. He then proceeds to tune the regular expressions very precisely, so as to deal with mistranslation of -ing forms, mistranslation of subordinate clauses, and mistranslation of verbs with ambiguous meanings from English to Spanish. The results point to a connection between predictable patterns in MT output and the good performance of regular expressions in the task of PE.
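To illustrate the general regex-replacement idea discussed above, here is a toy sketch; the patterns below are invented for illustration and are not rules taken from Guzmán's work or from this thesis:

import re

# Hypothetical rule set: (pattern, replacement) pairs targeting errors that
# occur systematically in MT output. The rules are made-up examples.
RULES = [
    (re.compile(r"\bmore better\b"), "better"),      # redundancy
    (re.compile(r"\ba ([aeiouAEIOU])"), r"an \1"),   # article agreement
    (re.compile(r"\s+([,.;:])"), r"\1"),             # space before punctuation
]

def post_edit(sentence):
    for pattern, replacement in RULES:
        sentence = pattern.sub(replacement, sentence)
    return sentence

print(post_edit("This is a example of a more better output ."))
# -> "This is an example of a better output."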

2.5. Convertus

Convertus AB offers translation systems through an in-house web application, which customers can use by uploading their documents (a variety of formats is supported) to be translated by one of the engines (RBMT, SMT, NMT). The user can then edit and polish the translation through an editing interface. Since the application keeps track of all the steps in the translation process, the user is able to follow multiple documents. Also included is a translation memory, which saves previously seen sentences. Some of the services include Convertus BTS and SDL Trados Studio.

3. Data

3.1. Swedish to English (StE)

One of the corpora used in this study contains data collected from 1177.se, which is a webpage providing information on health-care services for individuals residing in Sweden. It is also available in 10 other languages, including English. The corpus was created from the website around 2016, and it consists of four txt files of the most common Swedish sentences, and one csv file, which includes 2217 Swedish-to-English sentence pairs and is the one used in this thesis. The data are publicly available, and the before-mentioned files were provided by Convertus AB and were split into two different documents, the source sentences and the reference sentences. The source sentences were passed through Google's API so as to acquire the hypothesis sentences. Google API services were provided by Convertus AB.
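A minimal sketch of this preparation step; the file name, delimiter and column order below are assumptions made for illustration, not details documented in the thesis:

import csv

with open("1177_pairs.csv", newline="", encoding="utf-8") as f, \
     open("source.sv.txt", "w", encoding="utf-8") as src, \
     open("reference.en.txt", "w", encoding="utf-8") as ref:
    for row in csv.reader(f):
        # Assumed layout: Swedish sentence in column 0, English reference in column 1
        swedish, english = row[0], row[1]
        src.write(swedish.strip() + "\n")
        ref.write(english.strip() + "\n")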

3.2. Greek to English (GtE)

The other corpus used is the Europarl parallel corpus, a corpus of the European Parliament proceedings in more than 21 languages. As described in Koehn (2004), to create the parallel corpora, matching items were first found and labeled with the corresponding document IDs. Then sentence boundaries were identified in pre-processing, and finally the sentences were aligned using a tool based on the Church and Gale algorithm (Gale and Church, 1993). The corpus is free to download and use. After the documents were acquired, a specific parliament session was chosen, from which 2217 random sentences were extracted, to match the size of the Swedish data.

4. Method

The purpose of this study is not only to compare commonly used metrics for MT output evaluation to syntax-aided ones, but also to categorize the errors in the output and correct them through APE rules. The following flowchart depicts the procedure, which is described in the following chapters and explained shortly after the figure.

[Flowchart: Use Google's API for NMT -> Automatic evaluation: Common Metrics -> Linguistic Analysis: NLP-Cube -> Automatic evaluation: Syntax-related Metrics -> Error Categorization -> Write and implement APE rules]

After the data are all gathered, the source language sentences will be run through Google’s API to be translated. Then each language pair will consist of the following:

• a txt file containing the source language sentences,

• a txt file containing the hypothesis sentences (acquired through Google’s NMT), and

• a txt file containing the reference sentences (to be used for evaluation)

Homogeneity is ensured, since the outputs will be derived from the same MT method and have the same size.

21 Following that BLEU, METEOR and TER scores will be calculated, so as to judge the quality of the translation. As a comparison, syntax-related metrics, namely POSBLEU, WPF and WER over POS classes, will also be calculated to gather more information on the types of errors. To do that first, NLP-Cube is used on the data, to enable the use of various syntactic information, as visible in Chapter 2.2. The data will then be divided for the error categorization task, where for each language pair, 200 sentences will be randomly taken so as to allow for the creation of a development set which will be representative of the whole set. From these 200 sentences errors will be targeted and categorized so as to be used in the automatic post-editing step. After the APE rules are implemented, their impact will be noted by re-evaluating the quality of the translation on the sample data set and the rest of the data set.

5. Evaluation

5.1. Common Metrics

5.1.1. BLEU

After acquiring all necessary data, we begin by automatically estimating the quality of the translation we obtained using the BLEU score. A score of 1 indicates a perfect translation, while a score of 0 indicates a bad translation. As explained in 2.3.3.1, the BLEU score is more accurate when a smoothing technique is applied to it. For this experiment the fifth smoothing technique described in Chen and Cherry (2014) was used, which was a novelty of their paper. It is based on the idea that "matched counts for similar values of n should be similar" (Chen and Cherry, 2014), and this is why it works by averaging the matched counts of the (n−1)-, n- and (n+1)-grams. For both the BLEU score calculation and the smoothing technique, NLTK's (Bird and Klein, 2009) BLEU score from the translate module was used, alongside the SmoothingFunction() module1.
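A minimal sketch of this calculation with NLTK. Whether the thesis aggregated scores per sentence or over the whole corpus is not stated, so a single sentence pair with invented tokens is shown:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method5   # smoothing technique 5 from Chen and Cherry (2014)

reference = "there is a cat on the mat".split()
hypothesis = "the cat sat on the mat".split()

score = sentence_bleu([reference], hypothesis, smoothing_function=smooth)
print(round(score, 3))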

5.1.2. METEOR

Calculation of METEOR was done using a tool written in Java2. As was mentioned in 2.3.3.1, this metric is more informative in estimating translation quality than BLEU. As with BLEU, the lower the score, the worse the quality of the translation.

5.1.3. TER

Finally, to calculate TER, the general algorithm used is depicted in Figure 5.2. For this experiment with TER, the tercom3 program was used (Snover et al., 2006). As was mentioned in 2.3.3.1, and as opposed to BLEU, the lower the TER score, the better the quality of the translation, since it means that fewer edits were counted. The results obtained are discussed in the following section.

5.1.4. Results and Short Discussion

Table 5.1 shows the automatic evaluation scores of BLEU, METEOR, and TER obtained after the source sentences were translated using Google's API services.

1https://www.nltk.org/_modules/nltk/translate/bleu_score.html 2https://github.com/cmu-mtlab/meteor 3https://github.com/jhclark/tercom

Figure 5.2.: TER algorithm (Snover et al., 2006)

Language Pair / Score   BLEU    METEOR   TER
StE                     0.302   0.321    0.612
GtE                     0.331   0.347    0.583

Table 5.1.: Scores per Language Pair

As one can see from these preliminary results, all three metrics denote an unsatisfactory translation output in relation to the reference given, especially given that they are believed to correlate well with human judgment. For the GtE language pair we can see that the performance is slightly lower than for StE, by some points. It is also visible that, even though the scores do not denote a good translation quality in general, the more information a metric includes in its calculation, the more one can infer about the translation output. For the GtE language pair, for example, we can see a slight difference between the BLEU and METEOR scores; thus, one could argue that METEOR does a better job at depicting the quality of the translation, since it takes into account features like synonyms. This is also the case for the StE pair, where we can see that the BLEU score is small (thus denoting a bad translation quality), and METEOR has a higher score than BLEU which, nevertheless, also denotes errors in the translation hypothesis. If one combines this with the relatively high TER scores, then one can infer that the quality of the MT output is poor. It should be noted, though, that these metrics work better when multiple references are given. So the low scores and the difference between them could also be attributed to the fact that only one reference was provided for each sentence.

5.2. Syntax-Oriented Metrics

The metrics calculated above, even though they denote the general quality of the translation, are not able to provide information on the kinds of errors present in the output, only on their number. For this reason, in this section, new metrics will be calculated, based on syntax, that are able to provide more clues about the kinds of errors that exist in the output, enabling us to target them, group them, and correct them later on. This was achieved through the use of the NLP-Cube tool, as presented in 2.2, on both the hypothesis and the reference data, which enabled us to gather all syntactic information from the CoNLL-U format output by the tool. It is of course expected that the use of syntactic information may introduce more noise into the calculation of the metrics, since most tools do not perform well on erroneous MT output, but since this tool is state-of-the-art, with good performance compared to other baselines, it is believed that these errors will not have a great impact on the calculation. For reference, Table 5.2 shows the performance of the tool on four different Universal Dependencies English corpora (en_ewt, en_gum, en_lines, and en_pud), for all the tasks it can perform (Boros et al., 2018).

Corpus     Tok    SS     Word   Lemma  UPOS   XPOS   Morpho CLAS   BLEX   MLAS   UAS    LAS
en_ewt     99.26  76.32  99.26  94.51  95.25  94.83  96.03  79.31  73.77  73.75  85.49  82.79
en_gum     99.65  82.13  99.65  91.70  94.71  94.42  95.64  75.01  65.38  67.42  84.10  80.59
en_lines   99.91  87.80  99.91  93.89  96.38  95.01  96.46  75.40  67.82  69.28  82.58  78.03
en_pud     99.74  95.70  99.74  94.32  95.14  93.88  94.99  82.36  76.12  72.76  88.27  85.31

Table 5.2.: NLP-Cube performance on English data

5.2.1. POSBLEU

The first syntax-oriented metric calculated is POSBLEU. As its name suggests, its main idea is the calculation of the classic BLEU score on the POS tags of the words instead of on the words themselves. As Popović and Ney (2009) show in their article, this novel metric was one of the most promising ones introduced, and it correlated well with human judgment at the document level, but not as well at the sentence level. POSBLEU, along with other metrics, was compared to human judgments using Spearman's correlation coefficient ρ, which describes the relation between two variables using a monotonic function. This means that it will be high if the variables are similar, and low otherwise. When comparing human judgment to automatic translation metrics, the same generally holds, meaning that the higher the coefficient for an automatic metric, the more correlated it is with human judgment. Experiments by Popović and Ney (2009) showed that POSBLEU scored 0.642 and 0.626 (on a scale from −1 to 1) on adequacy and fluency respectively, and 0.712 (on the same scale) on the sentence ranking, where human evaluators were asked to rank

4https://universaldependencies.org/treebanks/en_ewt/index.html 5https://universaldependencies.org/treebanks/en_gum/index.html 6https://universaldependencies.org/treebanks/en_lines/index.html 7https://universaldependencies.org/treebanks/en_pud/index.html

translated sentences which were relevant to one another. It was the highest value acquired in the paper among all the introduced metrics, which is why this metric is used here. To calculate POSBLEU, the same tool and the same smoothing technique were used as in 5.1.1. The input was two files in which each word of the hypothesis and reference documents was replaced with its POS tag (UPOS tag plus morphological features), derived from running the documents through NLP-Cube. As was mentioned, the correlation of this metric with human judgment at the sentence level was not as high as at the document level. This is explained by the fact that mere POS tags lack lexical information.
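A minimal sketch of this tag-replacement step, reusing NLTK's BLEU on tag sequences. Joining the UPOS tag and the feature string with "|", and the example triples, are assumptions made here for illustration:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def to_tag_sequence(parsed_sentence):
    # parsed_sentence: list of (form, upos, feats) triples taken from the parser output.
    # Each word is replaced by its UPOS tag plus morphological features.
    return [f"{upos}|{feats}" for _, upos, feats in parsed_sentence]

hyp = [("cats", "NOUN", "Number=Plur"), ("sleeps", "VERB", "Number=Sing")]
ref = [("cats", "NOUN", "Number=Plur"), ("sleep", "VERB", "Number=Plur")]

smooth = SmoothingFunction().method5
posbleu = sentence_bleu([to_tag_sequence(ref)], to_tag_sequence(hyp),
                        smoothing_function=smooth)
print(round(posbleu, 3))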

5.2.2. Decomposition of WER on POS classes

This metric, as in Popović and Ney (2007), was introduced as a novel metric that helps identify types of errors in the MT output through the decomposition of the WER metric over different POS classes. Word error rate (WER), in general, is a metric which counts the number of Levenshtein distance operations necessary to transform the hypothesis sentence into the reference sentence:

WER = (S + D + I) / N

where S, D, and I denote the numbers of substitutions, deletions and insertions, and N the number of reference words. For reference, it should be mentioned here that the WER scores for the two data sets were:

• WER for StE: 70.58% (24090/32128)

• WER for GtE: 74.78% (56860/76028)

This new metric allows for identification of how much each erroneous word contributes to the WER score. For each POS class this metric can be calculated as (Popović and Ney, 2007):

WER(p) = (1 / N) · Σ_k n(p, err_k)

where err_k denotes the set of erroneous words in sentence k with respect to the best reference, p is the POS class, and N is the total number of reference words. This means that n(p, err_k) is the number of errors in err_k produced by words with POS class p. As an example, one can think of the two following sentences (Popović and Ney, 2007):

• hypothesis:"Mrs Commissioner, twenty-four hours is sometimes too much time."

• reference:"Mister Commissioner, twenty-four hours sometimes can be too much."

Standard WER for this sentence is 33.3% (4/12). Now if we take the same two sentences and add their POS tags we get the following:

• hypothesis:"Mrs(N) Commissioner(N) ,(PUN) twenty-four(NUM) hours(N) is(V) sometimes(ADV) too(ADV) much(PRON) time(N).(PUN)"

• reference:"Mister(N) Commissioner(N) ,(PUN) twenty-four(NUM) hours(N) sometimes(ADV) can(V) be(V) too(ADV) much(PRON) time(N).(PUN)"

If we decompose WER over the POS classes of the sentences we get:

• WER(N) = 1/12 = 8.3%

• WER(V) = 2/12 = 16.7%

• WER(ADV) = 1/12 = 8.3%

As is visible, this decomposition gives more information on the kinds of errors that exist in the sentence, as opposed to a plain score of how many errors there are (e.g. deletion edits could denote missing words). To calculate this metric, three functions were combined in a Python script. First, a standard WER function is run, which outputs the edit matrix rather than the WER score. After that, another function uses this matrix to append to a list the words that were involved in each edit. Finally, the third function takes each word of the list and replaces it with its UPOS tag obtained through NLP-Cube. It should be noted that for substitution and deletion errors the POS tags of the reference were taken into account, while for insertion errors the POS tag of the hypothesis was used. The POS tags are then distributed to different lists according to the POS class. Finally, the length of each list is divided by the length of the reference document to obtain the desired percentage.
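A condensed sketch of this procedure (a single alignment function instead of three, and a small hand-made word-to-UPOS dictionary standing in for the NLP-Cube lookup; no claim is made that it reproduces the exact numbers used in this thesis):

from collections import Counter

def wer_errors(hyp, ref):
    """Return (operation, word) pairs from a Levenshtein alignment of token lists.
    For substitutions and deletions the reference word is kept,
    for insertions the hypothesis word, as described above."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    errors, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] and ref[i - 1] == hyp[j - 1]:
            i, j = i - 1, j - 1                                   # match
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            errors.append(("sub", ref[i - 1])); i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            errors.append(("del", ref[i - 1])); i -= 1
        else:
            errors.append(("ins", hyp[j - 1])); j -= 1
    return errors

def wer_per_pos(hyp, ref, pos_of):
    # pos_of: word -> UPOS mapping, which would normally be filled from NLP-Cube output
    counts = Counter(pos_of.get(word, "X") for _, word in wer_errors(hyp, ref))
    return {pos: count / len(ref) for pos, count in counts.items()}

ref = "Mister Commissioner , twenty-four hours sometimes can be too much time .".split()
hyp = "Mrs Commissioner , twenty-four hours is sometimes too much time .".split()
pos_of = {"Mister": "NOUN", "Mrs": "NOUN", "is": "VERB", "can": "VERB", "be": "VERB"}
print(wer_per_pos(hyp, ref, pos_of))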

5.2.3. WPF

WPF was another metric introduced by Popović and Ney (2009), to counteract the before-mentioned disadvantage of a syntax-oriented metric lacking lexical information. For this reason, an F-measure on both words and POS-tags is proposed. It is calculated by taking into account all words and all POS tags which have a counterpart in both the reference and the hypothesis. According to the experiments of the paper, the correlation of this score with human judgment is not as high as for the other metrics, which were purely POS-based. It does, however, have the advantage over them of including the lexical aspect in the calculation of the score. To calculate WPF, and since no strict formula was provided in the paper, precision, recall and F-score were first calculated for the POS-tags and the words separately as:

Precision = |hypothesis ∩ reference| / (total number of words in the hypothesis)

Recall = |hypothesis ∩ reference| / (number of words in the reference)

F-score = 2 · (Precision · Recall) / (Precision + Recall)

These two F-scores were then added together and divided by two to provide the final WPF score. A simple arithmetic mean was used, as well as Python sets, to ensure O(1) look-up complexity.
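A minimal sketch of this set-based calculation. Note that using sets, as described above, implicitly ignores duplicate tokens; the tag sequences in the example are invented for illustration:

def f_score(hyp_items, ref_items):
    # Precision, recall and F-score over the overlap of two token sequences,
    # using sets for O(1) membership tests.
    hyp_set, ref_set = set(hyp_items), set(ref_items)
    overlap = len(hyp_set & ref_set)
    precision = overlap / len(hyp_set) if hyp_set else 0.0
    recall = overlap / len(ref_set) if ref_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def wpf(hyp_words, ref_words, hyp_tags, ref_tags):
    # Arithmetic mean of the word-level and POS-level F-scores
    return (f_score(hyp_words, ref_words) + f_score(hyp_tags, ref_tags)) / 2

print(wpf("the cat sat on the mat".split(), "there is a cat on the mat".split(),
          ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"],
          ["PRON", "VERB", "DET", "NOUN", "ADP", "DET", "NOUN"]))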

5.2.4. Results and Short Discussion

The following tables show the results obtained in the previous section, with Tables 5.3 and 5.4 presenting the POSBLEU and WPF scores, and Table 5.5 presenting the WER scores over different POS classes.

Language Pair / Score   POSBLEU
StE                     0.792
GtE                     0.846

Table 5.3.: POSBLEU Scores per Language Pair

Language Pair / Score   WPF
StE                     2.086%
GtE                     2.150%

Table 5.4.: WPF Scores per Language Pair

UPOS / Language Pair                 StE      GtE
ADJ (adjective)                      4.98%    4.70%
ADP (adposition)                     5.50%    6.54%
ADV (adverb)                         3.42%    4.13%
AUX (auxiliary verb)                 5.37%    4.79%
CCONJ (coordinating conjunction)     1.33%    1.40%
DET (determiner)                     5.02%    4.42%
INTJ (interjection)                  0.20%    0.37%
NOUN (noun)                          17.71%   14.30%
NUM (numeral)                        0.33%    0.44%
PART (particle)                      2.54%    2.53%
PRON (pronoun)                       5.69%    5.64%
PROPN (proper noun)                  1.08%    1.81%
PUNCT (punctuation)                  7.74%    7.75%
SCONJ (subordinating conjunction)    1.19%    0.82%
SYM (symbol)                         0.06%    0.04%
VERB (verb)                          8.94%    8.13%
X (other)                            0.31%    0.37%

Table 5.5.: WER Score of Various POS tags per Language Pair

As we can see the scores differ from the scores presented in 5.1.4. Starting with POSBLEU we can see that the scores present a much higher translation quality

than the regular BLEU score, though it is apparent that the same pattern holds of the StE pair scoring higher than the GtE pair. Even though a direct comparison of BLEU and POSBLEU is not possible, since they are based on different grounds, from the results one could say that the difference between them is that BLEU indicates that the translations differ at the word level (no exact match), while POSBLEU shows that they do not actually differ that much syntax-wise. This means that even though the exact words of the hypothesis were not present in the reference, their syntactic categories were the same. Taking also into account that the comparison included morphological features, one could infer that the types of errors could be variations of the same word, or stylistic differences. It should be noted here that purely POS-tag-based metrics like POSBLEU are still not adequate on their own. If, for example, the hypothesis sentence were "This apple is tasty." and the reference sentence were "This book is nice.", a purely POS-based metric would score it 100%. This is why such metrics are not informative enough to be used on their own, and this is one possible reason for the high scores obtained above when using purely syntax-oriented measures. For this reason WPF was also calculated in the previous section, as a combination of word and POS-tag metrics, even though its correlation with human judgment was not as high. As we can see from the results, the performance of this metric is rather disappointing. Even though StE again scored higher than GtE, confirming the already observed pattern, a percentage that low could either denote a bad correlation between POS tags and words, or a poorly performing metric. The low performance of the metric can also be explained by looking at how Lavie et al. (2004) used unigram precision, recall, and f-measure, as well as a weighted version of these, and reached the conclusion that mainly recall is useful for MT systems, since it is a given that a good translation system should preserve as much of the meaning of the input as possible. Our low scores could thus also be attributed to a low precision that penalized the f-measure score. Finally, the third metric was chosen to provide more concrete information regarding the NMT output, which will help with categorizing and correcting these errors later on. The decomposition of WER over POS classes takes the two before-mentioned metrics a step further, seeing how the initial scoring (the WER score) is calculated on words, and syntactic information about these words is then used to check which classes play an important role in the quality of the output. It should be noted, though, that word reordering cannot be identified by the WER score, since it is not supported by the algorithm. For both StE and GtE we see that, even though all the other metrics differed in their scoring, this one shows similar results. The most interesting and at the same time highest-scoring categories present in both are verbs and auxiliary verbs, nouns and adjectives, pronouns, but also punctuation. A quick investigation of the reason behind the puzzlingly high percentage for the punctuation category, by debugging the script, showed that Python also took the space between the sentences as a difference, which is not an error to begin with, since the translation of a sentence might be longer or shorter than the reference. The rest of the items in the list were commas and hyphens, indicating differences in compound words and comma placement between the translations. These are thus not considered an important error category.

Regarding the other categories, one can think of various reasons behind the errors that might have led to this performance. For example, for Greek, which as mentioned before is an inflectional language, the noun and adjective errors cannot be attributed to inflection, since English lacks such inflections, so no mistakes can be made there. This can only mean that this percentage reflects possible cases of synonyms or word substitution errors. The verb and auxiliary verb errors, on the other hand, might have a different explanation behind them, since the third person singular inflection in Greek is translated differently from the other inflections, e.g. "πάει" should be translated as "goes", while "πάμε" or "πάω" etc. should be "go". A failure of the MT system to correctly predict these could have led to differences between the reference and the hypothesis. Finally, regarding pronouns, the difference between the two languages does not reflect the before-mentioned issues, since there is a one-to-one correspondence between pronouns. One case where there is no exact correspondence is the 'you' pronoun, which is used for both second person singular and second person plural, while the corresponding Greek pronouns are "εσύ" and "εσείς" respectively. This cannot be connected to the error rate, though, since both the pronoun and the inflection of the verb following it are lost when translated into English, so that "εσύ πάς" and "εσείς πάτε" are both translated as "you go", with only the context telling them apart. It is plausible to believe that the percentage can be attributed to a freer translation in the reference, for example "εγώ κι εσύ" (you and I) having been translated as "we" in the reference instead. For Swedish we observe a similar trend, with verbs, auxiliary verbs, pronouns, nouns and adjectives dominating the table. Verb differences can be attributed to different use of tense between the two languages. For example, "I will go to the cinema" and "I am going to the cinema" both correspond to "Jag ska gå på bio". Regarding nouns and pronouns, it is believed that a similar trend holds as the one described for Greek. Since nouns in Swedish are inflected for number (singular or plural), genitive, and article, a different choice of translation for nouns, as well as an inability to correctly translate the inflection, could have led to the percentages visible. Finally, the explanation behind pronouns is the same as for Greek: a difference in translation choice. Even though the conclusions that can be reached by looking at all three scores and their variations are not as concrete, one can infer that some shortcomings of the translation are mainly tied to the fact that, even though the translations do not match at the word level, they do match at the syntax level. This leads us to believe that the issues of the translation are in line with typical NMT output, which is fluent (good POS metrics) but may suffer from untranslated words, omission of words in long sentences, etc. (bad word metrics). Also, of the three metrics presented here, the decomposition of WER over different POS classes is believed to be the most informative one, and we plan to use it for automatic error categorization later in the thesis. Having thus scored the translations and acquired a fairly comprehensive picture of their quality and of the kinds of errors they might contain, we move on to categorizing those errors.

5.3. Error Categorization

For this part of the thesis, 200 (roughly 10% of the data) random sentences were picked from each language pair, and from each of the three files it contained (original sentences, reference sentences, hypothesis sentences). The random sample was generated so as to contain only unique values, so that no two sentences were the same. These 400 sentences will be referred to as the sample, and they are the ones that will be used to categorize errors for each language pair. In the previous section, by calculating POS metrics, we got a general idea of the kinds of errors that exist in the MT output. Some of the shortcomings of NMT in general are considered to be:

• long sentences

• unknown and rare words

• transliteration and compound words

• rich morphology

• out-of-domain data

All of these should be taken into account when categorizing MT output.

Regarding the language pairs, Swedish, Greek, and English all belong to the Indo-European family tree, with the difference that Greek belongs to an independent branch (Hellenic), while English belongs to the West Germanic group of languages and Swedish to the North Germanic languages. Thus, one can surely expect some differences between these languages. More specifically, from a purely personal comparison of the Swedish language to English, some of the differences between the languages are the use of the auxiliary verb do to form questions in English, the absence of a continuous tense in Swedish, as well as the absence of person and number inflection on Swedish verbs. Regarding tense again, Swedish makes use of the present perfect where English would require the past simple, and where English would use the auxiliary verbs will and be going to to form the future tense, Swedish uses the present simple. Regarding word order, even though they are both SVO languages, it is common in Swedish to invert subject and verb in a sentence. Finally, regarding inflection, Swedish nouns and adjectives can be inflected to denote number (en väska (sing.) - väskor (pl.)) and definiteness (väskan (the bag), väskorna (the bags)).

For the GtE language pair, the most obvious difference is inflection. English is a very lightly inflected language (for example, nouns are only inflected to mark the plural), while Greek is a heavily inflected language, with inflections for nouns, pronouns, and adjectives covering case, number, gender, and person. For verbs, there exist inflections for mood, voice, person, and number, as well as inflection combinations for infinitives. Again, both languages have an SVO word order, but due to the aforementioned aspects of the Greek language, the word order can be freer, with VSO and other constructions allowed (e.g. "του Νίκου το σπίτι", which literally translates to "the Nick's the house", i.e. Nick's house).

In the following parts we are going to manually categorize errors in the sentences, and then use them as a point of comparison for the performance of an automatic error

detection process. For both parts, the findings made by the syntax-oriented measures calculated in 5.2 will be taken into account, pointing to various possible mistakes in the MT output.
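The sampling step described at the beginning of this section is straightforward; the sketch below shows one way it could be implemented, assuming the original, reference, and hypothesis files are line-aligned with one sentence per line. The file names and the seed are illustrative assumptions, not the actual script used in this thesis.

```python
import random

def draw_sample(src_path, ref_path, hyp_path, n=200, seed=1):
    """Draw n unique, line-aligned (source, reference, hypothesis) triples."""
    def read_lines(path):
        with open(path, encoding="utf-8") as f:
            return f.read().splitlines()

    triples = list(zip(read_lines(src_path), read_lines(ref_path), read_lines(hyp_path)))
    unique_triples = list(dict.fromkeys(triples))  # drop duplicate sentences, keep order
    random.seed(seed)
    return random.sample(unique_triples, n)        # sampling without replacement

# Illustrative file names:
# sample = draw_sample("ste.src", "ste.ref", "ste.hyp")
```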

5.3.1. Manual Error Categorization

Manual error categorization and error analysis is tedious and time-consuming work, despite the fact that most of the time a reference translation is available for guidance. This is due to the fact that more than one translation of the same sentence could be considered correct, which, as one can understand, poses a lot of problems for this task. We are going to use the following categories of errors for both the manual and the automatic categorization process (Popović and Ney, 2011):
→ inflectional errors
→ reordering errors
→ missing words
→ extra words
→ incorrect lexical choices
Each category will be counted and summed up according to its appearance in the sample, and the results will be shown in 5.3.3 alongside the results of the automatic error categorization, which is presented in the following section. The procedure that is going to be followed here will not be as strict as the one followed by automatic evaluation metrics. Rather, it will be somewhere in the middle between strict and flexible, allowing for synonyms, word order changes, etc., as long as the meaning of the sentence is preserved. For example, an automatic evaluation metric would score the following sentences low, since the hypothesis does not exactly match the reference.

• ref: "Yesterday, I visited the farmers’ market to do my grocery shopping."

• hyp: "I went to the farmers’ market yesterday to buy some groceries."

The BLEU score acquired was only 0.19, even though the meaning the two sentences convey is the same. We, on the other hand, will allow for changes in the placement of words (e.g. reordering error: 'yesterday'), as long as they appear in their correct place following the syntactic rules of the language, as well as for the use of alternative expressions (e.g. incorrect lexical choice: 'buy groceries' and 'do the grocery shopping', 'went' and 'visited'), as long as they retain the meaning of the sentence without altering it. What we will mark as errors, though, are the words 'my' and 'some', as a missing and an extra word respectively.
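For illustration, a sentence-level BLEU score in this range could be obtained with NLTK as sketched below; this is only a sketch under assumed tokenization and smoothing settings, not the exact script used in the thesis, so the value may differ slightly from the 0.19 reported above.

```python
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

ref = "Yesterday, I visited the farmers' market to do my grocery shopping."
hyp = "I went to the farmers' market yesterday to buy some groceries."

# Sentence-level BLEU needs smoothing, otherwise a single missing higher-order
# n-gram can drive the whole score to zero (cf. Chen and Cherry, 2014).
smooth = SmoothingFunction().method1
score = sentence_bleu([word_tokenize(ref.lower())],
                      word_tokenize(hyp.lower()),
                      smoothing_function=smooth)
print(round(score, 2))
```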

5.3.2. Automatic Error Categorization

As we saw in 5.2.4, out of all the syntax-oriented metrics tested, the decomposition of WER over POS tags was one of the most informative ones, and the one that can potentially not only denote the quality of the MT output, but also help point out the errors in it. This is why it is going to be one of the main metrics used for automatic error categorization, which will be based on the approach by Popović and Ney (2011).

We have already introduced WER in 5.2.2, how it works and the logic behind it. Alongside WER, some variations of the position-independent error rate (PER) will be used. PER is based on the same substitution, insertion, and deletion counts as WER, but, unlike WER, it does not take the word order of the sentence into account. The formula used for its calculation is the following:

$$\mathrm{PER} = \frac{1}{N_{ref}}\Big(\max(N_{hyp}, N_{ref}) - \sum_{e}\min\big(n(e, hyp),\, n(e, ref)\big)\Big)$$

where $n(e, hyp)$ and $n(e, ref)$ are the counts of a word $e$ in the hypothesis and in the reference sentence, and $N_{hyp}$ and $N_{ref}$ are the respective sentence lengths. The counts of a word in the hypothesis and the reference sentence are thus what is used for the calculation. By re-using the example in 5.2.2, slightly modified:

• hypothesis:"Mrs Commissioner, sometimes twenty-four hours is too much time."

• reference:"Mister Commissioner, twenty-four hours sometimes can be too much."

we get three PER errors: two substitutions ('Mrs' for 'Mister' and 'is' for 'can') and one deletion ('be'). Its downside, though, is that it can never identify word order mistakes, nor can it tell exactly which errors were substitutions, insertions, or deletions. Therefore, the following PER-based metrics are going to be used in this thesis:

• HPER (hypothesis, precision-based PER)

• RPER (reference, recall-based PER)

• FPER (f-measure based PER)

In these measures, herr_k and rerr_k are the number of hypothesis words that do not appear in the reference and the number of reference words that do not appear in the hypothesis, respectively. To explain, taking once again the sentences used above as an example: the number of reference errors is 3 and the number of hypothesis errors is 2. The contribution to PER is 3, and if we normalize it over the length of the reference we get PER = 25%. RPER is then equal to 25% (3/12), HPER is 16.7% (2/12), and FPER is 21.7% ((2+3)/(11+12)). All these metrics can be used to determine how much each syntactic category takes part in the errors of the MT output, just as was done with WER. These metrics are going to be used alongside WER to automatically categorize errors, as mentioned already, in the following way:
→ inflectional errors: RPER and the lemma of a word. If there are erroneous words in the hypothesis-reference pairs whose lemmas match, then those are considered inflectional errors.
→ reordering errors: WER plus RPER or HPER. For this we first take all words which are present both in the hypothesis and the reference, and then check whether they were also marked as a WER error. Those are considered reordering errors.
→ missing words: WER plus RPER on the lemmas. Errors present in the deletion list of the WER algorithm, which are also present as an RPER error but do not share the same base form as any errors present as HPER errors. Those are considered missing words.
→ extra words: WER plus HPER on the lemmas. Errors present in the insertion list of the WER algorithm, which are also present as an HPER error but do not share the same base form as any errors present as RPER errors. Those are considered extra words.
→ incorrect lexical choices: errors that do not fall into the category of either inflectional or missing word errors. Those are considered incorrect lexical choices.
In the following section the results of these experiments are presented, compared, and discussed.
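To make the PER-based error counts concrete, the sketch below computes herr and rerr for the example sentence pair, using plain whitespace tokenization as a simplifying assumption; the actual categorization additionally relies on the WER alignment and on the lemmas produced by NLP-Cube, which are omitted here.

```python
from collections import Counter

def per_errors(hyp_tokens, ref_tokens):
    """PER-style errors: hypothesis words with no counterpart in the reference
    (herr) and reference words with no counterpart in the hypothesis (rerr).
    Word order is ignored completely."""
    hyp_counts, ref_counts = Counter(hyp_tokens), Counter(ref_tokens)
    herr = list((hyp_counts - ref_counts).elements())  # candidates for extra words / lexical choices
    rerr = list((ref_counts - hyp_counts).elements())  # candidates for missing words / lexical choices
    return herr, rerr

hyp = "Mrs Commissioner , sometimes twenty-four hours is too much time .".split()
ref = "Mister Commissioner , twenty-four hours sometimes can be too much time .".split()

herr, rerr = per_errors(hyp, ref)
print(herr)  # ['Mrs', 'is']            -> 2 hypothesis errors
print(rerr)  # ['Mister', 'can', 'be']  -> 3 reference errors
print((len(herr) + len(rerr)) / (len(hyp) + len(ref)))  # FPER = 5/23, roughly 0.217
```

These word lists are then intersected with the substitution, insertion, and deletion lists of the WER alignment, and with the lemmas of the words involved, to assign each error to one of the five categories listed above.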

5.3.3. Results and Short Discussion

Table 5.6 presents the results acquired after manually and automatically categorizing the errors found in the sample for each language pair.

Error type                   StE Manual   StE Automatic   GtE Manual   GtE Automatic
inflectional errors                  10              14           58             118
reordering errors                   276             332          232             549
missing words                        39              47           49              53
extra words                          42              59           64              86
incorrect lexical choices           365             435          518             639

Table 5.6.: Results per Language Pair

While going through the sentences manually and categorizing them, a number of observations could be made. Regarding the StE language pair, inflectional errors

mainly affected verbs and nouns, and this was partly connected to reordering errors, since many constructions were turned into the passive voice, hence the changes in verbs. As far as missing words are concerned, the main categories were the words 'SEK' and 'old', while the rest of the instances were determiners, punctuation, and words that added contextual meaning to the sentence; the same was visible in the extra words category. These results also reinforce the findings of 5.2.4, where determiners and punctuation have a relatively high percentage. Finally, regarding incorrect lexical choice errors, some patterns were detected, which made it possible to build a dictionary of common mistakes and their equivalent correct terms, which will assist us while formulating APE rules. Some examples are given below, for each category:
→ inflectional errors
ref: "... the agency that decides prices and subsidies"
hyp: "... the authority that decides on price and subsidy."
→ reordering errors
ref: "... it is investigated further."
hyp: "... it is further investigated."
→ missing words
ref: "Avoid giving the child very sweet drinks ..."
hyp: "Avoid giving very sweet drinks ..."
→ extra words
ref: "The interpreter may participate on a speaker phone."
hyp: "The interpreter may participate in the meeting on a speaker phone."
→ incorrect lexical choice
ref: "... of medicines and stoma care products ..."
hyp: "... of medicines and ostomy items ..."
For the GtE language pair, on the other hand, concrete conclusions could not be reached. Inflectional errors once more affected verbs and nouns the most, for the same reason as the one mentioned for the StE language pair, but for the rest of the categories no pattern could be detected. This can be attributed to the fact that the corpus consists of talks by various persons, and even though it belongs to the political domain, each person's style is expressed through their speech, so error patterns cannot be identified and used for APE. Finally, it should be noted that, specifically for this language pair, and while going through the examples manually, the reference translation seemed at times too free, to the point of distortion of the meaning of the original sentence, making the hypothesis sentence a better match. Again, some examples are given below, for each category:
→ inflectional errors
ref: "... on their everyday life ..."
hyp: "... on their everyday lives ..."
→ reordering errors
ref: "The total investment cost ..."
hyp: "The total cost of investment ..."
→ missing words
ref: "... Mr Vatanen spoke to us of ..."
hyp: "... Mr Vatanen spoke of ..."
→ extra words

35 ref: "Mr President, Commissioner, as proof ..." hyp: "Mr President, Commissioner, ladies and gentlemen, as proof ..." → incorrect lexical choice ref: "... remarks concerning the key amendments." hyp: "... remarks concerning the most important amendments." In the previous paragraphs it was mentioned that sometimes the reference translation was too free, losing thus parts of the meaning of the original sentence. This is also obvious through the number of missing words, and extra word errors presented in the tables above, and it was present in both language pairs, being more prominent though in the GtE language pair. Some examples are given below for reference. → GtE example 1:

• source:"Ζητώ να γίνει αποδεκτή εκείνη η τροπολογία της Επιτροπής Περ- ιφερειακής Πολιτικής και Μεταφορών που εγκρίθηκε ομόφωνα και αφορά τον χρόνο εφαρμογής της οδηγίας."

• hypothesis: "I call for the amendment adopted by the Committee on Re- gional Policy and Transport, adopted unanimously, concerning the timing of the directive."

• reference: "The one unanimously adopted amendment of the Committee on Regional Policy and Transport, which concerns the timetable for imple- menting the directive, is something which I would urge you to support."

→ GtE example 2:

• source:"Είμαι στη διάθεση σας."

• hypothesis: "I am at your disposal."

• reference: "I will keep myself free in this connection."

→ StE example 1:

• source: "I ett tidigt skede går det inte att skilja de två cancerformerna åt och man riskerar därför att behandlas för en cancerform man inte skulle ha fått några besvär av under sin livstid."

• hypothesis: "At an early stage, it is not possible to distinguish the two cancers and therefore you risk being treated for a cancer that you would not have had any problems during their lifetime."

• reference: "There is therefore a danger that you would be treated for a type of cancer that would never have caused any problems during your lifetime."

→ StE example 2:

• source: "En del mottagningar vill då att du skriver en så kallad egenanmälan, som ibland också kallas egenremiss eller egen vårdbegäran."

• hypothesis: "Some receptions then want you to write a so-called self-report, which is sometimes also called self-referral or your own care request."

36 • reference: "Some clinics want you to send them a letter first."

This kind of free translation style in the reference can naturally lead to low scores when evaluating the hypothesis translation. This raises the issue that more than one translation may need to be used as reference, so as to ensure better performance not tied to the style of only one translator. Regarding the comparison between automatic and manual error categorization, we can see that, generally, for both language pairs there is a difference in the numbers presented in Table 5.6, which can be attributed to our flexible style of error categorization versus the program's strict style, as mentioned in 5.3. There is a slight difference, though, in the variation between the two scores for the categories of missing and extra words, and incorrect lexical choices. These three categories are generally difficult to identify and tell apart, so this could be the reason behind those differences. For reference, a sentence was examined more closely, so as to validate the above hypothesis. The first and second ones are the reference and hypothesis sentence annotated manually, and the third and fourth ones are the same sentences annotated automatically. StE sentence:

• ref (man.): "The interpreter may participate on a speakerphone. "

• hyp (man.): "The interpreter can participate in the meeting (extra words) through speakerphone."

• ref (aut.): "The interpreter may (incorrect lexical choice) participate on a (incorrect lexical choices) speakerphone. "

• hyp (aut.): "The interpreter can participate in the meeting through (extra words) speakerphone."

GtE sentence:

• ref (man.): "But the majority voted against it (incorrect lexical choice) ."

• hyp (man.): "However, the majority voted against this corrigendum (extra words)."

• ref (aut.): "But (incorrect lexical choice) the majority voted against it (in- correct lexical choice)."

• hyp (aut.): "However, the majority voted against this corrigendum (extra words)."

As is visible, the automatic error categorization script follows the reference strictly, which is why errors that were not considered as such during manual error categorization were marked. This explains the difference in the numbers present in Tables 5.6 and 5.7.

5.3.4. Spearman's ρ and Pearson's r rank correlation coefficients

To be able to determine whether the automatic error categorization was meaningful and could potentially be useful, we must determine whether it correlated well with human judgment. To achieve this, Spearman's ρ and Pearson's r rank correlation coefficients were calculated for both language pairs. Both measures describe the relationship between two variables, but the difference between them is that Spearman's ρ (Daniel, 1990) is used when the values come from an ordinal scale (e.g. a scale of satisfaction from 1 to 5), is based on ranks, and assesses a monotonic relationship, while Pearson's r (Rodgers and Nicewander, 1988) is used when the values come from an interval scale (e.g. temperature, age, etc.) and assumes a linear relationship. Both of them vary from -1 (perfect negative correlation) to +1 (perfect positive correlation). Most of the time their scores are roughly the same for large data sets, but it is considered best to compute both, so as to extract useful information about the data in case differences are observed. To calculate the scores, SciPy's8 statistical functions package was used. Table 5.7 below shows the results obtained when these two coefficients were computed. Two scatter plots, Figure 5.3 and Figure 5.4, are given as well, which represent the data as points in a diagram. In a scatter plot, if the points follow an increasing pattern (from bottom left to top right), the variables are positively correlated, while a decreasing pattern indicates a negative correlation. The plots were visualised using the matplotlib9 library. Manual scores appear on the X axis, while scores obtained from the automatic measure appear on the Y axis.

Language Pair   Spearman's ρ   Pearson's r
StE             0.99           0.99
GtE             0.89           0.91

Table 5.7.: Spearman’s ρ and Pearson’s r scores per Language Pair

Figure 5.3.: StE Scatter Plot
Figure 5.4.: GtE Scatter Plot
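As an illustration of the computation, the sketch below runs SciPy's spearmanr and pearsonr and produces a scatter plot with matplotlib. The thesis does not state exactly which paired values were fed into the coefficients, so the per-category counts from Table 5.6 (StE) are used here purely as example input; the resulting values therefore need not reproduce Table 5.7.

```python
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, spearmanr

# Example input only: per-category error counts for StE from Table 5.6
manual    = [10, 276, 39, 42, 365]   # manual categorization
automatic = [14, 332, 47, 59, 435]   # automatic categorization

rho, _ = spearmanr(manual, automatic)
r, _ = pearsonr(manual, automatic)
print(f"Spearman's rho = {rho:.2f}, Pearson's r = {r:.2f}")

# Manual scores on the X axis, automatic scores on the Y axis
plt.scatter(manual, automatic)
plt.xlabel("manual error counts")
plt.ylabel("automatic error counts")
plt.savefig("ste_scatter.png")
```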

For the StE pair, we can see that the automatic metrics correlate well with human judgment, especially since both coefficients score the same. For the GtE pair we also get a high correlation with human judgment; the two coefficients are not identical, but both are nevertheless high. The two scatter plots move from bottom

8https://docs.scipy.org/doc/scipy/reference/stats.html 9https://matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html

left to top right, thus showing an increasing pattern and reinforcing the fact that the variables are correlated. This strongly suggests that the automatic error detection method is meaningful and represents a valid automatic approach to the problem of error detection and error analysis.

5.3.5. Pipeline and HTML Visualization

At this stage all of the previously implemented metrics were combined into a pipeline, so as to create a complete evaluation script that would enable an accurate depiction of the quality of an NMT output through an informative output file. The output of an example can be found in Appendix A. In Appendix B, one can find the output of a script which utilizes the incorrect lexical choices list from the error categorization step to produce an HTML file which includes the hypothesis sentence, with the incorrect lexical choice of the reference sentence given in green next to the equivalent hypothesis word. This can potentially be useful for the task of APE.
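A minimal sketch of how such an HTML file could be produced is shown below. The function name, the mapping format, and the example word pairs (taken from the Appendix B example) are illustrative assumptions rather than the actual script.

```python
import html

def render_lexical_choices(hyp_sentence, choices, out_path="lexical_choices.html"):
    """Write the hypothesis sentence to an HTML file; after every word flagged as
    an incorrect lexical choice, the corresponding reference word is shown in green."""
    parts = []
    for word in hyp_sentence.split():
        parts.append(html.escape(word))
        if word in choices:
            parts.append('<span style="color:green">%s</span>' % html.escape(choices[word]))
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("<p>" + " ".join(parts) + "</p>\n")

render_lexical_choices(
    "Any linear reduction in subsidies is unreasonable and usually unfair, that is certain.",
    {"Any": "A", "subsidies": "aid", "unreasonable": "unimaginative", "certain.": "apparent."},
)
```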

6. Automatic Post-Editing

As was mentioned in 2.4, APE is a vital part of any machine translation task, as it helps ensure optimal quality of the output and minimize the drawbacks of whatever machine translation system is used. Since post-editing is tedious work, any kind of automation of it can be considered useful and less time-costly. The results in 5.3.3, and especially the comparison between a flexible manual categorization and a strict automatic one, show that the most important category that needs to be handled is that of incorrect lexical choices, since these are the ones that could potentially denote domain-specific errors in vocabulary choice, or point to more appropriate terms that can improve the overall quality of the MT output. For this part, regular expressions will be used to correct and transform the output of the sample, and their impact will then be assessed using different metrics, both for the sample and for the whole data set.

6.1. GtE sample data

It should be noted that a significant effort was put into choosing Greek data that could potentially lead to meaningful results. New data were chosen twice, but both times error categorization was not able to reach clear conclusions as to what could be corrected. This can be attributed to two things. First of all, even though the corpus belonged to the political domain, the language it contained was spoken, so one cannot expect clear patterns, domain-specific or not, to arise from this kind of language. Even though it consists of speeches, political speaking styles vary a lot and may contain many personal elements, to ensure persuasiveness. Secondly, as also mentioned several times, the reference translation of the original sentences was, most of the time, too free, to the point of meaning distortion, which led to an inability to identify error patterns that could be corrected, since at times the hypothesis translation seemed more appropriate than the reference. These are, thus, the reasons why the GtE language pair will not be handled in this section, and why, in general, data choice for APE rules is of extreme importance for carrying out this task successfully. It is believed that a different corpus, one not based on spoken language, would have provided more reliable data, since political speeches by various individuals are unlikely to produce patterns regarding the domain or the language itself.

6.2. StE sample data

For the StE language pair, though, more concrete conclusions could be reached. It was obvious, after isolating the incorrect lexical choices, that they were medical terms that the NMT system could not translate correctly, thus making the output either completely erroneous, or too formal or not formal enough. Terms that were

encountered more than two times were taken into account, to ensure a sufficient frequency of appearance in the sample as well as in the rest of the data set. Some examples are given below:

• distortion of meaning: get fluid / become dehydrated, winter jerky / winter vomiting disease

• (in)formal choice of word: ostomy items / stoma care products, antipyretic drug / temperature-reducing medicine

These kinds of words were the ones used in the APE rules. Apart from those, some instances of missing words (e.g. 'SEK' or 'old') were also added in the correct context, so as to ensure a more complete meaning. First, common and syntactic metrics were calculated for the sample (as in Sections 5.1 and 5.2), so as to assess the quality of the translation before the APE rules. After that, the procedure used was to run the hypothesis sentences through a series of regular expressions, which aimed to transform the incorrect lexical choices or insert the missing words mentioned above. Finally, the same metrics were calculated again, and the automatic error detection script was re-run, so as to check the impact of the rules. Then the same procedures were applied to the rest of the data, to inspect whether the rules have an impact on the whole data set. The results are presented and discussed in the following section.
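The sketch below illustrates how such rules could be applied with regular expressions. It is only an assumed reconstruction: the patterns correspond to the lexical-choice examples listed above, the file names are placeholders, and the context-dependent insertions of 'SEK' and 'old' are omitted because their exact trigger contexts are not reproduced here.

```python
import re

# (pattern, replacement) pairs built from the incorrect-lexical-choice dictionary
APE_RULES = [
    (re.compile(r"\bostomy items\b"), "stoma care products"),
    (re.compile(r"\bwinter jerky\b"), "winter vomiting disease"),
    (re.compile(r"\bantipyretic drug\b"), "temperature-reducing medicine"),
]

def post_edit(sentence):
    """Apply every APE rule to one hypothesis sentence."""
    for pattern, replacement in APE_RULES:
        sentence = pattern.sub(replacement, sentence)
    return sentence

# Placeholder file names for the hypothesis before and after post-editing
with open("hypothesis.en", encoding="utf-8") as fin, \
     open("hypothesis.ape.en", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(post_edit(line.rstrip("\n")) + "\n")
```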

6.3. Results and Short Discussion

Metric     Sample before APE   Sample after APE
BLEU       0.295               0.309
METEOR     0.321               0.327
TER        0.637               0.612
POSBLEU    0.632               0.642
WPF        6.610%              6.699%

Table 6.1.: Metrics before and after APE Rules, Sample

All three tables presented in this section show that the automatic correction script provides a better translation quality. The first three metrics in Table 6.1, being word-based metrics, show that the changes in vocabulary result in better scores. BLEU and METEOR score higher than before APE, and the lower TER score shows that less post-editing effort is required. POSBLEU and WPF also score higher, which shows that the changed words, and thus their POS tags, match the reference better than the original hypothesis sentences did. Table 6.2 shows that nouns, verbs, proper nouns, and adjectives, the main categories present in the rules, contribute less to WER after APE. It should be noted that some categories, like adjectives or auxiliaries, show a slightly higher percentage. This could be attributed to a changed alignment of the sentences after the modifications made by the APE, which led to these differences, or to a reordering 'penalty' by the WER algorithm.

POS tag   Sample before APE   Sample after APE
ADJ       5.42%               5.78%
ADP       5.02%               5.02%
ADV       3.63%               3.60%
AUX       5.32%               5.38%
CCONJ     1.39%               1.32%
DET       5.22%               5.09%
INTJ      0.07%               0.07%
NOUN      18.73%              17.93%
NUM       0.26%               0.26%
PART      2.31%               2.31%
PRON      5.61%               5.61%
PROPN     1.12%               0.83%
PUNCT     8.12%               8.39%
SCONJ     1.12%               1.06%
SYM       0.00%               0.00%
VERB      8.49%               8.26%
X         0.40%               0.79%

Table 6.2.: WER Score of Various POS tags before and after APE Rules, Sample

Error type                  Sample before APE   Sample after APE
inflectional errors         14                  13
reordering errors           332                 321
missing words               47                  29
extra words                 59                  58
incorrect lexical choices   435                 432

Table 6.3.: Automatic Error detection before and after APE Rules, Sample

Finally, the results of the APE are even clearer in Table 6.3, where one can see that all error categories contain fewer instances after APE than before. This was to be expected, since, for example, the addition of the words 'SEK' and 'old' in the correct context reduced the number of missing words (as they were previously only found in the reference), while changing domain-specific words to more appropriate ones affected all of the categories.

6.4. APE on the Rest of the Data

Seeing how these APE rules affected the sample in a positive way, the same rules were then applied to the rest of the sentences (RD), i.e. the hypothesis sentences minus the ones taken to be used as the sample, to check the impact of the rules and to detect whether the performance is representative for the whole data set. The three tables presented below show the results. As one can see, the results for RD differ from the ones presented in 6.3. Starting with Table 6.4, we can see that almost all the metrics, word-based and syntax-based ones, show a drop in performance after the APE rules are applied to the hypothesis sentences.

Metric     RD before APE   RD after APE
BLEU       0.306           0.297
METEOR     0.310           0.309
TER        0.697           0.625
POSBLEU    0.791           0.781
WPF        2.207%          2.127%

Table 6.4.: Metrics before and after APE Rules, RD

POS tag   RD before APE   RD after APE
ADJ       4.92%           5.09%
ADP       5.48%           5.54%
ADV       3.34%           3.39%
AUX       5.34%           5.38%
CCONJ     1.30%           1.38%
DET       4.96%           4.97%
INTJ      0.21%           0.22%
NOUN      17.56%          18.31%
NUM       0.33%           0.34%
PART      2.56%           2.57%
PRON      5.67%           5.62%
PROPN     1.07%           1.14%
PUNCT     7.67%           7.87%
SCONJ     1.20%           1.22%
SYM       0.06%           0.06%
VERB      8.95%           9.14%
X         0.31%           0.42%

Table 6.5.: WER Score of Various POS tags before and after APE Rules, RD

Error type                  RD before APE   RD after APE
inflectional errors         23              22
reordering errors           1543            1508
missing words               156             160
extra words                 214             232
incorrect lexical choices   1635            1725

Table 6.6.: Automatic Error detection before and after APE Rules, RD

This leads us to believe that neither the word replacements nor the word additions, nor the POS tags of the changed words, matched the reference sentences, which led to this drop in the metrics. For the other two tables, though, Table 6.5 and Table 6.6, the behaviour varies. Table 6.5 shows a slight improvement in percentage for almost all categories of POS tags for the WER metric. Adjectives, nouns, proper nouns, and verbs, which are the main elements of the APE rules, do show an improvement, and seeing how WER does not account for reordering, we could argue that some of the changes

made were actually correct. Table 6.6 shows a drop in inflectional and reordering errors, but an increase for the rest of the categories, which were the ones handled by the rules, especially for the category of incorrect lexical choice, as can be seen.

To investigate this puzzling behaviour, another sample of 150 sentences was taken from RD, so as to aid in determining the reason behind the low impact of the APE rules. What was detected was an inconsistency in the reference sentences, in relation to the 200-sentence sample, which explains why the APE rules failed to improve the rest of the data set. One type of inconsistency was the placement of the word 'SEK'. In the sample's reference sentences the word was always placed before the digits, which led to the formulation of a specific regular expression that would add this word in the correct place, since it was missing from the hypothesis sentences. In RD, the word's placement varied, sometimes appearing after the digits and sometimes missing altogether. Another example was the Celsius temperature sign, which was at times '° C' while in other cases it was 'degrees Celsius'. Another type of inconsistency detected was in the word choices or their spelling. The word 'mucous' is generally spelled as such when the word 'membrane' follows, differentiating it from the word 'mucus'. In the reference there were many cases where the spelling was mixed up, not following this rule. The same could be said for the words 'temperature' and 'fever'. In the sample's reference sentences the word 'fever' was not used, with the word 'temperature' taking its place. In RD, an equal use of the two words was detected. Finally, it was also found that some of the words in the APE rules were not present in RD at all, which means that the sample was not as representative of the whole data set as expected.

All of these show that more than one reference translation is needed to counteract such cases, where the translator was not consistent in her translation, or where more than one translator worked on the reference sentences. Lack of consistency can lead to low APE results, precisely because the automatic aspect of the task requires consistency to work.

7. Conclusion and Future Work

In this work, a number of processes regarding machine translation output were undertaken for two different language pairs, Swedish to English (the 1177.se website) and Greek to English (a session of the Europarl corpus). We began the study by measuring the quality of the neural machine translation output for both language pairs using common metrics, namely BLEU, METEOR, and TER. All three metrics indicated poor translation quality, with the Swedish to English language pair scoring lower than Greek to English on BLEU and METEOR, but higher on TER. Since these metrics are not able to point out specific issues and errors of the translation, other metrics, based on syntax, were then calculated. The output and the reference sentences were passed through NLP-Cube to obtain the POS tags along with syntax-specific information, to be used by the metrics. POSBLEU, followed by WPF and WER over various POS tags, was calculated. The first showed a good translation quality, differing from BLEU by 0.4, while the second showed a rather disappointing result. Since the first is purely syntax-based, while the second combines words and their POS tags, we inferred that the hypothesis sentences differed from the reference ones in words, but not syntactically, which led us to believe that this meant instances of synonyms and the use of alternative expressions. The decomposition of WER over POS tags also reinforced this by showing that the four categories that played the most important role in the errors between the sentences were nouns, adjectives, pronouns, and verbs. With these findings in mind, we then moved on to error categorization, manual and automatic, for the following five error categories:
→ inflectional errors
→ reordering errors
→ missing words
→ extra words
→ incorrect lexical choices
A sample of 200 sentences was taken from each language pair to aid in the manual error categorization, which was done in a more flexible and forgiving way compared to the automatic error categorization, which is rather strict. For the automatic error categorization, WER errors and PER errors for the hypothesis and reference sentences were combined. Spearman's ρ and Pearson's r were calculated and showed that the results of the automatic error categorization script correlated well with human judgment. The error categorization step revealed that the most important errors present in the machine translation output were incorrect lexical choices and important words missing from the hypothesis, and these were the ones chosen to be handled by automatic error correction.

For the Greek to English language pair, automatic post-editing was not possible. The reference sentences were examples of an extremely free translation of the

45 original sentence, to the point of meaning distortion. This in combination with the fact that the spoken language present in the sentences lacked a clear pattern of errors, led to an inability of handling them at that step. For the Swedish to English pair, rules were then crafted based on regular expressions, which altered the hypothesis sentences. The corrected sentences of the sample were then compared to the reference sentences, and proved to have a positive effect on them, with all metrics showing an improvement. Seeing how the rules had a positive impact on the sample, they were also applied to the rest of the data. The metrics which were then calculated showed a decrease in most of the metrics, which was puzzling. Another sample was then taken so as to detect the reason behind the failure of the rules. What was found was that not only was the sample not representative of the whole data set, but also showed a continuous incosistency in the reference sentences, where the missing words handled by the rules were placed at different spots, while for the incorrect lexical choices, the words used were often changed, with no concrete reason being visible as to why. For machine translation handling, thus, more than one references is necessary so as to ensure a wholesome comparison and wholesome results, while also making sure that the reference translations are consistent, so as to enable correct and effective automatic post editing. It must be mentioned here, though, that multiple references for the same source text are generally hard to come by, since they are not only time and resource consuming, but also not available for, for example, low-resource languages. This study can be enriched by incorporating many other methods and metrics. Apart from the three common metrics used here, other, heavily used ones, could also be calculated, such as NIST, LEPOR, or ROUGE, which are metrics also well correlated to human judgment. NIST (Doddington, 2002) is a metric which was based on BLEU, but calculates the quality of machine translation output by assigning particular weights to n-grams that are more rare. LEPOR (Han et al., 2012) is derived from a combination of many different metrics used in the field, for example precision, recall, n-gram penalty, and others, while the different ROUGE (Lin, 2004) metrics work by also comparing hypothesis sentence to a number of references. Other metrics based on syntax could be used as well, Liu and Gildea (2005) proposed the use of syntactic trees, constituent labels, and head-modifier de- pendencies, to check the fluency of the hypothesis sentence compared to the reference. It was shown that most of their proposed metrics outperformed BLEU. For these to be calculated, appropriate parsers that perform well are needed for the target and source languages. Use of dependency structures, output by a weighted constraints dependency grammar parser was proposed by Duma et al. (2013), as a syntax oriented metric, also well correlated to human judgment for the language pairs it was tested on. Regarding the error detection step, there is a wide variety of errors that can also be targeted, apart from those examined in this paper. Tezcan et al. (2016b) propose the use of dependency parsing and querying to detect and target grammatical errors, and testing showed that its performance is high. Goto and Tanaka (2017) tried to tackle the problem of untranslated content in neural machine translation by using cumulative attention probability and back translation

probability, showing that each one worked well to detect untranslated content, while a combination of both led to further improvements. The problem of cross-lingual semantic divergence present in neural machine translation output was investigated by Carpuat et al. (2017), who used a cross-lingual textual entailment system to filter out these kinds of sentences from the training data, something which led to an enhancement in the quality of the output. Finally, regarding automatic post-editing, other methods which do not employ regular expressions could be used, to compare their performance with the performance of the method used here. Vu and Haffari (2018) describe a completely automatic procedure based on a neural programmer-interpreter approach, where the programmer component performs edit actions and the interpreter component executes them, but also keeps track of the actions already generated before deciding on the next one. Their results show an improvement of +1 BLEU and -0.7 TER for the German-English language pair, outperforming previous neural models for PE. Negri et al. (2018) also propose a completely automatic online post-editing system, which is based on a trained model and takes user input into account to improve the system's performance, with positive results for generic but not for specialized translations.

A. Pipeline

Example output of evaluating the candidate "Mrs Commissioner, sometimes twenty-four hours is too much time." against the reference "Mister Commissioner, twenty-four hours sometimes can be too much time." The common metrics, followed by the syntax-oriented ones, and finally an error categorization, are visible.

B. HTML

Example of the HTML visualization of the hypothesis "Any linear reduction in subsidies is unreasonable and usually unfair, that is certain." against the reference "A linear reduction in aid is unimaginative and usually unfair, that is apparent.". Words that are considered incorrect lexical choices are given in green.

Bibliography

Bentivogli, Luisa, Arianna Bisazza, Mauro Cettolo, and Marcello Federico (2016). "Neural versus Phrase-Based Machine Translation Quality: a Case Study". In: EMNLP.
Bird, Steven, Edward Loper, and Ewan Klein (2009). Natural Language Processing with Python. O'Reilly Media Inc.

Boroș, Tiberiu, Stefan Daniel Dumitrescu, and Ruxandra Burtica (2018). "NLP-Cube: End-to-End Raw Text Processing With Neural Networks". In: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Brussels, Belgium: Association for Computational Linguistics, Oct. 2018, pp. 171–179. URL: http://www.aclweb.org/anthology/K18-2017.
Bremin, Sofia, Hongzhan Hu, Johanna Karlsson, Anna Prytz Lillkull, Martin Wester, Henrik Danielsson, and Sara Stymne (2010). "Methods for human evaluation of machine translation". In: Proceedings of the Swedish Language Technology Conference. SLTC2010, pp. 47–48.
Burchardt, Aljoscha, Vivien Macketanz, Jon Dehdari, Georg Heigold, Jan-Thorsten Peter, and Philip Williams (2017). "A Linguistic Evaluation of Rule-Based, Phrase-Based, and Neural MT Engines". The Prague Bulletin of Mathematical Linguistics 108 (June 2017). DOI: 10.1515/pralin-2017-0017.
Carpuat, Marine, Yogarshi Vyas, and Xing Niu (2017). "Detecting Cross-Lingual Semantic Divergence for Neural Machine Translation". In: Proceedings of the First Workshop on Neural Machine Translation. Vancouver: Association for Computational Linguistics, Aug. 2017, pp. 69–79. DOI: 10.18653/v1/W17-3209. URL: https://www.aclweb.org/anthology/W17-3209.
Castilho, Sheila, Joss Moorkens, Federico Gaspari, Iacer Calixto, John Tinsley, and Andy Way (2017). "Is Neural Machine Translation the New State of the Art?" The Prague Bulletin of Mathematical Linguistics 108.1, pp. 109–120. URL: https://content.sciendo.com/view/journals/pralin/108/1/article-p109.xml.
Chatterjee, Rajen, Matteo Negri, Marco Turchi, Frédéric Blain, and Lucia Specia (2018). "Combining Quality Estimation and Automatic Post-editing to Enhance Machine Translation output". In: Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers). Association for Machine Translation in the Americas, pp. 26–38. URL: http://aclweb.org/anthology/W18-1804.
Chen, Boxing and Colin Cherry (2014). "A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU". In: June 2014, pp. 362–367. DOI: 10.3115/v1/W14-3346.
Daniel, W.W. (1990). "Applied Nonparametric Statistics". In: Duxbury advanced series in statistics and decision sciences. PWS-KENT, pp. 258–365. ISBN: 9780534981549. URL: https://books.google.se/books?id=ayDFAAAACAAJ.

Denkowski, Michael and Alon Lavie (2011). "Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems". In: The EMNLP 2011 Workshop on Statistical Machine Translation.
Doddington, George (2002). "Automatic Evaluation of Machine Translation Quality Using N-gram Co-occurrence Statistics". In: Proceedings of the Second International Conference on Human Language Technology Research. HLT '02. San Diego, California: Morgan Kaufmann Publishers Inc., pp. 138–145. URL: http://dl.acm.org/citation.cfm?id=1289189.1289273.
Duma, Melania, Cristina Vertan, and Wolfgang Menzel (2013). "A New Syntactic Metric for Evaluation of Machine Translation". In: 51st Annual Meeting of the Association for Computational Linguistics, Proceedings of the Student Research Workshop. Sofia, Bulgaria: Association for Computational Linguistics, Aug. 2013, pp. 130–135. URL: https://www.aclweb.org/anthology/P13-3019.
Gale, William A. and Kenneth W. Church (1993). "A Program for Aligning Sentences in Bilingual Corpora". Comput. Linguist. 19.1 (Mar. 1993), pp. 75–102. ISSN: 0891-2017. URL: http://dl.acm.org/citation.cfm?id=972450.972455.
Goto, Isao and Hideki Tanaka (2017). "Detecting Untranslated Content for Neural Machine Translation". In: Proceedings of the First Workshop on Neural Machine Translation. Vancouver: Association for Computational Linguistics, pp. 47–55. DOI: 10.18653/v1/W17-3206. URL: http://aclweb.org/anthology/W17-3206.
Guzmán, Francisco, Ahmed Abdelali, Irina Temnikova, Hassan Sajjad, and Stephan Vogel (2015). "How do Humans Evaluate Machine Translation". In: Proceedings of the Tenth Workshop on Statistical Machine Translation. Lisbon, Portugal: Association for Computational Linguistics, pp. 457–466. DOI: 10.18653/v1/W15-3059. URL: http://aclweb.org/anthology/W15-3059.
Guzmán, Rafael (2007). "Automating MT post-editing using regular expressions". Multilingual Computing 18.9 (Sept. 2007), pp. 49–52.
Guzmán, Rafael (2008). "Advanced automatic MT post-editing". Multilingual Computing 19.95 (Apr. 2008), pp. 52–57.
Han, Aaron L. F., Derek F. Wong, and Lidia S. Chao (2012). "LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors". In: Proceedings of COLING 2012: Posters. Mumbai, India: The COLING 2012 Organizing Committee, pp. 441–450. URL: http://aclweb.org/anthology/C12-2044.
Kjellin, Martin (2012). "Automatic Generation of Post-Editing Rule Sets from Parallel Corpora". BA Thesis. Uppsala University, June 2012.
Koehn, Philipp (2010). Statistical Machine Translation. Draft of Chapter 13: Neural Machine Translation. Cambridge; New York: Cambridge University Press. Chap. 13. ISBN: 9780521874151.
Koehn, Philipp (2004). "EuroParl: A parallel corpus for statistical machine translation". 5 (Nov. 2004).
Lavie, Alon, Kenji Sagae, and Shyamsundar Jayaraman (2004). "The significance of recall in automatic metrics for MT evaluation". In: Conference of the Association for Machine Translation in the Americas. Springer, pp. 134–143.
Lin, Chin-Yew (2004). "ROUGE: A Package for Automatic Evaluation of Summaries". In: Text Summarization Branches Out. URL: http://aclweb.org/anthology/W04-1013.

Liu, Ding and Daniel Gildea (2005). "Syntactic Features for Evaluation of Machine Translation". In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Ann Arbor, Michigan: Association for Computational Linguistics, pp. 25–32. URL: http://aclweb.org/anthology/W05-0904.
Mostofian, Nasrin (2017). "A Study on Manual and Automatic Evaluation Procedures and Production of Automatic Post-editing Rules for Persian Machine Translation". MA thesis. Uppsala University, June 2017.
Negri, Matteo, Marco Turchi, Nicola Bertoldi, and Marcello Federico (2018). "Online Neural Automatic Post-editing for Neural Machine Translation". In: CLiC-it.
Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu (2002). "BLEU: a Method for Automatic Evaluation of Machine Translation". In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL). Philadelphia: Association for Computational Linguistics, July 2002, pp. 311–318. URL: https://www.aclweb.org/anthology/P02-1040.pdf.
Popović, Maja (2017). "Comparing Language Related Issues for NMT and PBMT between German and English". The Prague Bulletin of Mathematical Linguistics 108.1, pp. 209–220. URL: https://content.sciendo.com/view/journals/pralin/108/1/article-p209.xml.
Popović, Maja, Adrià de Gispert, Deepa Gupta, Patrik Lambert, Hermann Ney, José B. Mariño, Marcello Federico, and Rafael E. Banchs (2006). "Morpho-syntactic Information for Automatic Error Analysis of Statistical Machine Translation Output". In: WMT@HLT-NAACL.
Popović, Maja and Hermann Ney (2006). "Error Analysis of Verb Inflections in Spanish Translation Output".
Popović, Maja and Hermann Ney (2007). "Word Error Rates: Decomposition over Pos Classes and Applications for Error Analysis". In: Proceedings of the Second Workshop on Statistical Machine Translation. StatMT '07. Prague, Czech Republic: Association for Computational Linguistics, pp. 48–55. URL: http://dl.acm.org/citation.cfm?id=1626355.1626362.
Popović, Maja and Hermann Ney (2009). "Syntax-Oriented Evaluation Measures for Machine Translation Output". In: Proceedings of the Fourth Workshop on Statistical Machine Translation. Athens, Greece: Association for Computational Linguistics, pp. 29–32. URL: http://aclweb.org/anthology/W09-0402.
Popović, Maja and Hermann Ney (2011). "Towards Automatic Error Analysis of Machine Translation Output". Comput. Linguist. 37.4 (Dec. 2011), pp. 657–688. ISSN: 0891-2017. DOI: 10.1162/COLI_a_00072. URL: http://dx.doi.org/10.1162/COLI_a_00072.
Rodgers, J.L. and W.A. Nicewander (1988). "Thirteen ways to look at the correlation coefficient". The American Statistician 42.1, pp. 59–66. ISSN: 0003-1305. URL: http://www.jstor.org/stable/2685263.
Snover, Matthew, Bonnie J. Dorr, R Fletcher Schwartz, and Linnea Micciulla (2006). "A Study of Translation Edit Rate with Targeted Human Annotation".
Sreelekha, S, Pushpak Bhattacharyya, and D Malathi (2018). "Statistical vs. Rule-Based Machine Translation: A Comparative Study on Indian Languages". In: International Conference on Intelligent Computing and Applications. Springer, pp. 663–675.

Stymne, Sara (2011). "Blast: A Tool for Error Analysis of Machine Translation Output". In: Proceedings of the ACL-HLT 2011 System Demonstrations. Portland, Oregon: Association for Computational Linguistics, pp. 56–61. URL: http://aclweb.org/anthology/P11-4010.
Stymne, Sara (2013). "Using a Grammar Checker and its Error Typology for Annotation of Statistical Machine Translation Errors". In: The 24th Scandinavian Conference of Linguistics. University of Eastern Finland, pp. 332–344.
Sun, Yanli (2010). "Mining the Correlation between Human and Automatic Evaluation at Sentence Level". In: LREC.
Tezcan, Arda, Veronique Hoste, and Lieve Macken (2016a). "Detecting Grammatical Errors in Machine Translation Output Using Dependency Parsing and Treebank Querying". In: Proceedings of the 19th Annual Conference of the European Association for Machine Translation, pp. 203–217. URL: http://aclweb.org/anthology/W16-3409.
Tezcan, Arda, Veronique Hoste, and Lieve Macken (2016b). "Detecting Grammatical Errors in Machine Translation Output Using Dependency Parsing and Treebank Querying". In: Proceedings of the 19th Annual Conference of the European Association for Machine Translation, pp. 203–217. URL: http://aclweb.org/anthology/W16-3409.
Vu, Thuy-Trang and Gholamreza Haffari (2018). "Automatic Post-Editing of Machine Translation: A Neural Programmer-Interpreter Approach". In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics, Oct. 2018, pp. 3048–3053. URL: https://www.aclweb.org/anthology/D18-1341.
Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Gregory S. Corrado, Macduff Hughes, and Jeffrey Dean (2016). "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation". CoRR abs/1609.08144.
