The Role of Syntax and Semantics in Machine Translation and Quality Estimation of Machine-translated User-generated Content

Rasoul Samad Zadeh Kaljahi

B.Eng., M.Sc.

A dissertation submitted in fulfilment of the requirements for the award of

Doctor of Philosophy (Ph.D.)

to the

Dublin City University School of Computing

Supervisors: Dr. Jennifer Foster Dr. Johann Roturier (Symantec, Ireland)

February 2015

I hereby certify that this material, which I now submit for assessment on the programme of study leading to the award of Ph.D. is entirely my own work, that I have exercised reasonable care to ensure that the work is original, and does not to the best of my knowledge breach any law of copyright, and has not been taken from the work of others save and to the extent that such work has been cited and acknowledged within the text of my work.

Signed:

(Candidate) ID No.:

Date:

Contents

Abbreviations
Abstract
Acknowledgements

1 Introduction
  1.1 Machine Translation
    1.1.1 Research Questions
    1.1.2 Summary and Findings
  1.2 Quality Estimation
    1.2.1 Research Questions
    1.2.2 Summary and Findings
  1.3 The Syntax of Norton Forum Text
    1.3.1 Research Questions
    1.3.2 Summary and Findings
  1.4 Thesis Structure
  1.5 Publications

2 Syntax in Statistical Machine Translation
  2.1 Related Work
  2.2 Data
  2.3 SMT Systems
  2.4 Automatic System Comparison
    2.4.1 Multiple Metrics
    2.4.2 One-to-one Comparison
    2.4.3 Sentence Level Comparison
    2.4.4 N-best Comparison
    2.4.5 Oracle Combination
  2.5 Manual System Comparison
  2.6 Error Analysis
  2.7 Summary and Conclusion

3 Quality Estimation of Machine Translation
  3.1 Related Work
  3.2 Data
    3.2.1 The SymForum Data Set
    3.2.2 The News Data Set
  3.3 Summary and Conclusion

4 Syntax-based Quality Estimation
  4.1 Related Work
  4.2 Syntax-based QE with News Data Set
    4.2.1 Baseline QE Systems
    4.2.2 Syntax-based QE with Tree Kernels
    4.2.3 Syntax-based QE with Hand-crafted Features
    4.2.4 Combined Syntax-based QE
  4.3 Syntax-based QE with SymForum Data Set
    4.3.1 Baseline QE Systems
    4.3.2 Syntax-based QE with Tree Kernels
    4.3.3 Syntax-based QE with Hand-crafted Features
    4.3.4 Combined Syntax-based QE
  4.4 Summary and Conclusion

5 In-depth Analysis of Syntax-based Quality Estimation
  5.1 Related Work
  5.2 Parser Accuracy in Syntax-based QE
    5.2.1 The News Data Set
    5.2.2 The SymForum Data Set
  5.3 Source and Target Syntax in QE
  5.4 Modifying French Parse Trees
    5.4.1 The News Data Set
    5.4.2 The SymForum Data Set
    5.4.3 Parser Accuracy Prediction
  5.5 Summary and Conclusion

6 Semantic Role Labelling of French
  6.1 Related Work
  6.2 Experimental Setup
    6.2.1 Data
    6.2.2 SRL
  6.3 Experiments
    6.3.1 Learning Curve
    6.3.2 Direct Translations
    6.3.3 Impact of Syntactic Annotation
    6.3.4 The Impact of Word Alignment
    6.3.5 Quality vs. Quantity
    6.3.6 How little is too little?
  6.4 Summary and Conclusion

7 Semantic-based Quality Estimation
  7.1 Related Work
  7.2 Data and Annotation
  7.3 Semantic-based QE with Tree Kernels
    7.3.1 Dependency Tree Kernels
    7.3.2 Constituency Tree Kernels
    7.3.3 Combined Constituency and Dependency Tree Kernels
  7.4 Semantic-based QE with Hand-crafted Features
  7.5 Predicate-Argument Match (PAM)
    7.5.1 Word Alignment-based PAM (WAPAM)
    7.5.2 Lexical Translation Table-based PAM (LTPAM)
    7.5.3 Phrase Translation Table-based PAM (PTPAM)
    7.5.4 Analysing PAM
    7.5.5 PAM Scores as Hand-crafted Features
  7.6 Combined Semantic-based QE System
  7.7 Syntactico-semantic-based QE System
  7.8 Summary and Conclusion

8 The Foreebank: Forum Treebanks
  8.1 Related Work
  8.2 Building the Foreebank
    8.2.1 Handling Preprocessing Problems
    8.2.2 Annotating Erroneous Structures
  8.3 Analysing the Foreebank
    8.3.1 Characteristics of the Foreebank Text
    8.3.2 Characteristics of the Foreebank Annotations
    8.3.3 User Error Rate
  8.4 Parser Performance on the Foreebank
  8.5 The Effect of User Error on Parsing
  8.6 Improving Parsing of Forum Text
  8.7 Semantic-based QE with Improved Parses
    8.7.1 The Word Alignment-based PAM
    8.7.2 The Semantic-based QE System
  8.8 Summary and Conclusion

9 Conclusion and Future Work
  9.1 Summary and Findings
  9.2 Contributions
  9.3 Future Work

Bibliography

Appendix A Quality Estimation Results

List of Figures

1.1 Example of machine translation of an English Norton forum sentence containing an error (missing are) to French

3.1 Fine-grained word translation error categories of the WMT 2014 QE shared task (adapted from Bojar et al. (2014))
3.2 Human-targeted score distribution histograms of the SymForum data set
3.3 Adequacy and fluency score distribution histograms of the SymForum data set, for each evaluator and for their average
3.4 Score distribution histograms of the News data set

4.1 Tree kernel representation of the dependency structure for And then the American era came

5.1 Parse tree of the machine translation of Dark Matter Affects Flight of Space Probes to French
5.2 Application of tree modification heuristics on example French translation parse trees

6.1 Semantic role labelling of the WSJ sentence: The index rose 1.1% last month
6.2 Learning curve with 100K training data of projected annotations
6.3 Learning curve with 100K training data of projected annotations on only direct translations

7.1 Two variations of dependency PAS formats for the sentence: Can anyone help?
7.2 Two variations of dependency PST formats for the sentence: Can anyone help?
7.3 Two variations of dependency SAS formats for the sentence: Can anyone help?
7.4 Constituency tree for the sentence Can anyone help? and the PST and SAS formats extracted from it. The minimal difference between (b) and (c) is specific to this example, where the proposition subtree spans the majority of the constituency tree nodes
7.5 LTPAM scoring algorithm

8.1 Semantic role labelling of the sentence (-WFP- profile -WFP) Delete the three files and then restart Firefox. and its machine translation by the SRL systems in CoNLL-2009 format: 1) profile is mistakenly labelled as the predicate due to a wrong POS tag, 2) Delete is missed by the SRL system due to a wrong POS tag, losing the match with supprimez despite a correct alignment
8.2 Annotation of Example 1 (the sentences are not fully shown due to space limit)
8.3 The Foreebank annotation of the sentence AutoMatic Log Inns Norton 360 3.0 corrected as Automatic Logins Norton 360 3.0

List of Tables

2.1 Evaluation scores on in-domain development set (translation memory)
2.2 Evaluation scores on out-of-domain development set (forum text)
2.3 One-to-one BLEU scores on in-domain development set (translation memory data)
2.4 One-to-one BLEU scores on out-of-domain development set (forum text)
2.5 Sentence-level TER-based system comparison (on 2000 in-domain and 600 out-of-domain development set samples)
2.6 500-best overlaps: number and percentage of sentences having a common translation in their 500-best list as well as the average number of common 500-best translations per sentence across data sets (for 2000 in-domain and 600 out-of-domain development set samples)
2.7 Baseline and oracle system combination scores on in-domain development set (translation memory)
2.8 Baseline and oracle system combination scores on out-of-domain development set (forum text)
2.9 Manual evaluation results: number of errors by each system on each data set; the lower number of errors is marked in boldface for each category/setting
2.10 Example of verb tense translation by two systems
2.11 Example of output word order of two systems

3.1 WMT 2012 17 baseline features
3.2 Characteristics of the source and machine-translated side of the SymForum data set
3.3 Human-targeted evaluation scores for the SymForum data set at the document level, segment level average and standard deviation (SD)
3.4 Adequacy and fluency score interpretation
3.5 Manual evaluation scores for the SymForum data set (segment level average and standard deviation (SD))
3.6 Pearson r between pairs of metrics on the entire 4.5K data set
3.7 Automatic evaluation scores for the News data set at document level, segment level average and standard deviation (SD)
3.8 Characteristics of the source and machine-translated side of the News data set

4.1 Baseline system performances measured by Root Mean Square Error (RMSE) and Pearson correlation coefficient (r)
4.2 QE performance with syntactic tree kernels SyTK and their combination with WMT baseline features B+SyTK
4.3 QE performance with all hand-crafted syntactic features SyHC-all and the reduced feature set SyHC on the development set. Statistically significantly better scores compared to their counterpart (same column and the upper row) are in bold
4.4 Features in the reduced feature set
4.5 QE performance with all hand-crafted syntactic features SyHC-all and the reduced feature set SyHC on the test set
4.6 QE performance with the full syntax-based system (SyQE) and its combination with WMT baseline features on the News data set
4.7 Baseline system performances on the SymForum test set
4.8 QE performance with syntactic tree kernels SyTK and their combination with WMT baseline features B+SyTK on the SymForum test set
4.9 QE performance with hand-crafted features SyHC and their combination with WMT baseline features B+SyHC on the SymForum test set
4.10 QE performance with the full syntax-based system (SyQE) and its combination with WMT baseline features on the SymForum data set

5.1 Parser F1s for various training set sizes: the sizes in bold are selected for the experiments
5.2 QE performance with systems built using parses of higher- (H subscripts) and lower-accuracy (L subscripts) parsing models
5.3 Tree kernel QE systems with higher- and lower-accuracy trees (C: constituency, D: dependency, ST: Source and Translation, H: Higher-accuracy parsing model, L: Lower-accuracy parsing model)
5.4 Hand-crafted QE systems with higher- and lower-accuracy trees (C: constituency, D: dependency, ST: Source and Translation, H: Higher-accuracy parsing model, L: Lower-accuracy parsing model)
5.5 UAS and LAS of lower- and higher-accuracy French dependency parsers
5.6 QE performance with systems built using parses of higher- (H subscripts) and lower-accuracy (L subscripts) parsing models
5.7 Tree kernel QE performances on the News data set with only source (English) or translation (French) side trees (S: source, T: translation)
5.8 Tree kernel QE performances for the French-English direction on the News data set (FE: French to English, C: constituency, D: dependency, S: source, T: translation)
5.9 Tree kernel QE performance with modified French trees (m: modified trees)
5.10 Tree kernel QE performance with modified French trees (m: modified trees)
5.11 Parser Accuracy Prediction (PAP) performance with tree kernels using original and modified French trees (m)

6.1 The counts of the ten most frequent arguments in the source and target side of a 5K projected sample and their ratios
6.2 SRL performance using different syntactic parses with Classic 5K and 50K training sets
6.3 Projecting English SRL from the source side of the Classic 1K data to the target side using various alignments
6.4 Training French SRL on projected English SRL from the source side of the Classic 5K data to the target side using various alignments (Intsc: intersection, S2T: source-to-target)
6.5 Average scores of 5-fold cross-validation with Classic 1K, 5K, 1K plus 5K and self-training with 1K seed and 5K unlabelled data (SelfT)
6.6 Average scores of 5-fold cross-validation with Classic 1K and 5K using 200 sentences for training and 800 for testing at each fold

7.1 CoNLL-2009 data used for training English SRL
7.2 Performances of the English SRL system on various data sets and the French SRL system using 5-fold cross-validation
7.3 Number of predicates and arguments extracted from the SymForum data set by the English and French SRL systems
7.4 Dependency tree kernel systems
7.5 Constituency tree kernel systems
7.6 Combined constituency and dependency tree kernel systems
7.7 Original semantic feature set to capture the predicate-argument correspondence between the source and target; each feature is extracted from both source and target, except feature number 9, which is based on the word alignment between source and target
7.8 QE system with hand-crafted semantic features
7.9 Performance of PAM metric scores using word alignments (WAPAM)
7.10 Performance of PAM metric scores using the lexical translation table (LTPAM)
7.11 Performance of PAM metric scores using the filtered lexical translation table (LTPAM)
7.12 Performance of PAM metric scores using the phrase translation table (PTPAM)
7.13 Performance of PAM metric scores using the filtered phrase translation table (PTPAM)
7.14 Results of manual analysis of problems hindering PAM scoring accuracy
7.15 Performance of WAPAM scores as features, alone (SeHCpam) and combined (SeHC+pam)
7.16 Performance of the semantic-based QE system and its combination with the baseline
7.17 Performance of the syntactico-semantic QE system and its combination with the baseline

8.1 Characteristics of the English and French Foreebank corpora compared with those of the WSJ and FTB. The OOV rates are computed with respect to the WSJ and FTB for the English and French Foreebank respectively
8.2 Number and percentage of tag suffixes in the English and French Foreebank annotation
8.3 Number of user errors and the edit distance between the original and edited Foreebank sentences; Ins: inserted (extraneous), Del: deleted (missing), Sub: substituted, Total: Ins+Del+Sub, anED: average normalised edit distance
8.4 Comparison of parsing WSJ and forum text using a WSJ-trained parser
8.5 Comparison of parsing FTB and forum text using an FTB-trained parser
8.6 Comparison of parsing original and edited forum sentences
8.7 Using the English Foreebank as training data, both alone and as a supplement to the WSJ, and also supplemented by the English Web Treebank (EWT), evaluated in a 5-fold cross-validation setting on the Foreebank
8.8 Using the French Foreebank as training data, both alone and as a supplement to the FTB, evaluated in a 5-fold cross-validation setting on the Foreebank
8.9 Replicating the English Foreebank in the training data of the parsers (highest scores in bold)
8.10 Replicating the French Foreebank in the training data of the parsers (highest scores in bold)
8.11 Performance of word alignment-based PAM (WAPAM) using the old (in grey) and new SRLs
8.12 Semantic-based QE systems using the old (SeQE) and new SRLs (SeQEnew); underlined scores are statistically significantly better
8.13 Semantic tree kernel and hand-crafted QE systems using the old (in grey) and new SRLs (subscripted with new); underlined scores are statistically significantly better

A.1 Quality estimation results using the SymForum data set
A.2 Quality estimation results using the News data set

Abbreviations

BLEU  Bilingual Evaluation Understudy
CFG  Context-free Grammar
ETTB  English Translation Treebank
EWTB  English Web Treebank
FTB  French Treebank
GTM  General Text Matcher
HBLEU  Human-targeted Bilingual Evaluation Understudy
HMeteor  Human-targeted Meteor
HP  Hierarchical Phrase-based Machine Translation
HTER  Human-targeted Translation Error Rate
LAS  Labelled Attachment Score
LTPAM  Lexical Translation Table-based Predicate-Argument Structure Match
MERT  Minimum Error Rate Tuning
MT  Machine Translation
NANC  North American News
NLP  Natural Language Processing
OOV  Out-of-vocabulary
PAM  Predicate-Argument Structure Match
PB  Phrase-based Machine Translation
PCFG  Probabilistic Context-Free Grammars
PCFG-LA  Probabilistic Context-Free Grammars with Latent Annotations
POS  Part-of-speech
PTB  Penn Treebank
PTPAM  Phrase Translation Table-based Predicate-Argument Structure Match
QE  Quality Estimation
RMSE  Root Mean Square Error
SMT  Statistical Machine Translation
SRL  Semantic Role Labelling
ST  String-to-tree Machine Translation
SVM  Support Vector Machine
TER  Translation Error Rate
TS  Tree-to-string Machine Translation
TT  Tree-to-tree Machine Translation
UAS  Unlabelled Attachment Score
UGC  User-generated Content
WAPAM  Word Alignment-based Predicate-Argument Structure Match
WSJ  Wall Street Journal

The Role of Syntax and Semantics in Machine Translation and Quality Estimation of Machine-translated User-generated Content

Rasoul Samad Zadeh Kaljahi

Abstract

The availability of the Internet has led to a steady increase in the volume of online user-generated content, the majority of which is in English. Machine-translating this content to other languages can help disseminate the information contained in it to a broader audience. However, reliably publishing these translations requires a prior estimate of their quality. This thesis is concerned with the statistical machine translation of Symantec's Norton forum content, focusing in particular on its quality estimation (QE) using syntactic and semantic information.

We compare the output of phrase-based and syntax-based English-to-French and English-to-German machine translation (MT) systems automatically and manually, and find that the syntax-based methods do not necessarily handle grammar-related phenomena in translation better than the phrase-based methods. Although these systems generate sufficiently different outputs, the apparent lack of a systematic difference between these outputs impedes their utilisation in a combination framework.

To investigate the role of syntax and semantics in quality estimation of machine translation, we create SymForum, a data set containing French machine translations of English sentences from Norton forum content, their post-edits and their adequacy and fluency scores. We use syntax in quality estimation via tree kernels, hand-crafted features and their combination, and find it useful both alone and in combination with surface-driven features. Our analyses show that neither the accuracy of the syntactic parses used by these systems nor the parsing quality of the MT output affects QE performance. We also find that adding more structure to French Treebank parse trees can be useful for syntax-based QE.

We use semantic role labelling (SRL) for our semantic-based QE experiments. We experiment with the limited resources that are available for French and find that a small manually annotated training set is substantially more useful than a much larger artificially created set. We use SRL in quality estimation using tree kernels, hand-crafted features and their combination. Additionally, we introduce PAM, a QE metric based on the predicate-argument structure match between source and target. We find that the SRL quality, especially on the target side, is the major factor negatively affecting the performance of the semantic-based QE.

Finally, we annotate English and French Norton forum sentences with their phrase structure syntax using an annotation strategy adapted for user-generated text. We find that user errors occur in only a small fraction of the data, but their correction does improve parsing performance. These treebanks (Foreebank) prove to be useful as supplementary training data in adapting the parsers to the forum text. The improved parses ultimately increase the performance of the semantic-based QE. However, a reliable semantic-based QE system requires further improvements in the quality of the underlying semantic role labelling.

Acknowledgements

This thesis would not have been possible without the valuable support of many people. I would like to take this opportunity to express my gratitude for all the valuable help I received from them.

I have been very lucky to be involved in the ConfidentMT project, where I met and was supervised by a team of experienced researchers and wonderful people. First and foremost, I would like to thank my supervisor, Dr. Jennifer Foster, for her outstanding supervision. The amount of support and help I received from her cannot be expressed in words. Thank you Jennifer for your patience, understanding and kindness. I would also like to thank Dr. Johann Roturier, my industrial supervisor, for his valuable advice and cooperation. He always helped me to continue on the right track. I am also grateful to Dr. Fred Hollowood, who mentored me throughout the PhD work and supported the creation of two invaluable data sets, SymForum and Foreebank, for my studies and for the research community. My special thanks goes to Dr. Raphael Rubino for his valuable comments on my work, the things I learned from him and the encouragement he gave me when I needed it. And my special thanks to Dr. Matthew Elder for being so kind and extremely supportive during his direction of our research group at Symantec Research Labs.

I am thankful to the Irish Research Council and Symantec Ireland for co-funding my study through the ConfidentMT project. I would also like to thank CNGL (now ADAPT) for the infrastructure they provided in the PhD lab and for supporting my PhD at its final stages. Many thanks to Dr. Joachim Wagner, who always provided help and support for using the CNGL cluster with patience.

I would also like to express my gratitude to my examiners, Dr. Lluís Màrquez and Prof. Qun Liu. Their insightful and constructive comments helped improve this thesis further. I am also grateful to Prof. Andy Way, the chair of my defence session, who made sure that everything went smoothly during the viva. Special thanks to Prof. Josef van Genabith for two things: first for connecting me to the ConfidentMT project and second for his valuable comments as the examiner of my transfer talk. I also thank Dr. Alistair Sutherland for providing useful suggestions as the second transfer report examiner.

In these three years of study, I spent my time working beside many amazing people at the CNGL/NCLT lab in DCU and at the Symantec Research Labs. I learned many things from them during our enjoyable and refreshing lunchtime chats and discussions, and I received enormous help and inspiration from them. Thank you all, guys, and apologies that I cannot name all of you one by one.

And now is the moment to thank the one who never left me alone, even for a day, during these long (though short) years, and always gave me inspiration and encouragement. I am indebted to my fiancée, Shabnam, who will soon be my wife; thank you a universe for all your patience and sacrifices. I would like to thank my family, who always offered me their unconditional support and who are such an amazing father, mother and brother; I would never be here without you. I am grateful to my father-, mother-, brother- and sister-in-law for their kindness, understanding and patience, and for raising such a wonderful daughter. Many thanks to my uncles and aunties who never forgot me and always offered me their sincere help and support.

Last but not least, I would like to thank all my lifelong friends for their faithful friendship and for all the encouragement and support I received from them. Special thanks to Mustafa Hijrat, my best friend, who helped me shape the decision to continue graduate studies after years of working in industry; the right decision for me!

Chapter 1

Introduction

As the popularity and availability of the Internet, and the Web in particular, increases, more and more people are contributing to the ever-growing body of online content. Newsgroups, discussion forums, social networks, weblogs, microblogs and consumer reviews are examples of online media which contain content generated by users, commonly known as user-generated content or UGC. For instance, the customer service model is moving away from the traditional way of providing help via phone lines to one in which customers help each other via forums. The work presented in this thesis takes place in the context of a wider project, ConfidentMT, the ultimate goal of which is to improve and estimate the quality of machine translation of user-generated content from Symantec Norton forums,1 where Norton users, including Symantec technical support employees, discuss the Norton security products. Although Norton forums exist for multiple languages including English, French and German, they contain far more content in English than in the other languages. Reliable machine translation of the English content to French and German, as the next most popular languages of the Norton forums, can help disseminate the potentially valuable information to more users. However, in order for these translations to be published, an estimate of their reliability is required. Figure 1.1 presents an example of a machine translation of a sentence from the English Norton forum into French, where the missing are in the source has led to an incorrect translation, which is not suitable for publishing.

1 http://community.norton.com

Source: The main 2 websites listed below:
Machine translation: La principale 2 de sites Web répertoriés ci-dessous:
Corrected translation: Les 2 principaux sites Web répertoriés ci-dessous:

Figure 1.1: Example of machine translation of an English Norton forum sentence containing an error (missing are) to French

To this end, the project targets two problems: 1) improving the quality of machine translation to achieve reliable translations and 2) automatically estimating the quality of these translations, known as quality estimation or QE, to measure this reliability. This thesis focuses on addressing these problems from a more specific perspective. In particular, it investigates the use of syntactic knowledge in statistical machine translation of the Norton forum text and both syntactic and semantic knowledge in its quality estimation. The aim of using syntax and semantics in these tasks is to employ a deeper level of analysis of the text and its translations rather than relying on shallow surface-driven information.

In the rest of this chapter, we first introduce each of these problems, the motivation behind the chosen approaches as well as the fundamental research questions addressed in this thesis. In addition, we provide an overview of our methodological approaches and the findings of the experiments. Section 1.1 is dedicated to machine translation and Section 1.2 to quality estimation. In Section 1.3, we introduce our study on the syntax of Norton forum text. Section 1.4 presents the structure of the thesis. Finally, Section 1.5 lists the publications resulting from this research.

1.1 Machine Translation

Machine translation was introduced in the middle of the last century (Pierce and Carroll, 1966). The early machine translation systems were based on linguistic analysis of the source and target languages, where rules were developed to map the morphological, syntactic and semantic structures of the source to the target language. The extraction of these rules was labour-intensive and required linguistic expertise, limiting their coverage. These shortcomings led to the emergence of data-driven machine translation, in which such rules are extracted from parallel corpora. Statistical machine translation (SMT) is the most well established data-driven translation paradigm, which emerged with the introduction of the IBM Models by Brown et al. (1988) and has since been extensively studied. The IBM Models are based on the alignment between the source and target words in a parallel corpus and ignore the syntactic structure of both languages. Current statistical machine translation methods extend these models to the phrase level. The alignment template model proposed by Och et al. (1999) takes the word context into account and handles local word order in translation by aligning sequences of adjacent words, i.e. shallow phrases, rather than single words. Koehn et al. (2003) show that translation with even short phrases of three words outperforms word-based translation. However, such phrases are merely sequences of strings and do not necessarily represent linguistic structures. Several attempts have been made to design methods which incorporate syntactic knowledge into statistical machine translation using various approaches such as syntactic noisy channel translation (Yamada and Knight, 2001), syntactic transformation rule extraction (Galley et al., 2004), dependency treelet translation (Xiong et al., 2007), forest-based translation (Mi et al., 2008) and, more recently, syntax as a soft constraint (Chiang, 2010; Zhang et al., 2011). These syntax-based SMT methods allow the structural mapping between languages to be captured. Once captured, such differences can account for phenomena such as long-distance reordering, which are known to be a deficit of the ad-hoc phrase-based models.

Despite all these efforts, studies show that syntax-based machine translation methods do not necessarily work better. One of the fundamental factors affecting the performance of such methods appears to be the language pair to which they are applied. It has been suggested that structurally distant language pairs tend to benefit from syntax-based methods while similar languages are better translated with phrase-based methods (Galley et al., 2004; Zollmann and Venugopal, 2006). In this thesis we bring syntax-based SMT under closer scrutiny, in order to understand its performance on the target language pairs and data of the project, i.e. English-French and English-German translation of the Norton forum text (Chapter 2). Not only does this comparison shed light on the utility of syntax in machine translation, but finding such differences can also help in combining these various approaches to benefit from the advantages of each. Using French and German, two target languages that differ in their structural similarity to English, will better help understand the effect of this factor on the performance of syntax-based methods. The next section explains the research questions we specifically try to address here.

1.1.1 Research Questions

We systematically look at the differences between phrase-based and syntax-based SMT methods in translating forum content by analysing their output both automatically and manually. We seek to answer the following questions:

• How different are the outputs generated by each of these methods?

• Can the outputs be beneficially combined in theory?

• Are any differences between the two types of systems systematic enough to be exploited in system combination?

The next section summarises the experiments carried out to find answers for these questions as well as the findings.

1.1.2 Summary and Findings

We investigate the use of syntax in statistical machine translation by comparing phrase-based and syntax-based machine translation methods. The phrase-based approaches include regular (called phrase-based henceforth) and hierarchical phrase-based models, and the syntax-based approaches include tree-to-string, string-to-tree and tree-to-tree methods, all implemented in the Moses machine translation toolkit (Hoang et al., 2009). All the translation and language models are trained on data from Symantec translation memories, which contain a mixture of Symantec content from product manuals, software strings, marketing materials, knowledge bases and websites translated from English to French and German. The performance of syntax-based systems may be influenced by the quality of the underlying automatic syntactic analysis (Quirk and Corston-Oliver, 2006). When applied to the user-generated Norton forum text, the problem can be further exacerbated by the fact that commonly used statistical syntactic parsers are trained on edited newswire text and do not generalize well to unedited text from a different domain (Foster et al., 2011a). To account for this factor in our analysis, we evaluate the SMT approaches under comparison on both well-formed content drawn from Symantec software documentation and user-generated content from Norton forums.

The systems are compared both automatically and manually. The automatic comparison involves comparing the outputs produced by the translation systems to discover whether different methods generate different outputs for the same sentences, so that the best of each can be exploited by combining them. Several methods based on evaluation using automatic metrics such as BLEU and TER are used at both the document and sentence level. We find noticeable differences among the outputs of different methods, and their oracle combination proves to be significantly fruitful. We extend the automatic comparison of systems with a manual evaluation procedure, to discover which system is good at handling which translation phenomena. We find that the syntax-based systems do not produce notably more grammatical output in terms of, for example, agreement or word order. Our findings show that despite a significant potential to improve translation quality through the combination of these methods, a systematic difference between them cannot be found.
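As an illustration of the oracle combination used in this comparison, the sketch below keeps, for each sentence, the system output with the lowest sentence-level TER against the reference. It is only a sketch of the idea under simple assumptions: the sentence_ter callable stands in for an actual TER implementation, and the data layout is invented for illustration rather than taken from the experimental setup of Chapter 2.

```python
def oracle_combination(candidates, references, sentence_ter):
    """Per-sentence oracle over several MT systems.

    candidates: dict mapping system name -> list of hypothesis translations
    references: list of reference (or post-edited) translations, same order
    sentence_ter: callable(hypothesis, reference) -> TER score (lower is better)
    Returns the oracle translations and the system chosen for each sentence.
    """
    systems = list(candidates)
    oracle_output, chosen = [], []
    for i, ref in enumerate(references):
        # Score every system's hypothesis for this sentence and keep the best.
        scored = [(sentence_ter(candidates[s][i], ref), s) for s in systems]
        _, best_system = min(scored)
        oracle_output.append(candidates[best_system][i])
        chosen.append(best_system)
    return oracle_output, chosen
```

The distribution of the chosen systems is what makes such an oracle informative: if the selections were concentrated on one system or followed an obvious pattern, that pattern could be exploited by a real combination strategy.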

1.2 Quality Estimation

Regardless of the method used for machine translation of forum text, publishing the translated text requires confidence in the quality of those translations. However, evaluating these translations manually is costly, time-consuming and perhaps in contrast to the motivation of machine translation, which is to reduce these costs (Roturier and Bensadoun, 2011). Therefore, the ability to automatically estimate the quality of these translations will assist in confidently publishing them. In addition, this estimation can provide a means to select the best of the translations produced by different systems, i.e. to combine machine translation systems, which is also envisioned in the comparison of SMT methods described in the previous section. The performance of current quality estimation methods, however, is not sufficient for this application. While machine translation has consistently been receiving considerable attention, we find a bigger gap in quality estimation research and application. Inspired by this need, the main focus of this thesis is on quality estimation. We investigate methods in quality estimation (QE) systems for machine translation. For this purpose, we create a data set containing sentences selected from English Norton user forums and their machine translations into French, together with the human post-edits of the translations as well as their human evaluation scores (Chapter 3). This data set is used to train and test the QE approaches examined in the thesis.

The quality of machine translation is a multi-faceted concept. Fluency and adequacy are two important aspects, where the former indicates how fluently the translation can be read and the latter how much of the meaning intended by the source utterance it retains. The fluency of a sentence is related to its syntactic construction. A grammatically well-formed sentence is usually read more fluently than its ungrammatical equivalent. Therefore, the syntactic analysis of the translation can provide information useful in estimating its quality. In addition to the translation, the syntactic construction of the source sentence can be predictive of the quality of its translation, by capturing the complexity of the source sentence, which is usually correlated with the difficulty of its translation. Besides these characteristics, syntax can also help quality estimation through utilising the syntactic correspondence between the source and target languages. In this thesis, we investigate the use of syntactic information in quality estimation using various methods (Chapter 4). We additionally conduct an in-depth analysis of various aspects of our syntax-based QE approaches (Chapter 5).

As to translation adequacy, it can be argued that it is tied to the semantics of both the source and its translation. Intuitively, the degree of similarity between the semantic analysis of the source and target can indicate the adequacy of the translation. Semantic role labelling (SRL) is a type of shallow semantic analysis which represents the predicate-argument structure of a sentence. Although semantic role labelling does not fully represent the semantics of the sentence and thus cannot account for its entire meaning, it provides a useful aspect of the meaning by identifying who did what to whom, why, when, where, etc. We experiment with different approaches to using information extracted from the semantic role labelling of the source and its translation in estimating the translation quality (Chapter 7). SRL is chosen as it is well studied and has been shown to be useful in various NLP tasks such as MT (Wu and Fung, 2009; Liu and Gildea, 2010), its evaluation (Giménez and Màrquez, 2007; Lo and Wu, 2011) and quality estimation itself (Lo et al., 2014; Pighin and Màrquez, 2011). Although there are appropriate resources for semantic role labelling of English, French suffers from the lack of adequately sized resources. Only a few researchers have studied French SRL and only a small set of sentences hand-annotated with semantic role labels currently exists for this language. Alternatively, researchers have tried to use unsupervised machine learning methods or to artificially generate SRL resources using parallel corpora for English and French. We conduct a set of experiments to find the best feasible solution to partially alleviate this problem (Chapter 6). The following section states the research questions we tackle with regard to each of the studies outlined above.

1.2.1 Research Questions

We design and experiment with various methods of using syntactic knowledge in quality estimation in order to verify its usefulness for this purpose and understand the role of various phenomena involved in this process. We specifically seek to answer the following questions:

• How effective is syntactic information in quality estimation of machine translation both in comparison and in combination with other surface-driven features?

• Does parsing accuracy affect the performance of syntax-based QE?

• To what extent do the source and target syntax each contribute to the syntax- based QE performance?

• Does parsing of noisy machine translation output affect the performance of syntax-based QE?

Our experiments on semantic role labelling of French aim at finding an approach which can compensate for the lack of resources for this purpose. To accomplish this goal, we try to answer the following questions:

• How much artificial data is needed to train an SRL system?

• Is there a way to improve the projected annotation?

• Is a large set of this artificial data better than a small set of hand-annotated data for training an SRL system?

Similar to the syntax-based QE experiments, we investigate various methods of applying semantic information captured by semantic role labelling in estimating the quality of machine translation and analyse the results to find the problems involved in this approach. We ask the following questions:

• What is the most effective method of incorporating this semantic knowledge in QE?

• To what extent does the semantic predicate-argument structure match between source and target represent the translation quality?

• How effective is semantic role labelling, in general, in quality estimation of machine translation both in comparison and in combination with other surface-driven features as well as the syntactic information?

• What are the factors hindering the performance of semantic-based QE?

The next section presents a summary of the quality estimation experiments and their findings.

1.2.2 Summary and Findings

We create two data sets to be used in the quality estimation experiments carried out in this thesis. The first data set, called SymForum, is in the target domain of the project and built using sentences from Symantec's English Norton forum machine-translated to French. These machine translations are both post-edited by human translators to obtain human-targeted automatic evaluation scores (Snover et al., 2006) and evaluated to obtain human fluency and adequacy scores. The second data set is News, which is built using sentences from the News development data set of WMT 2013 (Bojar et al., 2013). These sentences are machine translated using the same systems as the SymForum sentences (described in Section 3.2.1 of Chapter 3). However, these translations are evaluated using three automatic metrics, BLEU, TER and Meteor, against the available reference translations to generate quality scores. This data set is created in the same domain on which the available syntactic parsers are trained. The purpose of this data set is to factor out the problems resulting from out-of-domain parsing from the syntax-based QE experiments, so that the conclusions drawn are not affected by such problems. For both data sets, statistics such as metric correlations and score distributions are extracted. For the SymForum data set, the inter-annotator agreement is calculated and analysed.

Using these data sets, we investigate the use of syntax in quality estimation. We compare two different machine learning methods for incorporating syntactic information in quality estimation: tree kernels (Collins and Duffy, 2002; Moschitti, 2006), an effective method to learn from parse trees which is fast to deploy, and hand-crafted features, a computationally efficient method which offers more design flexibility. The combination of these two methods is also used in building a fully syntactic QE system. With each method, we use both constituency (phrase structure) and dependency parses of the source and target. The performance of the QE systems built using these approaches is compared to a baseline using the surface-oriented baseline features introduced in the WMT 2012 shared task on quality estimation (Callison-Burch et al., 2012). Additionally, the syntax-based systems are combined with this baseline to examine the complementarity between these two types of information. According to the results, syntax-based QE systems significantly outperform the baseline and can also be successfully combined with the baseline features for the News data set. However, they are not always better than the baseline when applied to the SymForum data set.

We closely analyse our syntax-based quality estimation methods from different perspectives. Specifically, we examine the effect of parser accuracy on the performance of syntax-based QE systems built upon those parses, as we expect noisy parses for the user-generated and machine-translated QE data used here. The parser accuracy is artificially varied by varying the size of the training set of the parser. We find that the syntax-based QE systems are robust to large drops in parsing accuracy. This suggests that, rather than intrinsic measures of parse quality, the intra-document inconsistency of the parses due to inconsistent and noisy language in the forum text may account for the performance gap we observe between syntax-based QE on the News and SymForum data sets. In addition, we investigate the parts played in syntax-based QE by the syntax of the source and target separately. We find that French constituency parse trees are less useful than the English ones, no matter whether they are extracted from the target (as in the original translation direction) or from the source (as in the reversed translation direction). We also design a set of heuristics which modify the French parse trees by adding more structure, based on the known deficits of the French Treebank (Abeillé et al., 2003) used to produce these parse trees. The modified trees are significantly more useful than the original ones in the syntax-based QE systems, especially when using the News data set.

We next turn our attention to semantic-based quality estimation, based on the semantic role labelling of the source and target. However, as explained earlier, semantic role labelling of French, as the target language, is challenging since there are limited resources for training a reliable SRL model for this language. We therefore first conduct a series of experiments to build an optimum semantic role labelling system for French with the limited available resources. The experiments are designed based on the idea of projecting the automatic SRL annotation from the English side of a large parallel corpus to its French side using the word alignment between them and using the resulting synthetic data for training an SRL system, as carried out by van der Plas et al. (2011) using the Europarl corpus (Koehn, 2005). Attempting to improve the performance of an SRL system trained on the projected annotations, we investigate a variety of methods. These methods include using only direct translation projections between English and French for training and replacing the original POS tags and dependency labels with more coarse-grained universal POS tags and dependency labels. These variations are not shown to be effective. In addition, we compare different word alignments in projecting the annotation. We find that restrictive word alignments aiming at less noisy projected annotations substantially reduce recall and are not preferable over less restrictive ones. We finally compare the use of the large training data obtained by projection with a much smaller set of manually annotated data. It turns out that the latter leads to substantially better performance. Therefore, this setting is selected for training the models used to label the French data in the semantic-based QE experiments.

With the semantic role labelling of the data in hand, we investigate the use of semantics in quality estimation. Similar to the syntax-based QE systems, we compare the performance of tree kernels and hand-crafted features as well as their combination for this purpose. In addition, we introduce PAM, a new metric which uses the predicate-argument structure match between the source and target as a measure of translation quality. Several variations of this metric are presented, differing in the methods used to match the source and target predicates and arguments, including word alignments as well as lexical and phrase translation tables. Manual analysis of the PAM scores shows that the low quality of semantic role labelling is the main factor impeding its accuracy. The PAM scores prove to be more effective when used as features in a machine learning setup and added to the other hand-crafted features. Although the semantic-based QE system performs slightly better than the syntax-based system, it is outperformed by the WMT baseline for some settings. We combine it with the syntax-based system and in turn with the baseline features. The best results are obtained when the baseline features are combined with the semantic-based features and semantically augmented tree kernels. However, we believe that a higher quality of semantic role labelling is required in order for these approaches to be useful. It is important for there to be a balance between the quality of the source and target annotation.
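The intuition behind the word alignment-based variant of PAM can be conveyed with a simplified sketch: project each labelled source argument through the word alignment and check whether it lands inside the same role of an aligned target predicate. The data structures and the precision-style score below are assumptions made for illustration; the actual metric and its variants are defined in Chapter 7.

```python
def simple_pam(src_props, tgt_props, alignment):
    """Toy word-alignment-based predicate-argument match score.

    src_props / tgt_props: lists of (predicate_index, {role: set of token indices})
    alignment: set of (source_index, target_index) word-alignment pairs
    Returns the fraction of source roles whose aligned tokens fall under the
    same role of a target predicate aligned to the source predicate.
    """
    aligned = {}
    for s, t in alignment:
        aligned.setdefault(s, set()).add(t)

    matched = total = 0
    for src_pred, src_roles in src_props:
        # Target propositions whose predicate is aligned to this source predicate.
        tgt_candidates = [roles for tgt_pred, roles in tgt_props
                          if tgt_pred in aligned.get(src_pred, set())]
        for role, src_tokens in src_roles.items():
            total += 1
            projected = set().union(*(aligned.get(i, set()) for i in src_tokens))
            if any(projected & roles.get(role, set()) for roles in tgt_candidates):
                matched += 1
    return matched / total if total else 0.0
```

This toy score only looks from the source side, so it behaves like a precision measure over source roles; a usable metric would presumably also need a recall-style counterpart over target roles and a strategy for propositions left unlabelled by the SRL systems.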

1.3 The Syntax of Norton Forum Text

The accuracy of the semantic-based quality estimation is dependent on the quality of the semantic role labelling of both source and target, which itself relies on the accuracy of its underlying syntactic analysis, as shown by previous studies (Punyakanok et al., 2008). However, it is well known that syntactic parsers trained on edited newswire resources do not perform well on unedited user-generated data from other domains (Foster et al., 2011a). In order to be able to evaluate the performance of the parsers on the Norton forum text, we build two phrase structure treebanks from this text, named Foreebank, one for English and one for French (Chapter 8). The Foreebank annotation strategy accounts for the language and writing errors made by the forum users and for the stylistic conventions of web text. Errors are marked on the parse trees, which enables us to analyse user errors and the type of ungrammaticality found in forum text. Despite their small sizes, we also employ these treebanks in adapting the parsers to our target text by using them as supplementary training data to the widely used newswire treebanks for English and French. The following section lists the specific research questions addressed in this set of experiments.

1.3.1 Research Questions

The aim of building Foreebank is to understand the characteristics of the Norton forum text from various perspectives. It is also used to measure the amount of noise involved in the parses of this text and to reduce it. Specifically, the following questions are posed:

• How noisy is the user-generated content of the Norton forum text?

• To what extent do user errors in the forum text affect its parse quality?

• How noisy is out-of-domain parsing of the Norton forum text?

• How effectively can we adapt our parsers to the Norton forum text, both intrinsically and in terms of the accuracy of semantic-based QE which uses semantic role labels from the new syntactic parse trees?

In the next section, we summarise our approach to answering these questions along with our answers.

1.3.2 Summary and Findings

In building Foreebank, we adopt the Penn Treebank (Marcus et al., 1993), the French Treebank (Abeillé et al., 2003), and the English Web Treebank (Mott et al., 2012) annotation guidelines and extend them in a way that accounts for the characteristics of user-generated forum text such as language errors and text style. Concretely, we mark user errors on the parse tree nodes. The adapted annotation strategy enables us to extract useful information from the treebank, such as the user error rate in this type of text. It also means that the effect of user errors on parsing can be examined by comparing parsing performance on the original and edited versions of the Foreebank data. The results indicate that user errors account for only a small fraction of the data. However, correcting them can lead to a considerable increase in parsing performance. We additionally conduct a set of experiments to adapt the parsers to Norton forum text using these treebanks, by simply using them as a supplement to the original training data. This method proves to be fairly effective. Finally, based on the idea that more accurate parses will result in more accurate semantic role labelling, we rebuild the semantic-based QE systems using the output of the adapted parsers. Although there is a slight improvement in the performance, it seems that a bigger improvement in parsing is required to be effective in the downstream semantic-based QE task.
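The parser adaptation summarised above amounts to supplementing a large newswire treebank with the small in-domain Foreebank, and Chapter 8 additionally explores replicating the Foreebank several times so that it is not swamped by the out-of-domain material. The helper below is a minimal sketch of that idea, assuming treebanks are handled as lists of bracketed tree strings; the replication factor in the usage comment is purely illustrative.

```python
def build_training_treebank(newswire_trees, foreebank_trees, replication=1):
    """Concatenate an out-of-domain treebank with copies of a small in-domain one.

    newswire_trees / foreebank_trees: lists of bracketed parse-tree strings.
    replication: number of copies of the in-domain trees to append, so the
    small Foreebank carries more weight against the much larger newswire data.
    """
    return list(newswire_trees) + list(foreebank_trees) * replication

# Hypothetical usage: train a parser on the WSJ plus five copies of the Foreebank.
# training_trees = build_training_treebank(wsj_trees, foreebank_trees, replication=5)
```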

1.4 Thesis Structure

This thesis comprises nine chapters, including the current introductory chapter. The experiments carried out in this thesis are described in Chapters 2 to 8 and the conclusions and suggestions for future work are presented in Chapter 9. Each of the seven main chapters includes an introduction followed by a review of the literature related to the problems addressed in the chapter. This is followed by a description of the experiments, a discussion of the results and a summary of the main findings.2

Chapter 2 presents the experiments on the use of syntax in machine translation. It describes the data and the experimental setting as well as the automatic and manual comparison of the phrase-based and syntax-based methods. In Chapter 3, we turn our attention to quality estimation of machine translation. This chapter describes in detail the task of quality estimation of machine translations. It then introduces and analyses the data created for the QE experiments throughout the thesis. Chapter 4 presents the experiments on using syntax in quality estimation and their results. These experiments are carried out on the two data sets described in Chapter 3. Chapter 5 elaborates on the role of syntax in QE from various perspectives, including the impact of parser accuracy and the parts played by the source and target syntax. In Chapter 6, we report the experiments on semantic role labelling of French, conducted in order to find an optimum SRL solution given the limited available resources. In Chapter 7, we present our experiments on using semantics in quality estimation and their results. This chapter also offers an analysis of the results carried out to discover the reasons why semantic-based QE sometimes fails. Finally, Chapter 8 introduces the Foreebank and its annotation strategy, as well as presenting various analyses conducted using this treebank to understand the challenges associated with parsing this type of text. In addition, it describes the evaluation of parsing performance on the Norton forum text and the experiments on adapting the parsers to this text using the Foreebank. At the end of this chapter, the output of the adapted parsers is used to obtain a new semantic role labelling of the data and to build new semantic-based QE systems with the new labelling.

2 For easy comparison, all the results of the quality estimation experiments in this thesis are unified in Tables A.1 and A.2 in Appendix A.

1.5 Publications

The majority of the work reported in this thesis has been published at NLP/MT conferences including AMTA 2012, IJCNLP 2013, COLING 2014, *SEM 2014 and SSST 2014. The experiments on syntax-based SMT presented in Chapter 2 are described in Kaljahi et al. (2012). Kaljahi et al. (2014c) describes the syntax-based quality estimation presented in Chapter 4 and the experiments on the role of source and target syntax in Chapter 5. The impact of parser accuracy on syntax-based quality estimation described in Chapter 5 is published in Kaljahi et al. (2013). The French semantic role labelling experiments are described in Kaljahi et al. (2014a). Finally, the semantic-based QE experiments presented in Chapter 7 are published in Kaljahi et al. (2014b). The SymForum data set has also been made publicly available.3 We also plan to publish the Foreebank annotation and the parser adaptation experiments.

3 http://nclt.dcu.ie/mt/confidentmt.html

Chapter 2

Syntax in Statistical Machine Translation

There has been a long tradition of using syntactic knowledge in statistical machine translation (SMT) (Wu and Wong, 1998; Yamada and Knight, 2001). After the emergence of phrase-based statistical machine translation (Koehn et al., 2003; Och and Ney, 2004), several attempts have been made to further augment these tech- niques with information about the structure of the language. The motivation behind incorporating syntactic analysis in the translation process is to capture the structural correspondence existing between languages mainly manifesting itself as word order. Examples of such differences include subject-verb-object (SVO) versus subject-object- verb (SOV), for instance between English and Japanese and long-distance word order such as auxiliary verb translation between English and German. Phrase-based translation models map continuous phrases, i.e. sequences of strings, in the source languages to continuous phrases in the target language. In transition from phrase-based models to syntax-based models, hierarchical phrase-based mod- elling (Chiang, 2007) supports gaps inside the phrases based on the recursive struc- ture of language but does not concern itself with the linguistic details. On the other hand, syntax-based modelling uses syntactic structures such as parse tree fragments in mapping from source to the target (Galley et al., 2004; Zollmann and Venugopal,

2006). Syntactic information is incorporated into the model from parse trees on the source side (tree-to-string), target side (string-to-tree), or both (tree-to-tree). Other approaches employ dependency treelets (Xiong et al., 2007) or use syntax as a soft constraint in the translation process (Chiang, 2010; Zhang et al., 2011).

Using such linguistic generalisation, however, has proven to be a more complicated task than one might first imagine. While relative improvements over phrase-based baselines have been reported for some language pairs, those baselines seem to remain the best option for other language pairs (DeNeefe et al., 2007; Zollmann et al., 2008). The performance of syntax-based models is affected by errors introduced by existing imperfect syntactic parsers (Quirk and Corston-Oliver, 2006). Moreover, some non-syntactic phrases (e.g. I’m) identified by the phrase-based models bring useful information to the translation that is missed by syntax-based models trained on trees obtained using supervised parsing (Bod, 2007). Phrasal coherence between the source and target languages (Fox, 2002) is another factor affecting the performance of syntax-based models. Nevertheless, these models should in theory be better able to capture long-distance reordering, a problem for phrase-based models.

A framework combining such varying techniques can exploit the advantages of all of them while compensating for the weaknesses of each individual method. To accomplish this goal, a more detailed insight into the characteristics of each method may be useful. Towards this objective, we look for possible systematic differences between variants of phrase-based and syntax-based systems via various analysis approaches. More specifically, we compare the output of these systems to discover 1) whether they generate sufficiently different translations for the same sentence in order for their combination to be useful, and 2) whether the syntax-based approaches better handle grammar-related phenomena in translation.

In the rest of this chapter, we first review the work done in the area of syntax-based machine translation in Section 2.1. In Section 2.2, we introduce the data we use for the experiments followed by a presentation of the SMT systems we built in

Section 2.3. In Section 2.4, we compare these systems using automatic evaluation metrics in various ways. In Section 2.5, the systems are compared manually in terms of a number of grammatical and lexical phenomena. Finally, in Section 2.6, we offer an error analysis in which the output of the different systems for a number of examples is analysed.

2.1 Related Work

Yamada and Knight (2001) argue that string-to-string IBM translation models are only suitable for structurally similar language pairs. They propose a syntax-based model for English-to-Japanese translation, a language pair with different word orders, which uses the conventional noisy channel model for SMT but with syntactic parse trees as its input (tree-to-string model). Working at the constituency node level, the channel takes each node and stochastically reorders all its children, inserts extra words at each node and translates leaf nodes (words). The reordering operation models the word order difference between the two languages and the insertion operation models the structural difference between them, specifically in terms of case marking. The translation operation performs the actual translation on a word-by-word basis, ignoring the context.

Carreras and Collins (2009) propose a syntax-based translation approach based on tree adjoining grammar or TAG (Joshi and Schabes, 1997), which works by mapping sequences of source strings to parse tree fragments in the target language and allows the integration of a syntactic language model (Charniak, 2001). These fragments are then merged using TAG parsing operations to form a full parse tree resulting in the full translation of the source segment. This method uses discriminative dependency parsing operations while combining the target parse tree fragments to allow flexible reordering. Their experiments on German-English translation show statistically significant improvements over a phrase-based system when evaluated both automatically using BLEU and manually.

DeNeefe et al. (2007) compare a phrase-based model with another string-to-tree translation approach, in which translation rules, as the basic units of translation, are extracted from a parallel corpus the target side of which is parsed into phrase structure trees. A translation rule contains sequences of words on its left-hand side and syntactic tree fragments on its right-hand side, as opposed to the phrase translation pairs in phrase-based translation where both sides contain sequences of words. While the syntax-based model performs better than the phrase-based model on Chinese-English translation, it is shown to be worse on Arabic-English translation. They find that non-lexical rules form only a small fraction of the translation rule table in syntax-based modelling. The string-to-tree modelling in this work is based on their approach.

Zollmann et al. (2008) observe that the gain achieved by hierarchical and syntax-based models can be largely compensated for by increasing the reordering limit in the phrase-based model. They argue that the phrase-based systems over which improvements are reported by Marcu et al. (2006), Chiang (2007) and DeNeefe et al. (2007) are restricted to a distortion limit of 4 or 7 words, while their hierarchical or syntax-based systems are able to perform a reordering of 10 words or more. In other words, by allowing those phrase-based systems to move the translated words farther than their position in the source side, they can perform as well as the syntactically-enhanced systems. They also find that, for language pairs involving substantial reordering like Chinese-English, syntactic tree-based models perform better than phrase-based models. However, for relatively monotonic pairs like Arabic-English, all models produce similar results. This is in line with the results reported by DeNeefe et al. (2007).

Experimenting with French-English, German-English and English-German, Auli et al. (2009) compare a phrase-based model to a hierarchical phrase-based model by exploring as much of the search space of both types of models as is computationally feasible. Given that the search spaces are very similar, they conclude that the differences between the two types of models can be explained by the way they score

hypotheses rather than by the hypotheses they produce.

Using the same framework as in this work, Hoang et al. (2009) compare phrase-based, hierarchical phrase-based and string-to-tree models for English-to-German translation. While the phrase-based and hierarchical phrase-based models achieve similar results, they both perform slightly better than the syntax-based model.

Neubig and Duh (2014) suggest that the performance of syntax-based machine translation depends on a number of peripheral factors including the accuracy of syntactic parses, word alignments and the search algorithm. Using a tree-to-string system, they experiment with English-to-Japanese and Japanese-to-English translation and find that using the output of a more accurate parser increases the performance of their system. A relatively bigger improvement is gained when the parse trees are replaced with forests, similar to Mi et al. (2008). These improvements amount to gains of 2 and 1 BLEU points for English-to-Japanese and Japanese-to-English translation respectively. Additionally, they observe that more accurate word alignments improve the performance of the syntax-based system, while they do not affect the phrase-based or hierarchical translations. In terms of the search algorithm, they compare the hypergraph (Heafield et al., 2013) and cube pruning (Huang and Chiang, 2007) algorithms and find that the former is more useful for syntax-based translation than the latter, which is the standard search algorithm used for tree-to-string translation. They conclude that syntactic information can be beneficial to machine translation provided that these peripheral factors are considered.

There have been several efforts to exploit the difference between phrase-based and syntax-based models in MT system combination or multi-engine machine translation (MEMT) (Huang and Papineni, 2007). The task, however, has been shown to be difficult. Zwarts and Dras (2008) try to identify what type of sentence can be better translated by a syntax-based model compared to a phrase-based model. Using a classification approach, they separately test three sets of features. Sentence length and system-internal features including the decoder output score do not lead to an accurate classifier. They then hypothesise that noisy parse trees may impede the

performance of the syntax-based system and build another classifier based on source sentence length, parser confidence score, and linked fragment count. However, they do not find any correlation between these features and the performance of different systems. Based on their observation that most of the problems in the output are related to reordering, they assume that the syntactic quality of the output could be discriminative in system selection. They port the parse quality features used on the source side to the target side, but again find no improvement.

We build upon previous work by analysing a more comprehensive set of SMT methods. While the majority of the other works reviewed here experiment with only one syntax-based method, we use three different syntax-based approaches as well as two baseline phrase-based methods. In addition, we perform various comparisons using both automatic and manual judgements to study the difference between these translation methods from different perspectives, rather than simply using automatic evaluation metrics to find the best-performing method. We additionally compare these methods on a diverse set of evaluation data.

2.2 Data

The training data for the translation models of our machine translation systems consist of English-French (En-Fr) and English-German (En-De) Symantec translation memories. These translation memories contain a mixture of Symantec content from product manuals, software strings, marketing materials, knowledge bases and websites. The En-Fr parallel data contains 975,102 sentence pairs and the En-De 1,029,741 pairs, with no exact duplicates.

For training the language models of both the English-German and English-French systems, we use a combination of the target side of their respective translation model training data and a limited amount of user forum text available for each language: 42K sentences for French and 67K sentences for German. We have two evaluation sets for each language pair:

1. French translation memory: 5,000 held-out sentences from the Symantec En-Fr translation memory, split into development (2000) and test (3000) sets

2. German translation memory: 5,000 held-out sentences from the Symantec En-De translation memory, split into development (2000) and test (3000) sets

3. French forum data: 1,500 sentences taken from the Symantec English online forums, split into development (600) and test (900) sets. These were automatically translated into French using an online translation tool and then post-edited by human translators.

4. German forum data: 1,500 sentences taken from the Symantec English online forums, split into development (600) and test (900) sets. Similar to the French ones, these were automatically translated into German using an online translation tool and then post-edited by human translators.

The translation memory data can be considered a superset of forum data in terms of subject matter. However, in terms of style, the forum data is more informal and, given that it is user-generated content, we assume that it exhibits a higher level of ungrammaticality. Because of this difference, we call the evaluation sets taken from the translation memories in-domain and those from forum text out-of-domain. While the English sides of the in-domain sets are different for each of the language pairs, those of the out-of-domain sets are the same for both pairs.

2.3 SMT Systems

We train five statistical machine translation systems, one phrase-based, one hierarchical phrase-based and three syntax-based, as follows:

1. PB: a standard phrase-based system (Och and Ney, 2004)

2. HP: a hierarchical phrase-based system (Chiang, 2007)

3. TS: a tree-to-string syntax-based system (Huang et al., 2006).

4. ST: a string-to-tree syntax-based system (DeNeefe et al., 2007).

5. TT: a tree-to-tree syntax-based system

We choose these five systems because they are the most widely used methods and can be built using the open source Moses toolkit (Hoang et al., 2009).

The PB system was trained using the grow-diag-final-and alignment heuristic and used the msd-bidirectional-fe reordering model. All other parameters were left at their default values, including a maximum phrase length of 7 and a decoder distortion limit of 6 where applicable. The HP system was trained using the default settings. A maximum decoder chart span of 20 was used for the HP, TS, ST and TT systems.

To produce the syntactic parses needed by the syntax-based systems (TS, ST, TT), we use the Lorg parsing system1 to parse the English and French sides of the corpora. The German side was parsed by the Berkeley parser (Petrov et al., 2006). The Lorg parser is very similar to the Berkeley parser, the main difference being its unknown word handling mechanism (Attia et al., 2010).2 This parser learns a latent-variable probabilistic context-free grammar (PCFG-LA) from a treebank in an iterative process. PCFG-LAs have been shown to perform well for several languages and domains (Petrov, 2009; Huang and Harper, 2009; Le Roux et al., 2012) and represent a state-of-the-art parsing methodology. We use the Wall Street Journal section of the Penn Treebank (Marcus et al., 1994) for training the English parsing model, the French Treebank (FTB) (Abeillé et al., 2003) for training the French model and the Tiger treebank (Brants et al., 2002) for training the German parsing model.

During the extraction of translation rules, limiting the phrase boundaries to only syntactic constituents imposes a strict constraint on the rule extraction, leading to a small rule table. Consequently, the performance of the syntax-based systems is substantially lower compared to the phrase-based ones. To relax this constraint, we use the SAMT-2 parse relaxation method (Zollmann et al., 2008) implemented in

1https://github.com/CNGLdlab/LORG-Release
2The two parsers achieve Parseval labelled F-scores in the 89-90 range on Section 23 of the Wall Street Journal section of the Penn Treebank. Due to a character encoding issue, the Lorg parser could not be used to parse the German data and this is why the Berkeley parser is used instead.

Moses. In this method, any pair of adjacent nodes in the parse tree is combined to form a new node. This significantly increases the number of extracted rules and consequently the translation accuracy. All five systems are tuned using minimum error rate training (MERT) (Och, 2003) on the respective development sets.
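To make the relaxation step more concrete, the toy sketch below enumerates, for a small NLTK constituency tree, the spans covered by ordinary constituents plus the spans covered by pairs of adjacent sibling constituents, which receive concatenated labels such as NP+VP. This is only an illustration under simplifying assumptions, not the Moses implementation, and it omits the slashed categories used by the full SAMT scheme.

# Toy illustration of SAMT-style parse relaxation, assuming NLTK is available.
# Only adjacent sibling pairs are combined, and their labels are joined with '+'.
from nltk import Tree

def relaxed_spans(tree):
    """Map leaf spans (start, end) to constituent labels, adding 'A+B' pairs."""
    spans = {}

    def walk(node, start):
        if not isinstance(node, Tree):            # a word
            return start + 1
        end = start
        child_spans = []
        for child in node:
            child_start = end
            end = walk(child, end)
            child_spans.append((child_start, end, child))
        spans[(start, end)] = node.label()        # larger constituents overwrite POS tags
        # Relaxation: two adjacent sibling constituents form a new node whose
        # label is the concatenation of their categories, e.g. "NP+VP".
        for (s1, _, c1), (_, e2, c2) in zip(child_spans, child_spans[1:]):
            if isinstance(c1, Tree) and isinstance(c2, Tree):
                spans.setdefault((s1, e2), c1.label() + '+' + c2.label())
        return end

    walk(tree, 0)
    return spans

if __name__ == '__main__':
    t = Tree.fromstring('(S (NP (PRP I)) (VP (VBP am) (NP (DT a) (NN user))) (. .))')
    for (s, e), label in sorted(relaxed_spans(t).items()):
        print(s, e, label)

Running this on the example tree produces, in addition to the ordinary constituent spans, the relaxed spans NP+VP and VP+., which is the kind of extra phrase boundary that enlarges the rule table.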

2.4 Automatic System Comparison

In this section, we compare the SMT systems built in the previous section from various perspectives using automatic evaluation metrics. We first compare their performance using multiple MT evaluation metrics. We then compare the output of each pair of these systems to verify whether different systems produce different translations for the same source sentence. This is done at both the system and the sentence level. Moreover, we mine the n-best output of the systems to search for such differences. We finally build an oracle combined system based on automatic sentence-level comparison using both 1-best and n-best translations.

2.4.1 Multiple Metrics

In order to carry out a reliable comparison, we evaluate the baseline systems at the document level using four widely used metrics: BLEU (Papineni et al., 2002), TER (Snover et al., 2006), GTM (Turian et al., 2003) and METEOR (Denkowski and Lavie, 2011)3. The results of the evaluation with these metrics for the in-domain and out-of-domain development sets are presented in Table 2.1 and Table 2.2 respectively. The upper half of the tables shows the results of the English-French systems and the lower half the results of the English-German systems. We report scores on the development sets since our analysis has been performed on these. Significance tests were carried out for BLEU only, using paired bootstrap resampling with 1000 iterations and a p-value of 0.01.4

3We used all default parameters for the evaluation tools. For GTM, we used 1.2 as the exponent.
4The tool used is available at http://www.ark.cs.cmu.edu/MT/
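The paired bootstrap resampling test used for the BLEU comparisons can be sketched as follows. This is an illustrative implementation only, assuming sacrebleu is installed for corpus-level BLEU; the variable names are hypothetical and the actual tool used is the one referenced in footnote 4.

# Sketch of paired bootstrap resampling for BLEU significance testing.
# Assumes sacrebleu; inputs are parallel lists of sentence strings.
import random
import sacrebleu

def paired_bootstrap(sys_a, sys_b, refs, n_iter=1000, seed=1):
    """Fraction of resamples (with replacement) in which system A outscores B."""
    rng = random.Random(seed)
    ids = list(range(len(refs)))
    wins_a = 0
    for _ in range(n_iter):
        sample = [rng.choice(ids) for _ in ids]
        bleu_a = sacrebleu.corpus_bleu([sys_a[i] for i in sample],
                                       [[refs[i] for i in sample]]).score
        bleu_b = sacrebleu.corpus_bleu([sys_b[i] for i in sample],
                                       [[refs[i] for i in sample]]).score
        if bleu_a > bleu_b:
            wins_a += 1
    return wins_a / n_iter

# System A would be judged significantly better than B at p < 0.01 if it wins
# in more than 99% of the resamples, e.g.:
# if paired_bootstrap(hp_out, pb_out, references) > 0.99: ...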

Table 2.1: Evaluation scores on in-domain development set (translation memory)

En-Fr   BLEU    TER     GTM     METEOR
PB      0.6140  0.3584  0.6357  0.7436
HP      0.6188  0.3535  0.6400  0.7457
TS      0.5919  0.3719  0.6194  0.7284
ST      0.6013  0.3631  0.6258  0.7334
TT      0.5783  0.3842  0.6096  0.7168
En-De   BLEU    TER     GTM     METEOR
PB      0.5099  0.4911  0.5441  0.6264
HP      0.5289  0.4676  0.5592  0.6408
TS      0.4939  0.4923  0.5349  0.6146
ST      0.5086  0.4753  0.5479  0.6265
TT      0.4784  0.5059  0.5219  0.6051

Performance is considerably higher for the in-domain evaluation sets compared to the out-of-domain ones, and for En-Fr compared to En-De. Neither of these results is surprising, since it is well known that out-of-domain translation is challenging and that English-German translation is more difficult than English-French translation. It is worth noting that the gap between the En-Fr and En-De scores on out-of-domain data is bigger than on in-domain data, showing that out-of-domain En-De is a more difficult machine translation setting compared to the others.

The hierarchical phrase-based system (HP) performs better than the others on the in-domain data according to all metrics. This is statistically significant in the case of BLEU scores with p-value < 0.01. The gap is more pronounced on the En-De pair, which is an intuitively appealing result because the hierarchical phrase-based model is in theory better able to model the systematic word order differences between

English and German than the phrase-based model. The phrase-based system (PB) is the second best performing system on in-domain data.

The string-to-tree system (ST) is the best of the syntax-based systems on in-domain data according to all metrics. The tree-to-tree model (TT), on the other hand, is the worst-performing of these systems, despite its relatively larger

Table 2.2: Evaluation scores on out-of-domain development set (forum text)

En-Fr   BLEU    TER     GTM     METEOR
PB      0.3044  0.6024  0.3972  0.5335
HP      0.3032  0.5904  0.4008  0.5341
TS      0.2907  0.6118  0.3924  0.5202
ST      0.2982  0.6057  0.3952  0.5248
TT      0.2900  0.6121  0.3910  0.5166
En-De   BLEU    TER     GTM     METEOR
PB      0.1681  0.7428  0.3062  0.4057
HP      0.1662  0.7384  0.3082  0.4028
TS      0.1643  0.7197  0.3128  0.3977
ST      0.1654  0.7286  0.3117  0.3976
TT      0.1633  0.7358  0.3090  0.3966

translation rule table size. In the case of BLEU scores, these differences are also statistically significant. This shows that less useful rules are extracted by this model compared to the other two models. On the out-of-domain data, however, the behaviour of the systems is not consistent, with different metrics favouring different systems for different language pairs.

On En-Fr, HP is still the best overall, and ST is the best performing syntax-based system (all statistically significant in the case of BLEU). On the other hand, more inconsistent behaviour is observed on En-De: TS scores the best of all according to two (half) of the metrics (though marginally), and HP is no longer the best. However, the BLEU differences are not statistically significant.

2.4.2 One-to-one Comparison

Given the same training material, we are interested in the extent to which the methodological differences between these systems lead to different outputs. The more similar the outputs of different systems, the less effective the complex methods (tree-based methods here) will be, compared to the phrase-based method which is usually a baseline in machine translation research. Also, if systems tend to generate

Table 2.3: One-to-one BLEU Scores on in-domain development set (translation memory data)

      En-Fr                               En-De
      PB      HP      TS      ST          PB      HP      TS      ST
HP    0.8535  -       -       -           0.7576  -       -       -
TS    0.7799  0.7958  -       -           0.6817  0.7071  -       -
ST    0.7769  0.7917  0.7980  -           0.6624  0.6940  0.6778  -
TT    0.7339  0.7430  0.8065  0.7893      0.6405  0.6484  0.7113  0.7068

Table 2.4: One-to-one BLEU Scores on out-of-domain development set (forum text)

      En-Fr                               En-De
      PB      HP      TS      ST          PB      HP      TS      ST
HP    0.7501  -       -       -           0.6207  -       -       -
TS    0.6640  0.7028  -       -           0.6023  0.6162  -       -
ST    0.6618  0.6959  0.6731  -           0.5700  0.5802  0.6191  -
TT    0.6165  0.6344  0.7122  0.6764      0.5211  0.5365  0.6027  0.6014

highly similar outputs, their combination cannot yield a noticeably better result. To inspect this phenomenon for the systems built here, we score each system against all the others using the BLEU metric. In other words, each system output plays the role of reference translation for the other four systems. Table 2.3 and Table 2.4 show the comparison results for the in-domain and out-of-domain evaluation sets respectively. According to the results, HP and PB are consistently the most similar to each other (highest BLEU), whereas TT and PB are the most different (lowest BLEU). This shows that the degree of similarity between the translation methods is reflected in the similarity of their outputs. It cannot be said which two of the syntax-based systems are the most distant from each other, as this differs according to the data set. However, TT is usually one side of the pair. In addition, systems produce more divergent output on out-of-domain data and on the En-De pair than on in-domain data and on the En-Fr pair respectively. Considering that out-of-domain translation as well as En-De translation is more difficult than their counterparts, it can be concluded that the more difficult

the translation task, the more different the output generated by the different translation systems. Based on the observation in the previous section that the performance gap between systems was smaller on the more difficult tasks (e.g. out-of-domain translation), this may indicate that different systems handle sentences increasingly differently as the translation becomes more difficult. It should be noted that all systems are built upon the same word alignment, under the same framework (Moses), and make use of the same training data for the translation and language models. This can be an important contributing factor to the similarity of the output of these systems.
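A rough sketch of this one-to-one comparison is given below. It assumes sacrebleu for corpus-level BLEU; the outputs dictionary (system name to list of 1-best sentences) is a placeholder for the real decoder output, and since BLEU is not symmetric the sketch simply computes both directions for each pair.

# Illustrative sketch: score each system's output against every other system's
# output used as a pseudo-reference. Assumes sacrebleu is installed.
import sacrebleu

def one_to_one_bleu(outputs):
    """outputs: {system_name: [sentence, ...]}; returns {(sys, pseudo_ref): BLEU}."""
    scores = {}
    for name_a, hyps in outputs.items():
        for name_b, pseudo_refs in outputs.items():
            if name_a == name_b:
                continue
            scores[(name_a, name_b)] = sacrebleu.corpus_bleu(hyps, [pseudo_refs]).score
    return scores

# Hypothetical usage:
# outputs = {'PB': pb_out, 'HP': hp_out, 'TS': ts_out, 'ST': st_out, 'TT': tt_out}
# for (a, b), score in sorted(one_to_one_bleu(outputs).items()):
#     print(f'{a} against {b}: {score:.2f}')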

2.4.3 Sentence Level Comparison

To gain further insight into the differences between the systems, we compare their output sentence-by-sentence using the TER evaluation metric (Snover et al., 2006). Table 2.5 shows the results of this comparison. The first row displays the number of sentences on which all systems scored the same. The second row contains the number of sentences for which all systems generated exactly the same output sentence. The following five rows, one for each system, present the number of sentence translations on which that system scores the highest (first column), possibly along with other systems, and the number of sentence translations on which that system scores the highest alone (second column). We call the former any-wins and the latter solo-wins. For example, the phrase-based system (PB) ranks first 582 times (any-wins) in total on the in-domain En-Fr evaluation set, out of which it is the only system at the highest rank 130 times (solo-wins). The any-win ranking is not consistent with the solo-win ranking, especially on the in-domain sets. For example, on in-domain En-De, while HP ranks the highest in terms of any-wins (612 sentences), ST is the one with the most solo-wins (174 sentences). This may suggest that HP is mostly the best on the sentences on which the other systems perform similarly, whereas ST is better capable of translating those sentences which are troublesome for the other systems. In addition, it can be observed that, on about one third of the sentences in the in-domain sets,

Table 2.5: Sentence-level TER-based System Comparison (on 2000 in-domain and 600 out-of-domain development set samples)

In-domain                   En-Fr                    En-De
Score ties                  740 (37%)                627 (37%)
Exact matches               738 (37%)                578 (29%)
PB  Any/Solo wins           582 (29%) / 130 (7%)     513 (26%) / 123 (6%)
HP  Any/Solo wins           586 (29%) / 95 (5%)      612 (31%) / 125 (6%)
TS  Any/Solo wins           489 (24%) / 103 (5%)     514 (26%) / 116 (6%)
ST  Any/Solo wins           517 (26%) / 125 (6%)     572 (29%) / 174 (9%)
TT  Any/Solo wins           394 (20%) / 94 (5%)      447 (22%) / 100 (5%)
Out-of-domain               En-Fr                    En-De
Score ties                  32 (5%)                  35 (6%)
Exact matches               26 (4%)                  16 (3%)
PB  Any/Solo wins           190 (32%) / 71 (12%)     163 (27%) / 51 (9%)
HP  Any/Solo wins           244 (41%) / 88 (15%)     172 (29%) / 43 (7%)
TS  Any/Solo wins           173 (29%) / 56 (9%)      208 (35%) / 73 (12%)
ST  Any/Solo wins           177 (30%) / 60 (10%)     205 (34%) / 78 (13%)
TT  Any/Solo wins           160 (27%) / 62 (10%)     196 (33%) / 78 (13%)

systems achieve the same scores (score ties), most of them being exactly the same translations (exact matches). The ratio is, however, far lower for the out-of-domain data sets: only about 4%. Given the performance gap between these two domains (Table 2.1 and Table 2.2), this discrepancy is expected to some degree: the closer the outputs are to the reference, the less divergent they can be. However, this large disparity in the ratios does not seem to be justified by this fact alone, suggesting that the real difference between the systems is revealed on more difficult tasks. This is on a par with the conclusion drawn in the previous section.
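The any-win and solo-win counts can be computed along the following lines. This is only a sketch, assuming sacrebleu's TER metric and its sentence_score method for sentence-level TER (lower is better); the data structures are illustrative.

# Sketch of the any-win / solo-win counting from sentence-level TER scores.
from sacrebleu.metrics import TER

ter = TER()

def win_counts(outputs, refs):
    """outputs: {system: [sentences]}; returns {system: (any_wins, solo_wins)}."""
    systems = list(outputs)
    counts = {s: [0, 0] for s in systems}
    for i, ref in enumerate(refs):
        scores = {s: ter.sentence_score(outputs[s][i], [ref]).score for s in systems}
        best = min(scores.values())                  # lowest TER is the best
        winners = [s for s in systems if scores[s] == best]
        for s in winners:
            counts[s][0] += 1                        # any-win: best, possibly tied
        if len(winners) == 1:
            counts[winners[0]][1] += 1               # solo-win: best on its own
    return {s: tuple(c) for s, c in counts.items()}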

2.4.4 N-best Comparison

So far our analysis has been carried out on the best translation returned by each system. We now compare the 500-best (distinct) output of systems. For each evaluation set, Table 2.6 shows the degree of overlap between the n-best outputs of the five systems, in terms of the number and percentage of sentences having a

Table 2.6: 500-best overlaps: number and percentage of sentences having a common translation in their 500-best lists, as well as the average number of common 500-best translations per sentence across data sets (for 2000 in-domain and 600 out-of-domain development set samples)

                              In-domain            Out-of-domain
                              En-Fr    En-De       En-Fr    En-De
Number of sentences           1579     1367        202      169
Percentage of sentences       78%      68%         33%      28%
Average number of overlaps    17       17          4        5

common translation in their 500-best lists, as well as the average number of common 500-best translations per sentence (overlaps) across data sets. The figures show that there is a larger overlap between the n-best lists on the in-domain data than on the out-of-domain data, and on the En-Fr pair than on the En-De pair. This is consistent with our other observations and appears to suggest that the more difficult the sentences are to translate, the more differently the systems perform on them. The small number of common n-best translations on average shows that the systems generate fairly different n-best output. This suggests that a combination framework can further benefit from the n-best lists.
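The overlap statistics in Table 2.6 can be computed with plain set intersection, as in the sketch below; nbests, which maps each system to a per-sentence list of distinct n-best translations, is a placeholder for the real decoder output.

# Sketch of the n-best overlap statistics: intersect the distinct n-best lists
# of all systems for each source sentence.
def nbest_overlap(nbests):
    """nbests: {system: [[hyp, ...] per sentence]}; returns (count, avg_overlap)."""
    n_sentences = len(next(iter(nbests.values())))
    shared = []
    for i in range(n_sentences):
        common = None
        for per_sentence_lists in nbests.values():
            candidates = set(per_sentence_lists[i])
            common = candidates if common is None else common & candidates
        shared.append(len(common))
    with_overlap = sum(1 for c in shared if c > 0)
    avg = sum(shared) / with_overlap if with_overlap else 0.0   # per overlapping sentence
    return with_overlap, avg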

2.4.5 Oracle Combination

Using the sentence-level TER scores for each data set, we select the best translation for each sentence and form the oracle combined output of all systems. In the case of score ties, we choose the output of the systems in this order: PB, HP, ST, TS, and TT. The list is sorted by the computational cost of training and translating with each system. We also build an oracle by merging and reranking the 500-best translations of all systems using TER scores. The oracle combination outputs are evaluated using all the metrics. The scores are presented in the last two rows of Table 2.7 and Table 2.8. Oracle 1-best is the combination of the top translations selected by the systems. The performances of the individual systems are also repeated from Table 2.1 and Table 2.2 (in grey) for comparison.
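A minimal sketch of the oracle 1-best selection described above is shown below. The sentence-level TER scorer is passed in as a function to stay independent of any particular implementation; system names and data structures are illustrative.

# Sketch of the oracle 1-best combination: pick, per sentence, the system output
# with the lowest TER against the reference, breaking ties in a fixed order.
TIE_BREAK_ORDER = ['PB', 'HP', 'ST', 'TS', 'TT']    # cheapest-to-run systems first

def oracle_combination(outputs, refs, sentence_ter):
    """outputs: {system: [sentences]}; sentence_ter(hyp, ref) -> float (lower is better)."""
    oracle = []
    for i, ref in enumerate(refs):
        best_sys, best_score = None, float('inf')
        for system in TIE_BREAK_ORDER:              # strict '<' keeps earlier systems on ties
            score = sentence_ter(outputs[system][i], ref)
            if score < best_score:
                best_sys, best_score = system, score
        oracle.append(outputs[best_sys][i])
    return oracle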

Table 2.7: Baseline and oracle system combination scores on in-domain development set (translation memory)

En-Fr            BLEU    TER     GTM     METEOR
PB               0.6140  0.3584  0.6357  0.7436
HP               0.6188  0.3535  0.6400  0.7457
TS               0.5919  0.3719  0.6194  0.7284
ST               0.6013  0.3631  0.6258  0.7334
TT               0.5783  0.3842  0.6096  0.7168
Oracle 1-best    0.6658  0.2917  0.6840  0.7818
Oracle 500-best  0.7770  0.1779  0.7852  0.8616
En-De            BLEU    TER     GTM     METEOR
PB               0.5099  0.4911  0.5441  0.6264
HP               0.5289  0.4676  0.5592  0.6408
TS               0.4939  0.4923  0.5349  0.6146
ST               0.5086  0.4753  0.5479  0.6265
TT               0.4784  0.5059  0.5219  0.6051
Oracle 1-best    0.5739  0.3858  0.6111  0.6775
Oracle 500-best  0.6870  0.2584  0.7145  0.7712

As expected, there are large gaps between the best-performing systems on each data set and the oracle combinations. The gaps are especially large for the 500-best lists, indicating that the translation systems have ranked higher-quality translations lower than the one they selected as the best translation. This is consistent with the observation in the previous section, where a considerable difference was found in the 500-best outputs of the systems. In the case of BLEU and for the 1-best combination, the relative improvements are 7% and 9% on in-domain En-Fr and En-De and 10% and 15% on out-of-domain En-Fr and En-De respectively. For the 500-best combination, these figures are 16%, 19%, 17% and 27%. Apparently, the benefit from combination increases as the level of translation difficulty increases. This is a further confirmation that the different systems built here behave more differently on more difficult data.

Table 2.8: Baseline and oracle system combination scores on out-of-domain development set (forum text)

En-Fr            BLEU    TER     GTM     METEOR
PB               0.3044  0.6024  0.3972  0.5335
HP               0.3032  0.5904  0.4008  0.5341
TS               0.2907  0.6118  0.3924  0.5202
ST               0.2982  0.6057  0.3952  0.5248
TT               0.2900  0.6121  0.3910  0.5166
Oracle 1-best    0.3343  0.5408  0.4265  0.5585
Oracle 500-best  0.3921  0.4717  0.4687  0.6117
En-De            BLEU    TER     GTM     METEOR
PB               0.1681  0.7428  0.3062  0.4057
HP               0.1662  0.7384  0.3082  0.4028
TS               0.1643  0.7197  0.3128  0.3977
ST               0.1654  0.7286  0.3117  0.3976
TT               0.1633  0.7358  0.3090  0.3966
Oracle 1-best    0.1935  0.6700  0.3376  0.4248
Oracle 500-best  0.2457  0.6049  0.3791  0.4750

2.5 Manual System Comparison

In the previous section, we compared systems based on scores generated using automatic metrics. It is interesting and useful to know how these different systems handle various linguistic phenomena in translation. For example, one common argument in comparing syntax-based and phrase-based systems is that the former generates more fluent word order in the output. In order to investigate such assumptions, we select 100 sentences from each development set and compare the outputs of two of the systems, namely HP and ST, for each of these sentences. 50 of the selected sentences are the solo-win cases of HP and the other 50 are those of ST (see Section 2.4.3). The reason why these two systems are selected is that HP is the overall best-performing system, and ST is the best-performing syntax-based system according to the various evaluations in the previous section. Each data set was evaluated by a linguist using eight error categories, adapted from those used by Dugast et al. (2007) to evaluate post-editing changes.

The evaluators were asked to count the number of errors in each output sentence under each category. While they were given the reference translation, they were not constrained to it and were allowed to compare against the closest correct translation to the output itself. We believe that this can better reflect the real performance of the systems, as it is not limited to a single reference, though we might lose some correlation with automatic metrics. The following are the categories used in the evaluation, the first half of which can be considered to be grammar-related and the second half lexical.

1. Verb tense: wrong verb tense translations

2. Gender/number agreement: wrong gender and number agreements

3. Local word order: wrong local word orders

4. Long-distance word order: wrong long-distance word orders

5. Mis-translated: wrong word/phrase translations including wrong sense and unusual usage

6. Untranslated: words/phrases transferred to the output without translation

7. Spurious translation: words/phrases added to the output without any counterpart in the source

8. Missing translation: words/phrases in source ignored by the system

The results of the manual evaluation are shown in Table 2.9. We observe the following:

• Since verb tense and gender/number agreement are handled in a methodologically similar way, the two categories can be collapsed for the purposes of comparison. From this point of view, HP generates better output. This gap is more pronounced on the in-domain data.

• French word order (both local and long distance) is better handled by ST and German word order by HP.

• Though no generalizable pattern is seen for mis-translation, it can roughly be said that ST is less erroneous than HP on this category.

Table 2.9: Manual evaluation results: number of errors made by each system on each data set; the lower number of errors is marked in boldface for each category/setting.

                            In-domain            Out-of-domain
                            En-Fr      En-De     En-Fr      En-De
                            HP   ST    HP   ST   HP   ST    HP   ST
Verb tense                  13   11    1    3    25   25    21   20
Gender/number agreement     27   34    25   31   64   66    60   63
Local word order            29   25    24   31   53   47    93   99
Long-distance word order    7    6     3    4    8    5     70   82
Mis-translated              85   84    61   55   185  177   207  207
Untranslated                10   9     15   13   92   95    81   72
Spurious translation        21   26    29   22   16   21    25   35
Missing translation         35   41    42   31   18   13    140  122
Sum                         227  236   201  194  461  449   697  700

• ST outputs overall fewer untranslated words, although the gap is marginal. On the other hand, it tends to generate more spurious translations; the only exception is on in-domain En-De. HP, for its part, misses more words and phrases.

It appears that no confident conclusion can be drawn from the above observations. However, contrary to what one might expect, the syntax-based model is not necessarily better than the hierarchical model in treating syntactic phenomena in translation. The next section provides closer scrutiny of the internal behaviour of the systems.

2.6 Error Analysis

In order to discover the differences between the phrase-based and syntax-based translation methods, we look at the translation rule tables of each system and follow their decoding process. It can be observed that relaxation blurs the boundaries between the phrase-based and syntax-based models. The rules in the syntax-based models are also based on ad-hoc phrases and the only difference is in the set of non-terminals.

Table 2.10: Example of verb tense translation by two systems

Source                If you choose to continue, you will need to set the options manually from the Altiris eXpress Deployment Server Configuration control panel applet.
Reference             Si vous décidez de continuer, vous devrez configurer les options manuellement à partir de l’applet du panneau de configuration Altiris eXpress Deployment Server.
HP output             Si vous décidez de continuer, vous devrez configurer les options manuellement à partir de l’applet Altiris eXpress Deployment Server Configuration Control Panel.
HP rule application
ST output             Si vous décidez de continuer, vous devez configurer les options manuellement dans l’applet de panneau de configuration de Altiris eXpress Deployment Server.
ST rule application

Table 2.10 illustrates an example in which neither of the rules used by the systems to translate you will need is built upon a syntactic phrase. Nevertheless, unlike ST, HP translates it correctly. It is worth noting that there were eight similar rules in the rule table of ST (including the one used in the example) covering the span ", you will need X", half of which could translate it correctly. However, due to its higher score, this rule was selected.

Another example, concerning output word order, which is a major motivation behind incorporating syntax in machine translation, is presented in Table 2.11. Although the spans on which the ST rules have been applied are syntactic in this case, the first two rules have been wrongly chosen, resulting in an invalid output word order. On the other hand, HP has correctly parsed the input and applied appropriate rules, leading to a correct output word order.

Despite the pitfalls of the relaxation method used here, the syntax-based models which are built using the original parse trees suffer from limited translation rule coverage and produce significantly lower results. This suggests that the syntax-based approaches implemented in Moses are not sufficient to fully leverage the syntactic

Table 2.11: Example of output word order of two systems

Source                blocking adult websites
Reference             blocage des sites web pour adultes
HP output             Blocage des sites Web réservés aux adultes
HP rule application

ST output             Adulte de blocage de sites Web
ST rule application

information in translation, and other approaches to using syntax in a less restrictive manner are required. For example, Zhang et al. (2011) use syntax as a soft constraint by augmenting the source side of a string-to-tree model with SAMT-style syntactic labels, instead of first parsing the source side and then relaxing the annotation, and by using a fuzzy rule matching algorithm instead of requiring the source sentence syntax to match the extracted rules. Alternatively, forest-based translation (Mi et al., 2008) can loosen the constraints by providing a large set of parse tree options to the translation rule extractor, which can in turn lead to a bigger rule table. Additionally, while single-tree translation is prone to parsing errors, the availability of a large set of parse trees in forest-based translation can help account for the parsing noise during decoding.

2.7 Summary and Conclusion

We built a number of SMT systems using phrase-based, hierarchical phrase-based and syntax-based methods. We compared these systems both automatically and manually, looking for systematic differences between their outputs which could help

understand how these different methods work and could be used to exploit the benefits of each method in a combined framework.

The results of the various automatic evaluations showed that hierarchical phrase-based models are overall slightly better than the others. One-to-one and sentence-by-sentence comparison and oracle combination of the output of all models showed that the more difficult the translation problem, the more different their output and the greater the gain to be achieved by combining outputs.

Manual analysis of the outputs and of the translation process showed that there was no obvious systematic difference between syntax-based and non-syntax-based modelling, mostly due to the relaxation of syntactic constraints on translation rule extraction. This makes it difficult to find features to be utilized in combining these models, despite the potential gain which was observed in their oracle combination. One way to handle this problem is to use quality estimation to choose the best from among the outputs of all systems for each source segment, provided that reliable estimations exist. To this end, we turn our attention to the quality estimation of machine translation in the next chapter.

Chapter 3

Quality Estimation of Machine Translation

Quality Estimation (QE) of machine translation is the task of measuring the correctness of an MT system's output without any reference translation. The absence of a reference translation is the point at which QE diverges from machine translation evaluation, a more established task in the field of machine translation. A growing amount of research has recently been carried out on QE for MT, with approaches differing with respect to the nature of the quality scores being estimated, the learning algorithms used or the features chosen to represent the translation pairs in the learning framework. The WMT shared tasks on quality estimation (Callison-Burch et al., 2012; Bojar et al., 2013, 2014) have especially boosted this research by making it possible to compare the performance of several different approaches to QE in a unified evaluation framework.

The crucial aspect of quality estimation is the metric by which the quality is measured, i.e. the definition of quality. Much of the previous work in QE for MT has focused on learning to predict human evaluation scores as a measure of quality. Various levels of score granularity have been used, ranging from a simple good/bad or correct/incorrect translation label (Blatz et al., 2004) to more fine-grained 5-grade scores (LDC, 2002; Callison-Burch et al., 2012). Such elaborated scores have been

used to target different aspects of quality such as the amount of effort required for post-editing the translation by a human, or its fluency and adequacy. Fluency captures how well the translation can be read, and adequacy captures how much of the meaning is retained during the translation (Pierce and Carroll, 1966). In addition to such discrete scores, continuous scores, such as the time needed by a human to post-edit (Allen, 2003) the translation or even automatic MT evaluation metric scores (Bojar et al., 2013), have been used to express the translation quality. Moreover, there has been work which compares the output of various translation systems for the same input and estimates a quality ranking, rather than explicitly assigning them a score (Bojar et al., 2013).

The translation quality can be judged at various output levels. One can evaluate every word in the output as being correct or incorrect (Ueffing et al., 2003) or alternatively as requiring a post-editing action such as a deletion or substitution (Bojar et al., 2013). One can also assign the quality score to the output sentence itself instead (Blatz et al., 2004). The highest level of granularity is the document, which can be useful for large-scale commercial applications (Soricut and Echihabi, 2010).

The features used in quality estimation can be categorised into a) source-based, b) target-based, c) source-to-target-based and d) MT-system-based. Each of these feature types aims at capturing translation quality from a different perspective. While source-based features (e.g. sentence length) generally capture the difficulty of the source text for translation, target-based features mainly address the translation quality directly. However, when used together, they can capture the correspondence between the source and target, similar to source-to-target features. MT system features, on the other hand, are used to encode the internal process of the translation and also indicate the confidence of the MT system in producing the translation.

In the rest of the chapter, we first discuss the related work in quality estimation of machine translation. We then describe the two data sets we build for use in the quality estimation experiments throughout the rest of the thesis, starting from the next chapter.

3.1 Related Work

The early work on quality estimation of machine translation can be attributed to Gandrabur and Foster (2003). In a translation prediction task, they estimate a confidence score derived from the conditional probability of the correctness of n-grams (up to 4-grams) in the translation of a sentence. They use various features for capturing the difficulty of the source sentence 1) in general, 2) for translation and 3) for translation using a specific model. These features include n-gram language model perplexity and probability, the number of possible translations per word, and the number of translation hypotheses for the word (by the current model).

Further research targeting quality estimation at the word level was carried out by Ueffing et al. (2003). They built a system which tagged each word in the translation as correct or incorrect with the aim of guiding the post-editing process or interactive translation. This system uses information extracted from the word graph (Ueffing et al., 2002) and the n-best list of MT output to compute word posterior probabilities as quality scores.

Blatz et al. (2004) extend the quality measurement level from word to sentence. They perform a binary classification of translated sentences into correct and incorrect. Correctness is defined based on a threshold on two different automatic MT evaluation metrics, namely WERg and NIST. The data set they use is comprised of about 6500 Chinese sentences paired with each of their 1000-best English translation hypotheses output by a phrase-based machine translation system. The features they use are extracted from the MT model itself, the n-best list output of the model and the source and target sentences. The n-best list features include the rank of the translation hypothesis in the list, the ratio of its score to the best score in the list as well as the average hypothesis length. The source sentence features include its length and n-gram frequency statistics, and the target sentence features include language model scores and word frequencies. They additionally use features based on the source/target correspondence, such as IBM Model 1 probabilities.

Since the WMT 2012 workshop1, a shared task on quality estimation has been organized each year. In the 2012 shared task (Callison-Burch et al., 2012), there were two subtasks, one of which was estimating the required post-editing effort on a 5-point scale, and the other ranking the translations in the test set based on their quality. The task was performed on a data set of 1832 training and 422 test sentences from news text translated from English to Spanish using a Moses model. The post-editing effort for each translation was the average of three scores assigned by three different human evaluators based on the amount of post-editing actions, such as deletion, insertion or substitution, required to correct the translation. The shared task organizers provided a set of 17 features to be used as the baseline. These features mainly concern the surface characteristics of the source and translation and include highly discriminative features such as source sentence length. Table 3.1 lists the features. This feature set built a strong QE system, with only a few submitted systems able to improve over it. These features are simple and shallow in terms of the linguistic information they carry. The best-performing system (Soricut et al., 2012) used system-dependent features extracted from decoder logs, POS tags and pseudo-reference features, and performed an extensive automatic feature selection.

The following shared task in WMT 2013 (Bojar et al., 2013) changed the 5-point scores to human-targeted TER (HTER) scores (Snover et al., 2006), where instead of human evaluators assigning a score from 1 to 5, they minimally post-edited the machine translations. These post-edits were finally used as references against which the original translations were scored using the TER metric. It also introduced new tasks: estimating real-valued post-editing time with a new data set, ranking the outputs of several MT systems for a single source sentence and estimating the translation quality at the word level. The word-level QE included two settings. In one setting, each word in the translation was identified as correct or incorrect. In the other one, the post-editing action required to correct the word, including keep as is, delete

1http://www.statmt.org/wmt12/

Table 3.1: WMT 2012 17 baseline features

1  Number of tokens in the source sentence
2  Number of tokens in the target sentence
3  Average source token length
4  Language model probability of the source sentence
5  Language model probability of the target sentence
6  Type/token ratio: average number of occurrences of the target word within the target hypothesis (averaged for all words in the hypothesis)
7  Average number of translations per source word in the sentence (as given by the IBM 1 table thresholded so that Prob(t|s) > 0.2)
8  Average number of translations per source word in the sentence (as given by the IBM 1 table thresholded so that Prob(t|s) > 0.01), weighted by the inverse frequency of each word in the source corpus
9  Percentage of unigrams in quartile 1 of frequency (lower-frequency words) in a corpus of the source language
10 Percentage of unigrams in quartile 4 of frequency (higher-frequency words) in a corpus of the source language
11 Percentage of bigrams in quartile 1 of frequency of source words in a corpus of the source language
12 Percentage of bigrams in quartile 4 of frequency of source words in a corpus of the source language
13 Percentage of trigrams in quartile 1 of frequency of source words in a corpus of the source language
14 Percentage of trigrams in quartile 4 of frequency of source words in a corpus of the source language
15 Percentage of unigrams in the source sentence seen in a corpus of the source language
16 Number of punctuation marks in the source sentence
17 Number of punctuation marks in the target sentence

and substitute, was predicted. In addition to English-Spanish, this workshop added German-English to the MT system ranking subtask. Compared to 2012, more systems outperformed the baseline. Some of the best-performing systems in the different tasks used some kind of syntactic and semantic features. Many systems performed feature selection to find the most effective features from among what were often large numbers of features. One of the findings of this shared task was that the best systems performed competitively with reference-based evaluation in ranking the MT systems.
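As a rough illustration of the kind of shallow surface features listed in Table 3.1, the sketch below computes a handful of them (token counts, average source token length, the type/token ratio and punctuation counts) from pre-tokenised sentences. The language-model and IBM Model 1 features are omitted, and all names are illustrative rather than taken from any particular toolkit.

# Hedged sketch: a few of the WMT 2012 baseline-style surface features
# (cf. features 1, 2, 3, 6, 16 and 17 in Table 3.1) computed from token lists.
import string
from collections import Counter

def surface_features(source_tokens, target_tokens):
    counts = Counter(target_tokens)
    return {
        'src_num_tokens': len(source_tokens),                                   # feature 1
        'tgt_num_tokens': len(target_tokens),                                   # feature 2
        'src_avg_token_len': (sum(len(t) for t in source_tokens)
                              / max(1, len(source_tokens))),                    # feature 3
        # Feature 6: average number of occurrences of each target word.
        'tgt_type_token': sum(counts.values()) / max(1, len(counts)),
        'src_punct': sum(all(c in string.punctuation for c in t)
                         for t in source_tokens),                               # feature 16
        'tgt_punct': sum(all(c in string.punctuation for c in t)
                         for t in target_tokens),                               # feature 17
    }

# Hypothetical usage:
# feats = surface_features('the cat sat .'.split(), 'le chat était assis .'.split())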

Figure 3.1: Fine-grained word translation error categories of the WMT 2014 QE shared task (adapted from Bojar et al. (2014))

In the latest of these shared tasks (Bojar et al., 2014), the subtask of predicting post-editing effort returned, but with only 3 classes of effort, one of which identifies the translation as a near miss, meaning that it can be fixed by post-editing. Three new language pairs were introduced for this subtask: Spanish-English, English-German and German-English. However, the provided data sets were small in size: the source sides of the training and test data sets comprised only 350 and 150 sentences respectively for the new language pairs, though there were 3 translations per source sentence. The training and test data sets for the existing language pair, English-Spanish, had 954 and 150 sentences respectively with 4 translations per source sentence.

Another change compared to the previous shared task involves the word-level QE task. In this task, there were three different but related quality estimation settings differing in their level of granularity. The first setting was a binary classification of translated words into good or bad translations, similar to the previous shared task. The second setting involved a 3-class classification: 1) good, 2) accuracy error and 3) fluency error. In the third setting, each of the accuracy and fluency error classes was broken down into more fine-grained classes as shown in Figure 3.1. These error types are adapted from MQM (Multidimensional Quality Metrics)2.

Fewer systems participated in this shared task than in 2013. The English-German and German-English language pairs seem to have been easier than the English-Spanish and Spanish-English ones in the post-editing effort prediction task,

2http://www.qt21.eu/launchpad/content/training

as more systems could beat the baseline. In the binary classification setting of the word-level task, only for English-German could a system outperform a baseline which assigned bad, the most common class, to all translated words. In the 3-class setting of this task, no system could beat a baseline assigning the most common class for the English-Spanish and German-English language pairs. On the other hand, in the MQM error classification setting, the baseline ranked the highest only for the German-English language pair. The results show that the more fine-grained the error classes are, the better the systems perform compared to the baseline, probably due to an uneven coarse-grained class distribution in the data.

Various approaches were taken by different systems to learn the estimations. Many systems used QuEst (Specia et al., 2013)3, a system which can be used to extract a wide variety of QE features. Most of the submitted systems used some kind of syntactic features, most of which were extracted from POS tags, such as those based on POS n-gram language models. The system which ranked highest in the majority of tasks and settings (Bicici and Way, 2014) also reports using parse tree structures obtained by CCL (Common Cover Links) among a couple of million features. It is worth mentioning that, in estimating post-editing effort, none of the systems took into account features related to the post-editors, such as the familiarity of the post-editor with the domain, post-editing experience, etc. (De Almeida and O’Brien, 2010), as such information was not made available by the shared task.

In this work, we address quality estimation at the sentence level. We use a variety of quality measures including automatic and human-targeted MT evaluation metrics and 5-grade adequacy and fluency scores. Our focus in terms of the information used for estimation is on the linguistic aspects of the source and target sides of the translation, namely syntax and semantics. Syntax and semantics have been previously used in building quality estimation systems for machine translation by Quirk (2004), Hardmeier et al. (2012), Avramidis (2012), and Pighin and Màrquez (2011) among others. We will elaborate on these works in Chapters 4 and 7.

3http://www.quest.dcs.shef.ac.uk/

3.2 Data

The goal of this work is to address the utility of syntax and semantics in estimating the quality of machine translation of user-generated content (UGC). The user-generated content comes from the Norton English user forum and the translation direction is from English to French. To build a quality estimation system appropriate for this text domain, style and language pair, a data set with similar characteristics and containing human evaluation of its machine translation quality (or alternatively human post-edits) is required. Since no such data set exists, we build one. We select the data from monolingual Norton user forums. The selected segments are machine-translated and the quality scores are obtained both by scoring against their human post-edits, as in the WMT 2013 QE shared task, and by having human evaluators judge the adequacy and fluency of the translations. This data set is called SymForum and is described in Section 3.2.1.

The information we use as clues to machine translation quality in this work is derived from syntactic and semantic analyses of the source and target text. Such analyses of user-generated text and machine translation output are prone to noise. One pitfall of using noisy information extracted from erroneous text is that the conclusions drawn from the results can consequently be inaccurate. On the other hand, this information can be reliably extracted from well-formed text using state-of-the-art syntactic parsers (Collins, 1999; Klein and Manning, 2003; Charniak and Johnson,

2005; Petrov et al., 2006), which achieve above 90 F1 points for English. Therefore, it is reasonable to first apply syntax-based QE to well-formed text and then move to UGC data, even though the challenge of dealing with machine translation output still exists in both cases.

Statistical data-driven parsers are the state of the art in syntactic parsing. Such parsers are mainly tailored to edited newswire text due to the availability of annotated resources for this domain, and their performance deteriorates when applied to other domains (McClosky et al., 2010) and especially to unedited text (Foster, 2010).

Unfortunately, not enough human-evaluated machine-translated data exists for QE in the English-to-French translation direction in this domain. The only available English-to-French data sets which contain human judgements and can be used for quality estimation are as follows:

• CESTA (Hamon et al., 2007), which is selected from the Official Journal of the European Commission and also from the health domain. In addition to the domain (and style) difference to newswire (the domain on which our parsers are trained), a major stumbling block which prevents us from using this data set is its small size: only 1135 segments have been evaluated manually.

• WMT 2007 (Callison-Burch et al., 2007), which contains only 302 distinct source segments (each with approx. 5 translations), only half of which are in the news domain.

• FAUST4, which is out-of-domain and difficult to apply to our setting, as the evaluations and post-edits are user feedback, often in the form of phrases/fragments.

An alternative formulation is to use automatic evaluation metrics instead of human evaluation as the measure of quality to be estimated. Although automatic MT evaluation metrics have been criticised for their low correlation with human scores, they have the advantage of being easier to obtain thanks to the existence of many parallel corpora in a variety of domains. In addition, it has been shown that human judgements are not necessarily consistent (Snover et al., 2006). This replacement enables us to easily build a data set in the same domain as that on which our syntactic parsers are trained. We choose the News development data set released for the WMT 2013 translation task (Bojar et al., 2013)5. We call this data set the News data set and describe it in Section 3.2.2.

4http://www.faust-fp7.eu/faust/Main/DataReleases
5http://www.statmt.org/wmt13

Table 3.2: Characteristics of the source and machine-translated side of SymForum data set

                          English   French
Average sentence length     14.2     15.9
Sentence length SD           9.8     10.8
Type/token ratio            16.9%    16.9%

3.2.1 The SymForum Data Set

We randomly select 4500 segments from a large collection of monolingual English Norton forum text containing 3 million segments6. Each segment is produced by segmenting the content of a forum thread into single sentences using the statistical model included in the NLTK7 toolkit. Since sentence boundaries are naturally ambiguous, especially in the case of user-generated content extracted from HTML sources, some segments do not represent full sentences; they may be truncated or merged. We finally tokenize the segments using our own rule-based tokeniser, which is built and tuned for Norton forum text and Symantec translation memories. In order to be independent of any one translation system, we translate the data set with the following three systems and randomly choose 1500 distinct segments from each:

• ACCEPT8: a phrase-based Moses system trained on Symantec translation memory supplemented with the WMT 2012 releases of Europarl and News Commentary corpora.

• SYSTRAN: a proprietary rule-based system (Enterprise Server 6.8.0 with Symantec domain dictionaries)

• Bing9: an online translation system (used on 24 February, 2014)

6We choose this amount because 1) it is a reasonable size for training and testing a reliable machine learning model and 2) it is affordable in terms of both computational time and human annotation labour.
7http://www.nltk.org/
8http://www.accept.unige.ch/Products/D_4_1_Baseline_MT_systems.pdf
9http://www.bing.com/translator on 24 February, 2013
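As a rough illustration of the preprocessing described above, the sketch below segments a forum thread with NLTK's statistical Punkt model and then tokenizes each segment. The regular-expression tokeniser here is only a simplified stand-in for the rule-based tokeniser tuned to Norton forum text, whose rules are not publicly specified.

```python
import nltk
from nltk.tokenize import RegexpTokenizer

nltk.download("punkt", quiet=True)  # statistical sentence boundary model used by sent_tokenize

# Simplified stand-in for the rule-based tokeniser tuned to Norton forum text.
word_tokenizer = RegexpTokenizer(r"\w+|[^\w\s]")

def segment_and_tokenize(thread_text):
    """Split a forum thread into sentence-like segments, then tokenize each segment."""
    segments = nltk.sent_tokenize(thread_text)  # may truncate or merge noisy UGC sentences
    return [word_tokenizer.tokenize(seg) for seg in segments]

print(segment_and_tokenize("Norton keeps freezing. any fix? thanks!"))
```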

Table 3.2 shows some statistics extracted from the source and target sides of the data set. We measure the translation quality in two ways, described below.10

3.2.1.1 Human-targeted Scores

One method to evaluate the translations is to have them minimally post-edited by humans so that an adequate and fluent translation is achieved. The MT output is then scored against these post-edits as references using an automatic metric. Such scores are known as human-targeted scores and have been shown to correlate better with average human evaluation scores than one human score does with another (Snover et al., 2006). Furthermore, the same study shows that human-targeted metrics also correlate well with reference-based metrics. This is an especially useful feature for us because, in the next chapter, we will import the systems built using reference-based metrics for the News data into this setting. Therefore, the output of the machine translation systems is post-edited by one human post-editor. As the source text is unedited and prone to ungrammaticality (and even incomprehensibility), and also to limit translator subjectivity, we emphasize good enough quality instead of perfect quality to keep the MT assessment as realistic as possible. We define good enough as comprehensible (i.e. the main content of the message can be understood) and accurate (i.e. it communicates the same meaning as the source text), without being stylistically compelling. The text may sound like it was generated by a computer, the syntax might be somewhat unusual and the grammar may not be perfect, but the message is accurate. In other words, only the minimum edits needed to achieve good enough quality should be made. To this end, we ask post-editors to comply with the following guidelines during post-editing:11

10The data set is publicly available at http://www.computing.dcu.ie/mt/confidentmt.html
11The post-editing guidelines are based on the TAUS/CNGL guidelines for achieving “good enough” quality, downloaded from https://evaluation.taus.net/images/stories/guidelines/taus-cngl-machine-translation-postediting-guidelines.pdf. Post-editing is done by a professional translator who is a native French speaker. However, they do not necessarily possess the domain knowledge of Norton products.

Table 3.3: Human-targeted evaluation scores for SymForum data set at the document level, segment level average and standard deviation (SD)

                         1-HTER    HBLEU    HMeteor
Document level           0.6907    0.5577   0.7241
Segment-level average    0.6976    0.5517   0.7221
Segment-level SD         0.2446    0.2927   0.2129

• Aim for semantically correct translation.

• Ensure that no information has been accidentally added or omitted.

• Use as much of the raw MT output as possible.

• Translate appropriately for obvious source misspelling.

• No need to implement corrections that are of a stylistic nature only.

• No need to correct punctuation if it reflects the source punctuation.

We then score each sentence translation against its post-edited version at the segment level using BLEU12, TER13 and Meteor14, and call the resulting scores HBLEU, HTER and HMeteor (human-targeted scores15). Note that since TER scores change in the opposite direction (i.e. the lower the better), we present 1-HTER to be more comparable to BLEU and Meteor. In addition, unlike the other two metrics, TER has no upper bound: scores higher than 1 occur when the number of errors is higher than the segment length. To avoid this, scores higher than 1 are cut off at 1 before being converted to 1-HTER. The document level scores as well as the average scores for the entire data set, together with their standard deviations, are presented in Table 3.3. According to the scores, the highest standard deviation belongs to the HBLEU scores, while the HMeteor scores have the lowest standard deviation. We draw the score distribution histograms for these three metrics in Figure 3.2.

12Version 13a of the MTEval script was used at segment level.
13TERCOM 0.7.25: http://www.cs.umd.edu/~snover/tercom/
14Meteor 1.4: http://www.cs.cmu.edu/~alavie/METEOR/
15It should be noted that the original notion of human-targeted scores introduced by Snover et al. (2006) assumes that the post-editor is presented with a reference translation, which is not the case in this work.
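As a small illustration of the clipping and conversion just described (a restatement of the rule, not code from the thesis):

```python
def to_one_minus_hter(hter):
    """Clip HTER at 1 (it has no upper bound), then flip it so that higher is better."""
    return 1.0 - min(hter, 1.0)

# A 10-token segment needing 17 edits has HTER 1.7, which is clipped before conversion.
assert to_one_minus_hter(1.7) == 0.0
assert to_one_minus_hter(0.25) == 0.75
```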

Figure 3.2: Human-targeted score distribution histograms of the SymForum data set: (a) HTER, (b) HBLEU, (c) HMeteor

Table 3.4: Adequacy and fluency score interpretation

      Adequacy           Fluency
5     All meaning        Flawless Language
4     Most of meaning    Good Language
3     Much of meaning    Non-native Language
2     Little meaning     Disfluent Language
1     None of meaning    Incomprehensible

The HTER and HMeteor scores are distributed similarly, which is very different from the HBLEU score distribution, which tends to be even across the score bins. More than half of the HMeteor scores fall in the highest bin, suggesting that it is a lenient metric.

3.2.1.2 Adequacy and Fluency Scores

An alternative way of obtaining human judgements of the quality of an MT output is to ask human evaluators to assign it a score indicating a quality measure such as required post-editing cost or adequacy/fluency. We choose the second measure here as it is closer to the purpose of our study, which is the use of syntax and semantics in estimating quality. Adequacy measures how much of the meaning of the source is conveyed in the MT output, and fluency measures how fluent the translation is. We asked three professional human translators, who were native French speakers, to assess the quality of the MT output in terms of adequacy and fluency on a 5-grade scale. This scoring scheme is adapted from LDC (2002) and the interpretation of the scores is given in Table 3.4. Each evaluator was given the entire data set for evaluation. We therefore collected three sets of scores and averaged them to obtain the final score for each MT output sentence.

Table 3.5: Manual evaluation scores for SymForum data set (segment level average and standard deviation (SD))

                         Adequacy   Fluency
Segment-level average     0.6230    0.4096
Segment-level SD          0.2488    0.2780

Similar to the human-targeted scores, we plot the score distribution histograms of each metric for each of the annotators, and of their average, in Figure 3.3. The figures show that the adequacy scores tend to be higher than the fluency scores according to all evaluators. While the first evaluator assigns the middle adequacy score of 3 to the majority of translations, the other two use the highest score of 5 more often than any other. Consistently, translations not conveying any of the source sentence meaning, i.e. adequacy score 1, make up the smallest fraction of the scores. This is reflected in the average adequacy scores (Figure 3.3d.1), where there is only one score less than or equal to 1.

In terms of fluency, the first and third evaluators judge the majority of the translations to be at score 2, while the second assigns the middle score of 3 to most of them. Interestingly, the first evaluator does not give a fluency score of 3 to any translation. Additionally, while there are more fluency scores of 1 overall than scores of 4 and 5, once the scores are averaged only a small number of 1s remain. This suggests that the evaluators largely disagree on the incomprehensibility of the translations. The consensus between the annotators is further examined later in this section by calculating the inter-annotator agreement.

In terms of average scores (Figure 3.3d), the adequacy scores are more evenly distributed than the fluency scores. Most of the translations are of above-average adequacy, whereas their fluency is mostly below average. In order to be easily comparable to the human-targeted scores, we scale these scores to the [0,1] range, i.e. adequacy/fluency scores of 1 and 5 are mapped to 0 and 1 respectively and all the scores in between are scaled accordingly. The averages of these scores for the entire data set, together with their standard deviations, are presented in Table 3.5.
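A minimal sketch of this averaging and rescaling step (the helper names and example values are illustrative, not taken from the data set):

```python
def rescale(score, low=1, high=5):
    """Map a score on the [low, high] evaluation scale onto [0, 1]."""
    return (score - low) / (high - low)

def average_and_rescale(evaluator_scores):
    """Average the three evaluators' 1-5 scores per segment, then rescale to [0, 1]."""
    return [rescale(sum(triple) / len(triple)) for triple in zip(*evaluator_scores)]

# Three evaluators' adequacy scores for two segments.
print(average_and_rescale([[5, 3], [4, 2], [4, 1]]))  # -> [0.8333..., 0.25]
```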

Figure 3.3: Adequacy and fluency score distribution histograms of the SymForum data set, for each evaluator and for their average. Panels: (a) Evaluator 1, (b) Evaluator 2, (c) Evaluator 3, (d) average scores; each with (.1) adequacy and (.2) fluency.

Table 3.6: Pearson r between pairs of metrics on the entire 4.5K data set

           1-HTER   HBLEU    HMeteor   Adequacy   Fluency
1-HTER        -        -        -          -         -
HBLEU      0.9111      -        -          -         -
HMeteor    0.9207   0.9314      -          -         -
Adequacy   0.6632   0.7049   0.6843        -         -
Fluency    0.6447   0.7213   0.6652     0.8824       -

The average weighted Kappa inter-annotator agreement is 0.65 for the adequacy scores and 0.63 for the fluency scores. We use weighted Kappa instead of plain Kappa to account for close evaluation scores: the difference between scores of 5 and 4 is not the same as the difference between 5 and 2, yet plain Kappa regards both as simple disagreement, whereas weighted Kappa can account for the closeness of the scores. Specifically, we consider two scores differing by 1 as 75% agreement instead of 100%; all other differences are considered complete disagreement. Though this still seems strict, the weighted Kappa values are in the substantial agreement range.

Once we have both human-targeted and manual evaluation scores, it is interesting to know how they are correlated. We calculate the Pearson correlation coefficient r between each pair of the five metrics and present the results in Table 3.6. Interestingly, unlike what is generally expected, HBLEU has the highest correlation with both adequacy and fluency scores among the human-targeted metrics, while HTER has the lowest. Moreover, HBLEU is more correlated with fluency than with adequacy, which is the opposite of HMeteor; this is expected given the definitions of BLEU and Meteor. A high correlation can be seen among the human-targeted scores and between the manual evaluation scores, and interestingly the correlations among the former are higher (> 0.90). Although the high correlation between adequacy and fluency could partially be due to both scores coming from the same evaluator (albeit in each of the three evaluation rounds), it indicates that if either the fluency or the adequacy of the MT output is low (or high), the other tends to be low (or high) as well.
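A minimal sketch of the agreement computation under the weighting scheme described above (identical scores count fully, scores differing by 1 count as 75% agreement, anything else as full disagreement). The function below handles one annotator pair; averaging it over the three pairs, as assumed here, would give an average figure of the kind reported.

```python
import numpy as np

def weighted_kappa(scores_a, scores_b, categories=(1, 2, 3, 4, 5)):
    """Weighted Cohen's kappa with a custom disagreement matrix."""
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}

    # Disagreement weights: 0 = full agreement, 1 = full disagreement.
    w = np.ones((k, k))
    np.fill_diagonal(w, 0.0)
    for i in range(k - 1):
        w[i, i + 1] = w[i + 1, i] = 0.25  # scores differing by 1 -> 75% agreement

    # Observed joint distribution of the two annotators' scores.
    observed = np.zeros((k, k))
    for a, b in zip(scores_a, scores_b):
        observed[idx[a], idx[b]] += 1
    observed /= observed.sum()

    # Expected joint distribution under independence of the two annotators.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    return 1.0 - (w * observed).sum() / (w * expected).sum()

print(weighted_kappa([5, 4, 3, 2, 5, 1], [4, 4, 3, 1, 5, 3]))
```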

Finally, the data set is randomly split into 3000 training, 500 development and 1000 test segments. We use the development set for tuning model parameters and building hand-crafted feature sets, and the test set for testing model performance and analysis purposes.

3.2.2 The News Data Set

To be as comparable as possible, the News data set is created in a similar way to the SymForum data set. We randomly select 4500 parallel segments from the WMT13 News development data set. These segments are translated using the same systems described in Section 3.2.1, but with different settings, as follows:

• ACCEPT: trained on a parallel corpus created using the translations by Translators without Borders16 and supplemented with the WMT 2012 releases of Europarl and News Commentary corpora.

• SYSTRAN: Enterprise Server 6.8.0 without Symantec domain dictionaries

• Bing: on 11 March, 2013

Similar to the SymForum data set, 1500 distinct segments from the output of each system are randomly selected to build the final set. However, unlike the SymForum data set, these translations are scored against the pre-translated references in the target side of the corpus instead of against post-edits. The scoring is done at the segment level using BLEU, TER and Meteor with the same settings as for the SymForum data set. Note that, as for the SymForum data set, TER scores are cut off at 1 and converted to 1-TER. Table 3.7 shows the document level scores and the average and standard deviation of the segment level scores. In contrast to the SymForum data set, BLEU has the lowest standard deviation (slightly lower than Meteor) and 1-TER the highest. We plot the histograms of the score distributions in Figure 3.4.

16http://translatorswithoutborders.org/

Table 3.7: Automatic evaluation scores for News data set at document level, segment level average and standard deviation (SD)

                        1-TER    BLEU     Meteor
Document level          0.4179   0.2577   0.4779
Segment level average   0.4087   0.2335   0.4707
Segment level SD        0.1951   0.1610   0.1655

Figure 3.4: Score distribution histograms of the News data set: (a) 1-TER, (b) BLEU, (c) Meteor

Though not directly comparable, there is a remarkable difference between the distribution of scores in these figures and those of the same metrics for the SymForum data set in Figure 3.2. The most obvious difference is that the human-targeted metrics had a strong tendency to assign high scores to the translations in that data set, compared to their automatic counterparts here. This is expected because the reference translations for that data set are obtained by post-editing and are therefore much closer to the MT output. The majority of BLEU scores are lower than 0.5, while the opposite holds for Meteor, showing that BLEU is a stricter metric than Meteor. 1-TER scores, on the other hand, are distributed more evenly around the average, with a slight tendency to fall above it; this is also reflected in the average scores in Table 3.7. Interestingly, BLEU identifies only 40 translations as perfect. Moreover, BLEU scores show less diversity than those of the other two metrics. Overall, considering also the distributions of the scores in the SymForum data set, Meteor appears to be the most lenient metric among those used here.

Similar to the SymForum data set, we extract statistics from the source and translation sides of the News data set, presented in Table 3.8. Comparing the figures to those in Table 3.2, it can be seen that the News sentences are on average longer than the SymForum sentences (by around 8 words).

Table 3.8: Characteristics of the source and machine-translated side of News data set

                          English   French
Average sentence length     22.2     24.2
Sentence length SD          12.3     13.3
Type/token ratio            21.3%    20.7%

Furthermore, the higher type/token ratio for the News data set indicates a higher level of lexical variation in this data set compared to SymForum. Finally, analogous to the SymForum data set, the data set is randomly split into 3000 training, 500 development and 1000 test segments.

3.3 Summary and Conclusion

We introduced the task of quality estimation for machine translation, QE, which has recently received considerable attention from machine translation researchers. We explored variations of the task differing in terms of translation granularity and quality measure. Most of the work has been done on estimating translation quality at the sentence level, although the word level and document level have also been addressed. Various measures have been used to define translation quality, including both automatic and manual, discrete and continuous, and targeting adequacy, fluency, post-editing time, effort and action. We also surveyed some of the research carried out in this area, especially the quality estimation shared tasks of the WMT workshop. A wide variety of features has been used in building the quality estimation systems submitted to these shared tasks. Most of the features capture the surface characteristics of the source and target and use language modelling resources. Part-of-speech tags are the most popular syntactic features utilized by such systems.

Quality estimation data sets built upon human evaluation or post-edits are limited to only a few language pairs and domains and are also relatively small in size. For example, there is no reasonably sized data set for our language pair, text domain

and style of interest, which are English-French, technical security software support and user-generated forum text respectively. Due to its importance to the project, we built such a data set. We used three different machine translation systems and two different evaluation strategies, one based on automatically scoring the translations against their human post-edits (human-targeted scoring) and the other based on human evaluators judging their adequacy and fluency.

The goal of this work is to use information derived from syntactic and semantic analyses of the source and target. The tools used to obtain such analyses are built for edited newswire text, which is out-of-domain and out-of-style for our data set. To prevent the resulting noise from affecting the conclusions drawn about the usefulness of such information for quality estimation, we built another data set selected from newswire. However, for this data set we relied on automatic metrics as quality indicators instead of human evaluation, as they are readily accessible.

We offered several types of analysis of the quality scores obtained using these scoring methods for each data set. We calculated the correlation between the different metric scores on the SymForum data set and found that the human-targeted scores correlate best with each other (Pearson r above 0.90) and that adequacy and fluency correlate strongly with each other (Pearson r just below 0.90). However, the correlation between these two types of metric is lower (mostly in the 0.60s).

We will use both the SymForum and News data sets in Chapter 4, where we investigate the use of syntax in quality estimation, and in Chapter 5, where we analyse syntax-based QE in terms of the effect of parsing accuracy and the role of source and target syntax. We will also use the SymForum data set in Chapter 7, where we use semantics in quality estimation.

Chapter 4

Syntax-based Quality Estimation

It is reasonable to assume that syntactic information is useful in quality estimation of machine translation as a way of capturing the syntactic complexity of the source sentence, the grammaticality of the target translation and the syntactic symmetry between the source sentence and its translation. This assumption has been borne out by previous research which has demonstrated the usefulness of syntactic features for quality estimation of English-Spanish machine translation (Hardmeier et al., 2012; Avramidis, 2012; Rubino et al., 2012). Inspired by this, we design and experiment with quality estimation systems which use syntactic information extracted from both the translation source and target.

Syntactic information can be derived from various grammar formalisms, each emphasising a different syntactic aspect of language. Constituency or phrase structure grammar is a well-studied syntactic formalism, which recursively decomposes a sentence into its constituents or phrases, in a tree structure. Dependency grammar is another formalism, which captures the relations between the words in a sentence; these relations can also be represented in a tree structure. We choose to use constituency and dependency syntax in these experiments as they have been shown to be useful in many tasks, including quality estimation, and there are several resources and tools available for them.

We examine two different methods of encoding syntactic information in quality estimation, as well as their combination: tree kernels and hand-crafted features. Tree kernels (Collins and Duffy, 2002; Moschitti, 2006) automatically extract the useful information in the syntactic trees and use it in learning the underlying task. They therefore eliminate the need for feature engineering, which is the common practice in classic machine learning methods. Hand-crafted features, on the other hand, must be designed and tuned before being used by the learning algorithm. However, they offer more flexibility in deciding what information is included in or excluded from the learning model, as well as higher computational efficiency due to a smaller feature space. These two methods can also be combined in a unified framework, which has been found useful (Moschitti, 2006).

The experiments are first carried out using the News data set described in the previous chapter, to exclude from the conclusions the additional level of noise introduced by parsing Norton forum text, which is out-of-domain with respect to the corpora on which the parsers are trained. The same experiments are then replicated with the SymForum data set, which is in the target domain of this thesis.

The rest of this chapter is laid out as follows: we first review previous work using syntax in quality estimation of machine translation in Section 4.1, and in Sections 4.2 and 4.3 we describe the syntax-based QE experiments with the News and SymForum data sets respectively. Each section includes experiments using tree kernels, hand-crafted features and their combination, all compared with baseline systems.1

4.1 Related Work

Features extracted from parser output have been used before in training QE systems for MT. Quirk (2004) uses a single syntax-based feature which indicates whether a full parse of a sentence could be found. The parser generates the logical form

1For easy comparison, all the experimental results in this chapter are also presented in Tables A.1 and A.2 in Appendix A, beside other QE results, for the SymForum and News data sets respectively.

representation (LF) of the sentence and is applied only to the source text to capture its difficulty of translation. LF is chosen as the syntactic representation because the MT system evaluated in these experiments is built upon this formalism. The feature is combined with other non-syntax-based features to classify a translation as high or low quality. Various learning algorithms including perceptrons, support vector machines, decision trees and linear regression are used, among which the best performance was achieved by the linear regression system. Although feature selection is carried out for this algorithm, it is not mentioned whether this syntactic feature is selected in the final feature set.

Hardmeier et al. (2012) employ tree kernels to predict the 1-to-5 post-editing cost of a machine-translated sentence in the WMT 2012 QE shared task setup (Callison-Burch et al., 2012). They use tree kernels derived from the syntactic constituency and dependency trees of the source side (English). However, they only use dependency trees of the translation (Spanish) because proper constituency parsing models were not available for this language. They use tree kernels in the hope that they help quality estimation by 1) capturing structures in the parse tree of the source sentence which are difficult to translate and 2) identifying uncertain constructions in the parse tree of the translated sentence. In order to be used with tree kernels, the dependency trees are converted so that the labels are moved from the edges to separate nodes. The converted trees contain word forms, dependency relations and also POS tags. They use subset tree kernels (Collins and Duffy, 2002) for constituency trees and partial tree kernels (Moschitti, 2006) for dependency trees.2 When used alone, the syntactic tree kernels cannot outperform the baseline, which consists of the 17 features introduced in Section 3.1 of Chapter 3. When combined with the baseline features, they are able to reduce the prediction error by around 7%. However, adding more features, mainly adapted from the work of Specia et al. (2009), to this combination does not bring additional information when evaluated on the test set.

2Unlike subset tree kernels, partial tree kernels allow the right hand side of a production rule to be split.

They also note that the tree kernels receive a lower weight from the learning algorithm compared to the other features. This system nevertheless ranked second in the shared task.

Rubino et al. (2012) employ a variety of syntactic features in their QE system for predicting the 1-to-5 post-editing cost of the MT output. These features are mainly those previously used by Wagner et al. (2009) to distinguish between grammatical and ungrammatical English sentences and consist of three types of features: 1) POS tag n-gram frequencies in a reference corpus, 2) the output of a rule-based LFG parser (Maxwell and Kaplan, 1996) which uses a hand-crafted broad-coverage grammar of English (Butt et al., 2002), and 3) the output of a statistical constituency parser (Charniak and Johnson, 2005) trained on three different corpora: the Wall Street Journal (WSJ) section of the Penn Treebank (Marcus et al., 1993), a version of the WSJ corpus which has been automatically distorted by inserting errors into the sentences, and their union. The LFG parser features include whether or not the sentence could be parsed, the number of possible parses and the parsing time. The constituency parser features include the difference in parser log probability between the trees output by the three statistical parsing models and structural differences between the trees measured using various parser evaluation metrics. All three sets of features are extracted from the source side. From the target side, however, only the POS tag features are extracted, accompanied by other features calculating the frequency of each specific POS tag in the translated segment, the proportion of target-side words assigned more than one tag and the proportion of those unknown to the tagger. The last set of syntactic features is extracted from the output of LanguageTool3, which checks the grammar and also the style of the input for errors using a set of predefined rules4. These features are used alone and in combination with other types of features in a variety of machine learning settings. Although the syntax-based features cannot outperform the 17 baseline features of the WMT 2012 QE shared task, they achieve the best performance when combined

3http://languagetool.org/ 4http://languagetool.org/languages

together in some of the settings. The syntactic features are also present in all other best performing settings.

Specia and Giménez (2010) argue that quality estimation and reference-based evaluation of machine translation are complementary and attempt to combine them to improve the correlation of the latter with human judgements at the segment level. In their QE system, they use POS tag 3-gram language model probabilities extracted from the MT output as features. Note that their system contains reference-based metrics which use features built upon POS tags, syntactic chunks and dependency and constituency structure to build linguistically-informed automatic MT evaluation metrics. These metrics are included in the Asiya toolkit (Giménez and Màrquez, 2010). The contribution of the syntactic features to the complete system is not examined.

Avramidis (2012) builds a series of classification and regression models for estimating post-editing effort using syntactic features in combination with other features. These syntactic features include the constituency parse log-likelihood and confidence, the parse n-best list size, the average confidence of the n-best parse trees and the frequency of each particular syntactic label. These features are extracted from the source and target text, as well as the ratios of their values. Language quality scores produced by

LanguageTool are also used, which include grammar checks of the source and translation segments. All the features undergo a variety of feature selection processes resulting in several feature sets. The syntactic features appearing in the two best feature sets they report include a few features from the grammar checking tool plus those extracted from the constituency parses. The constituency parse features include only the frequency of some syntactic labels such as S, CC and NN.

In a similar vein, Gamon et al. (2005) train a classifier to distinguish between human and machine translation. They use the class probability output by this classifier as a quality indicator for the MT output: the more likely it is to be a human translation, the higher its quality. To build the classifier, they use features extracted from the output of the French NLPWin (Heidorn, 2000) system. NLPWin

is a linguistic analysis tool which provides syntactic and semantic analyses of its input and has been used as the natural language processor in Microsoft products such as Word and in natural language query interfaces. The features are extracted by applying the tool to a mixture of machine and human translations and include POS tag trigrams, context-free grammar (CFG) production rules and semantically motivated syntactic features such as definiteness. The top 2000 features according to their log likelihood ratio with respect to the class label are chosen. The quality scores (class probabilities) obtained by this classifier achieve low correlations with fluency and adequacy scores (0.09 and 0.12 respectively) on their test set. They multiply these scores by language model perplexity scores for each sentence and achieve higher correlations of 0.37 and 0.42 with the fluency and adequacy scores respectively.

Syntax has also been widely used in reference-based evaluation of machine translation. Liu and Gildea (2005) develop a number of syntax-based metrics with the main goal of capturing the fluency of translations. STM (subtree metric) and TKM (tree kernel metric) are based on the constituency parse trees of the reference and the hypothesis. The former is inspired by BLEU but counts common subtrees of fixed depth instead of n-grams. TKM extends STM to include all possible subtrees using convolution kernels (Collins and Duffy, 2002). They also build HWCM, a metric based on headword chains derived from the dependency parse trees of the reference and the hypothesis. This metric also works in a similar way to BLEU but uses headword chain n-grams instead of word n-grams. At the sentence level, all the metrics except TKM outperform BLEU in terms of correlation with 1-to-5 scale human judgements of fluency. HWCM also correlates better than BLEU with the overall quality as evaluated by humans.

Giménez and Màrquez (2007) argue that the lexical overlap between the reference and hypothesis used by many metrics is neither sufficient nor necessary to judge the quality of a translation. Instead, they suggest that the similarity should be measured at a more abstract linguistic level. They define linguistic elements as opposed to

lexical elements (word forms), which can capture different kinds of linguistic phenomena at different levels of granularity. Considering syntax as one such linguistic phenomenon, they develop and compare several metrics using linguistic elements at the POS tagging and phrase chunking levels, as well as those adapted from other syntax-based metrics including HWCM and STM described above. They find that metrics based on deep syntactic structure, such as the HWCM- and STM-based ones, correlate better with human evaluation scores than lexical metrics such as BLEU and Meteor. However, those based on shallow syntax (e.g. POS tagging and phrase chunking) perform at the same level as or lower than the lexical metrics.

Owczarzak (2008) uses LFG dependencies of the reference and hypothesis to develop another syntax-based evaluation metric. This metric beats all other automatic metrics such as BLEU, TER and Meteor, as well as HWCM (Liu and Gildea, 2005), in terms of correlation with human fluency judgements at the segment level but not at the system level. Owczarzak also finds that the quality of the MT output is not correlated with the accuracy of the metric, concluding that parsing ill-formed MT output does not negatively influence the correlation of the metric with human scores.

More recently, Yu et al. (2014) propose a metric called RED which requires only the dependency parses of the reference segment, to avoid the noise associated with parsing machine translation output. The metric uses two different dependency structures, one based on the headword chains of Liu and Gildea (2005) and the other based on fixed-floating structures (Shen et al., 2010). Headword chains capture long-distance dependencies, while fixed-floating structures represent local continuous n-grams, where fixed structures consist of the subtrees under the root node and floating structures consist of contiguous siblings of a common head. Both fixed and floating structures must be complete constituents. To compute the score without having the dependency tree of the translation hypothesis, the metric extracts the word sequences from each headword chain and fixed-floating subtree, and searches the translation for those n-grams. In the case of headword chains, only the order of the

words in the sequence is important, while for the fixed-floating subtrees they need to be continuous in the translation. They also augment the metric with external resources to account for stem, synonym and paraphrase matches, in a similar way to Meteor. RED outperforms BLEU, TER and HWCM at both system and sentence level translation ranking for most language pairs in their experiments. The augmented metric (REDp), with parameter tuning, can also outperform Meteor at the system level and achieves comparable performance at the sentence level.

4.2 Syntax-based QE with News Data Set

The syntactic structure of a sentence can be represented in various linguistic formalisms, which can be roughly categorized as based either on constituency or phrase-structure grammar, or on dependency grammar. In constituency grammar, the structure of a sentence is recursively constructed from phrases as constituents in the form of a tree, while in dependency grammar it is expressed as the relations between words. However, these two types of structure are related to each other, as the dependency structure of a sentence can be deterministically derived from its constituency structure. In this chapter, we use the syntactic information of the source and its translation encoded in these two formalisms to build quality estimation systems.

The common way to utilise such information in quality estimation is to encode it as features in a machine learning framework. These features capture the particularities of the syntactic structure of the sentence represented by the constituency and dependency trees. A set of features is designed through a feature engineering process which identifies candidate features and determines which of them are most useful in predicting the quality of a translation.

Another method is to use these trees directly in a tree kernel framework (Collins and Duffy, 2002; Moschitti, 2006). This approach allows exponentially sized feature spaces (e.g. all subtrees of a tree) to be efficiently modelled and has been shown

to be effective in many natural language processing tasks, including parsing and named entity recognition (Collins and Duffy, 2002), semantic role labelling (Moschitti, 2006), (Wiegand and Klakow, 2010) and quality estimation of MT (Hardmeier et al., 2012). Although there can be overlaps between the information captured by these two methods, each of them can capture information that the other cannot. In addition, while tree kernels involve minimal feature engineering, hand-crafted features offer more flexibility and better computational efficiency (albeit depending on the size of the feature set). Moschitti (2006) shows that combining the two approaches is beneficial. We use both hand-crafted features and tree kernels, separately and combined.

For parsing the data into constituency structures, we use the Lorg parsing system also used in the SMT experiments described in Section 2.3 of Chapter 2. We train the English parser on the training section of the Wall Street Journal (WSJ) portion of the Penn Treebank (PTB) (Marcus et al., 1993). The French parser is trained on the training section of the French Treebank (FTB) (Abeillé et al., 2003).

Dependency parses can be obtained either by directly parsing the data using a dependency parser (Nivre et al., 2006; McDonald et al., 2006) or, alternatively, by converting the constituency parses. In the former case, the parser needs to be trained on a dependency treebank; since manually built dependency treebanks are not available for English and French, these are obtained by automatic conversion from constituency treebanks. Therefore, since we already have the constituency parses of our data available, we choose the second method. We convert the English constituency parses using the Stanford converter (de Marneffe and Manning, 2008) and the French parses using Const2Dep (Candito et al., 2010).

In the experiments, we evaluate the performance of the QE models using two measures: Root Mean Square Error (RMSE) for how accurate the predictions are and the Pearson correlation coefficient (r) for how well the model differentiates

between the quality of different translations.5 To compute the statistical significance of the performance differences between QE models, we use a form of paired bootstrap resampling (Efron and Tibshirani, 1993). In particular, we randomly resample (with replacement) a set of N instances from the predictions of the two systems, where N is the size of the evaluation set. We repeat this sampling N times and count the number of times each of the two settings is better in terms of each measure (RMSE and Pearson r). If a setting is better more than 95% of the time, we consider the difference statistically significant at p < 0.05. In the following sections, we first describe our baseline systems and then the quality estimation systems built using tree kernels, hand-crafted features and a combination of both.
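A minimal sketch of the significance test described above, assuming the two systems' predictions and the gold scores are available as arrays (the helper is illustrative, not the exact script used in the thesis):

```python
import numpy as np

def paired_bootstrap(pred_a, pred_b, gold, n_resamples=None, seed=0):
    """Paired bootstrap resampling over two systems' predictions.

    Returns the fraction of resamples in which system A beats system B on RMSE
    and on Pearson r; a fraction above 0.95 is taken as significance at p < 0.05.
    `n_resamples` defaults to the evaluation set size N, as described in the text.
    """
    rng = np.random.default_rng(seed)
    pred_a, pred_b, gold = map(np.asarray, (pred_a, pred_b, gold))
    n = len(gold)
    n_resamples = n_resamples or n

    def rmse(pred, ref):
        return np.sqrt(np.mean((pred - ref) ** 2))

    def pearson(pred, ref):
        return np.corrcoef(pred, ref)[0, 1]

    wins_rmse = wins_r = 0
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # sample N instances with replacement
        if rmse(pred_a[idx], gold[idx]) < rmse(pred_b[idx], gold[idx]):
            wins_rmse += 1
        if pearson(pred_a[idx], gold[idx]) > pearson(pred_b[idx], gold[idx]):
            wins_r += 1
    return wins_rmse / n_resamples, wins_r / n_resamples
```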

4.2.1 Baseline QE Systems

In order to verify the performance of syntax-based QE, we build two baselines. The first baseline (B-Mean) predicts the mean of the segment level evaluation scores in the training set for all instances. The second baseline (B-WMT) uses the 17 baseline features of the WMT 2012 QE shared task described in Section 3.1 of Chapter 3. B-WMT is considered a strong baseline, as the system that used only these features ranked higher than many of the participating systems.

We use support vector regression implemented in the SVMLight toolkit6 to build B-WMT. The Radial Basis Function (RBF) kernel is used based on the results of preliminary experiments. The results for both baselines are presented in Table 4.1.
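For illustration, a comparable baseline regressor can be set up as follows. This sketch uses scikit-learn's SVR as a stand-in for the SVMLight toolkit actually used, and the feature matrices are random placeholders for the 17 WMT baseline features; the small parameter grid is likewise assumed rather than the one tuned in the thesis.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Placeholders for the 17 WMT baseline features and segment-level metric scores.
X_train, y_train = rng.random((3000, 17)), rng.random(3000)
X_dev, y_dev = rng.random((500, 17)), rng.random(500)

best = None
for C in (1, 10, 100):              # tune RBF-SVR hyperparameters on the development set
    for gamma in (0.01, 0.1, 1.0):
        model = SVR(kernel="rbf", C=C, gamma=gamma).fit(X_train, y_train)
        rmse = np.sqrt(mean_squared_error(y_dev, model.predict(X_dev)))
        if best is None or rmse < best[0]:
            best = (rmse, C, gamma)

print("dev RMSE %.4f with C=%s, gamma=%s" % best)
```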

As expected, the B-WMT baseline is statistically significantly better than the mean baseline. TER scores seem to be the hardest to predict and Meteor scores the easiest. A notable correlation can be seen between the RMSE scores for the three metrics and the standard deviations of these metrics presented in Table 3.7 of Chapter 3.

5The experiments show that neither of these measures is enough on its own to judge the performance.
6http://svmlight.joachims.org/

Table 4.1: Baseline system performances measured by Root Mean Square Error (RMSE) and Pearson correlation coefficient (r)

              BLEU               1-TER              Meteor
           RMSE      r        RMSE      r        RMSE      r
B-Mean    0.1626     -       0.1965     -       0.1657     -
B-WMT     0.1601   0.1766    0.1949   0.1565    0.1625   0.2047

BLEU scores have the lowest standard deviation in the data set and the best RMSE is associated with their prediction. In contrast, TER scores have the highest standard deviation and their prediction achieves the worst RMSE. Overall, we can see a low Pearson correlation for all metrics using the B-WMT baseline, which indicates the difficulty of the task.

4.2.2 Syntax-based QE with Tree Kernels

Tree kernels are kernel functions that compute the similarity between two instances of data represented as trees, based on the number of fragments the trees have in common. The need for explicitly encoding an instance in terms of manually designed and extracted features is therefore eliminated, while benefiting from a very high-dimensional feature space. Moschitti (2006) introduces an efficient implementation of tree kernels within a support vector machine framework: instead of extracting all possible tree fragments, the algorithm compares only tree fragments rooted in two similar nodes. This algorithm is made available through the SVMLight-TK software7, which is used in this work.

In order to extract tree kernels from dependency trees, the labels on the arcs must be removed. Following Tu et al. (2012), the nodes in the resulting tree representation are word forms and dependency relations.8 An example is shown in Figure 4.1. A word is a child of its dependency relation to its head; the dependency relation is in turn the child of the head word. This continues up to the root of the tree.

7http://disi.unitn.it/moschitti/Tree-Kernel.htm
8They also examined adding POS tags to this structure. However, the new information did not prove useful. They suggest that this information is already captured by the dependency representation.

(root
  (came
    (cc And)
    (advmod then)
    (nsubj (era (det the) (amod American)))
    (punct .)))

Figure 4.1: Tree Kernel Representation of Dependency Structure for And then the American era came.
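A minimal sketch of this conversion, assuming a toy (id, form, head, relation) token list as input; the bracketed output format is an assumption about what a tree-kernel toolkit such as SVMLight-TK expects, not its documented specification.

```python
def to_kernel_tree(tokens):
    """Convert (id, form, head, relation) tuples (head 0 = dummy root) into the nested
    representation where each word hangs under its dependency relation, which in turn
    hangs under its head word."""
    forms = {tid: form for tid, form, _, _ in tokens}
    children = {}
    for tid, _, head, rel in tokens:
        children.setdefault(head, []).append((rel, tid))

    def subtree(tid):
        word = forms[tid]
        kids = children.get(tid, [])
        if not kids:
            return word
        inner = " ".join("(%s %s)" % (rel, subtree(child)) for rel, child in kids)
        return "(%s %s)" % (word, inner)

    (root_rel, root_id), = children[0]  # exactly one dependent of the dummy root
    return "(%s %s)" % (root_rel, subtree(root_id))

sentence = [
    (1, "And", 6, "cc"), (2, "then", 6, "advmod"), (3, "the", 5, "det"),
    (4, "American", 5, "amod"), (5, "era", 6, "nsubj"), (6, "came", 0, "root"),
    (7, ".", 6, "punct"),
]
print(to_kernel_tree(sentence))
# (root (came (cc And) (advmod then) (nsubj (era (det the) (amod American))) (punct .)))
```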

Based on preliminary experiments on our development set, we use subset tree kernels (Moschitti, 2006). Unlike subtree kernels, subset tree kernels also allow tree fragments whose leaves are non-terminal categories. Additionally, we use the sequential summation of the kernels computed between the corresponding trees of two instances. In other words, if two training or test instances contain n trees each, the combined kernel value is computed as follows:

$$K_T(I_1, I_2) = \sum_{i=1}^{n} k(T_{1i}, T_{2i})$$

where $I_1$ and $I_2$ are the instances, $T_{1i}$ and $T_{2i}$ are the $i$th trees of $I_1$ and $I_2$ respectively, $k(T_{1i}, T_{2i})$ is the kernel between these two trees and $K_T$ is the combined kernel. We tune the C parameter for Pearson r on the development set; all the other parameters are left at their default values.

We build a system with all four parse trees for every training instance, that is, the constituency and dependency parse trees of the source and target text. In other words, each training (and test) instance contains four parse trees presented sequentially to the learning algorithm. Table 4.2 shows the performance of this system, which is named SyTK. The B-WMT baseline is also presented in this table for comparison (in grey). As can be seen, SyTK substantially outperforms the baseline, and all the differences are statistically significant.

In order to examine their complementarity, we combine these tree kernels with the baseline features of B-WMT.

69 Table 4.2: QE performance with syntactic tree kernels SyTK and their combination with WMT baseline features B+SyTK.

              BLEU               1-TER              Meteor
           RMSE      r        RMSE      r        RMSE      r
B-WMT     0.1601   0.1766    0.1949   0.1565    0.1625   0.2047
SyTK      0.1581   0.2437    0.1888   0.2774    0.1595   0.2715
B+SyTK    0.1570   0.2696    0.1879   0.2939    0.1576   0.3111

The combined kernel is computed in a way similar to that explained above for the trees alone. Specifically, the kernel over the feature vectors is computed in the same manner and is then summed with the tree kernel value computed above. More formally:

$$K_S(I_1, I_2) = K_T(I_1, I_2) + \sum_{i=1}^{m} k_v(V_{1i}, V_{2i})$$

where $K_T$ is the tree kernel as computed above, $V_{1i}$ and $V_{2i}$ are the $i$th hand-crafted feature vectors of $I_1$ and $I_2$ respectively, $m$ is the size of the feature vector, $k_v(V_{1i}, V_{2i})$ is the kernel between these two vectors, and $K_S$ is the combined kernel.9 The representation of the training and test instances in this setting therefore contains four parse trees followed by a vector of 17 features.

The combined system is named B+SyTK in Table 4.2. All the scores obtained by the combined system are statistically significantly better than when only tree kernels are used, rendering the combination successful.
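For concreteness, the two combinations above amount to simple summations over corresponding trees and feature vectors. The sketch below mirrors the formulas with generic kernel callables; SVMLight-TK performs this combination internally, so the code is illustrative only.

```python
def combined_tree_kernel(trees_1, trees_2, tree_kernel):
    """K_T(I1, I2): sum of the kernel values between corresponding parse trees."""
    return sum(tree_kernel(t1, t2) for t1, t2 in zip(trees_1, trees_2))

def combined_kernel(inst_1, inst_2, tree_kernel, vector_kernel):
    """K_S(I1, I2): the tree-kernel sum plus the kernel(s) over the feature vectors."""
    return combined_tree_kernel(inst_1["trees"], inst_2["trees"], tree_kernel) + \
        sum(vector_kernel(v1, v2) for v1, v2 in zip(inst_1["vectors"], inst_2["vectors"]))
```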

4.2.3 Syntax-based QE with Hand-crafted Features

In this section, we focus on feature engineering for syntax-based quality estimation. We start by nominating feature templates that represent the constituency and dependency structure of the sentence from various points of view. Basically, each feature template contains two features, one extracted from the source and the other from the target side of the translation. Features are either numerical, where their values are real numbers, or nominal, where their values are strings. Numerical feature templates can be extended by extracting the relations (e.g. ratio or difference) between the source and target side feature values as features. Some feature templates are parametric, meaning that there can be variations of them obtained by changing the value of their parameters. For example, the non-terminal label (phrase category; e.g. NP, VP, etc.) is a parameter for the non-terminal label count (non-terminal-label-count) feature template. Therefore, it expands to several feature subtypes, one for each non-terminal label.

9In SVMLight, tree kernels and hand-crafted features can be combined by either summation or multiplication of their contributions. We use summation based on our preliminary tuning on the development set. The contributions of the tree kernels and hand-crafted features are given the same weight.

As in B-WMT, we use support vector machines (SVMs) to build the QE systems using these hand-crafted features. SVMs require feature values to be numerical, so nominal features must be converted to numbers beforehand. For this purpose, we binarize these features by converting each feature value observed in the training set into a binary feature. For instance, the label of the root node of the constituency tree, as a feature, can take a value from a set of 11 non-terminal labels appearing in the root nodes of the constituency trees of our English News training data. Each of these 11 labels is converted into a feature: if, for example, the root node label of a tree is SBAR, the value of the feature with that name will be 1 and that of all the other 10 features will be 0.

For some nominal features, the set of possible values can be large. For example, there are more than 7000 unique POS 3-grams in our English News training data, so a large number of binarized features would be created from such features. Not only does this high dimensionality reduce the efficiency of the system, it also affects performance due to the sparsity of such features, which is exacerbated by the relatively small size of our training data set. To tackle this issue, we impose a frequency cutoff on these features: after binarization, we keep only those which fire for more than a percentage threshold of the training data. This threshold is set empirically for each feature using the development set.10

10A number of thresholds are tried and at the end the most restrictive ones with at least one feature left are used.
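A minimal sketch of this binarization-with-cutoff step (the threshold and example labels are illustrative only; the thesis tunes the threshold per feature on the development set):

```python
from collections import Counter

def binarize_with_cutoff(train_values, min_fraction=0.01):
    """One binary feature per nominal value seen in training, keeping only values
    that occur in more than `min_fraction` of the training instances."""
    counts = Counter(train_values)
    kept = sorted(v for v, c in counts.items() if c / len(train_values) > min_fraction)
    index = {v: i for i, v in enumerate(kept)}

    def encode(value):
        vec = [0] * len(kept)
        if value in index:
            vec[index[value]] = 1
        return vec

    return kept, encode

labels, encode = binarize_with_cutoff(["S", "SBAR", "S", "FRAG", "S", "SBAR"], 0.2)
print(labels, encode("SBAR"), encode("NP"))  # pruned or unseen values map to all zeros
```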

71 The following lists our syntax-based feature templates. Each feature template has a name followed by its description. There are 14 constituency-based feature templates in total (including POS-based ones) which are presented first, followed by 11 dependency-based features. Parametric feature templates are expanded to several feature subtypes. Unless otherwise specified, numerical feature templates additionally include features capturing the ratio and difference of the source and target features as mentioned above.

constituency-tree-root: the label of the root node of the constituency tree

constituency-tree-height: the height of the constituency tree, which is the number of edges from the root node to the farthest terminal (leaf) node

constituency-tree-size: the number of nodes in the constituency tree

constituency-parse-probability: the log probability of the constituency parse assigned by the parser

constituency-pseudo-reference-score: the Parseval F1 score of the constituency parse tree with respect to a parse tree produced by the Stanford parser (Klein and Manning, 2003). This feature aims to capture the complexity of the input as well as the grammaticality of the output through parser agreement, measured by Parseval F1 score.

constituency-root-cfg-rule-rhs: the right hand side of the CFG production rule expanding the root node of the constituency tree.

constituency-cfg-rules: all non-lexical and lexical CFG production rules expanding the constituency tree nodes

average-constituency-tree-arity: the average arity of the non-lexical CFG production rules expanding the constituency tree nodes (i.e. average number of children of nodes)

non-terminal-label-count: the number of times a specific non-terminal label appears in the constituency tree. This feature template expands to subtypes, one for each non-terminal label, which include verb phrases, noun phrases, prepositional phrases, adjective phrases, adverb phrases, conjunction phrases and sentential phrases. In addition, for English, there are three more features, for WH, particle and parenthetical phrases, for which there is no equivalent in the FTB.11

pos-ngrams: POS n-grams, including unigram, 3-gram and 5-gram feature subtypes, each expanding to two further subtypes, one for the original POS tags and one for their universal mappings. The universal mapping introduced by Petrov et al. (2012) maps the original POS tags of the PTB and FTB to a set of 12 abstract tags.

pos-lm-score: POS n-gram scores against language models12 trained on the POS tags of the respective treebanks for each language. There are three parameters involved in this feature template expanding it into 12 feature subtypes: n-gram order (3- and 5-grams), LM score type (probability, log probability and perplexity), POS tag set (original and universal).

universal-pos-count: the count of each universal POS tag (includes 12 subtypes).13

first-verb-location: location of the first verb POS tag in the sentence in terms of the token distance from the beginning

pos-tag-ngram-quartile-frequency: the average number of POS n-grams in each n-gram frequency quartile of the POS corpora of the respective treebanks used to train the language models described above. This feature template involves three parameters leading to 16 feature subtypes: n-gram order (3- and 5-grams), frequency quartile (1 to 4), POS tag set (original and universal).

dependency-top-node-pos-tag: the POS tag of the top node of the dependency tree. The top node is the dependent of the dummy root node. There are two subtypes for this feature template, one for the original POS tag and one for the universal tag.

11All the sentential categories for each language and all WH phrase categories are counted together. For English, sentential categories include S, SBAR, SINV, SQ, SBARQ and for French Sint, Srel, Ssub. English WH categories include WHNP, WHPP, WHADVP.
12Language models are trained using the SRILM toolkit (http://www.speech.sri.com/projects/srilm/) with the Witten-Bell smoothing method.
13The NUM POS tag does not appear in the FTB so there is only one feature in this subtype.

73 dependency-top-node-dependent-count: the number of dependents of the top node

dependency-top-node-dependency-relations-sequence: the sequence of all dependency relations which modify the top node

dependency-top-node-dependents-pos-tag-sequence: the sequence of the POS tags of the dependents of the top node, including subtypes for original and universal POS tags

dependency-average-dependent-count: the average number of dependents of the nodes

dependency-tree-height: the height of the dependency tree computed in the same way as the constituency tree height

dependency-relation-ngrams: n-gram (3- and 5-grams) sequences of depen- dency relations of the tokens to their head.

dependency-relation-count: for each of the most frequent dependency relations in the News training set, its count in the sentence. Since the dependency relation sets of English and French are different, this feature template includes only one feature per dependency relation instead of one for the source and one for the target. There are 16 dependency relations out of 49 on the English side of this training set which appear more than 1000 times. On its French side, there are 10 such relations out of 23. Therefore, this feature template includes 26 binary features.

dependency-relation-lm-score: dependency relation n-gram scores against language models trained on the dependency conversions of the respective treebanks for each language. There are two parameters involved in this feature template expanding it into 6 feature subtypes: n-gram order (3- and 5-grams) and LM score type (probability, log probability and perplexity).

dependency-relation-ngram-quartile-frequency: the average number of dependency relation n-grams in each n-gram frequency quartile of the dependency relation corpora of the respective treebanks used to train the language models described above. This feature template involves two parameters leading to 8 feature subtypes: n-gram order (3- and 5-grams) and frequency quartile (1 to 4).

74 Table 4.3: QE performance with all hand-crafted syntactic features SyHC-all and the reduced feature set SyHC on the development set. Statistically significantly better scores compared to their counterpart (same column and the upper row) are in bold.

Development Set
               BLEU               1-TER              Meteor
            RMSE      r        RMSE      r        RMSE      r
SyHC-all   0.1567   0.3026    0.1851   0.2746    0.1575   0.2996
SyHC       0.1540   0.3398    0.1819   0.3263    0.1547   0.3452


dependency-relation-token-pairs: pairs of tokens and their dependency relations to their head. This feature template can be considered the dependency counterpart of the lexical CFG rules.

We build the first hand-crafted syntax-based QE system by combining all of the above features. Table 4.3 shows the performance of this system (SyHC-all) on the development set and Table 4.5 on the test set. This system outperforms the B-WMT baseline on the test set. There are 311 features in this feature set, which expands to 489 features after binarizing the nominal features.

Since the feature set is large and also contains many sparse features, we attempt to reduce it through a manual feature selection heuristic.14 We remove redundant features rather than using exhaustive or other automatic feature selection methods. For example, we investigate whether the ratio of the source and target numerical features, their difference, or both are redundant. For this purpose, we build three systems, one by removing the ratio features, one by removing the difference features and one by removing both of them from the

We then compare their performance on the development set and decide which of them is more likely to be useful to keep in the final feature set. This process is also carried out for log probability and perplexity features, original and universal POS-based features, n-gram and language model score features, lexical and non-lexical CFG rules, perplexity and log probability scores, and finally n-gram orders (i.e. 3-gram vs. 5-gram features).

14 With regard to efficiency and the feature set size, it should be noted that although the kernel computation is faster for hand-crafted features than for tree kernels, we optimise two parameters for the former and only one for the latter. Therefore, due to a considerably larger parameter search grid, tuning for hand-crafted features takes longer, and thus the number of features has a noticeable impact on the computation time.

Based on these observations, we remove the less useful features from the complete feature set. Our pruning process is efficiency-oriented, i.e. we tend to drop features with small gains. However, in the case of a performance drop with the final reduced feature set compared to the complete feature set, we restore such features. The final reduced feature set contains 104 features, which expand to 144 features after binarizing the nominal features. The features included in this feature set are listed in Table 4.4.
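The binarization of nominal features is not spelled out in detail here, so the following is only a minimal sketch of one plausible scheme, in which each nominal value seen at least min_count times in the training data gets its own binary indicator; the helper names and the cutoff value are illustrative assumptions, not taken from this work.

```python
from collections import Counter

def fit_binarizer(train_values, min_count=5):
    """Keep only the nominal values seen at least min_count times in training;
    each kept value becomes one binary feature."""
    counts = Counter(train_values)
    return sorted(v for v, c in counts.items() if c >= min_count)

def binarize(value, kept_values):
    """Map one nominal value to a 0/1 indicator vector over kept_values."""
    return [1 if value == v else 0 for v in kept_values]

# Toy example with the constituency-tree-root feature treated as nominal.
train_roots = ["SENT", "SENT", "NP", "SENT", "VP", "NP", "SENT", "NP"]
kept = fit_binarizer(train_roots, min_count=2)   # ['NP', 'SENT']
print(kept, binarize("VP", kept))                # ['NP', 'SENT'] [0, 0]
```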

SyHC in Table 4.3 is the system built with the reduced feature set evaluated on the development set. This system performs consistently better than the SyHC-all system on the development set, mostly with statistically significant differences as marked in the table. Since our feature reduction process is based on the results on the development set, we use the reduced feature set as our hand-crafted feature set for the rest of the work. However, when applied to the test set, the performance degrades (albeit not statistically significantly) as can be seen in Table 4.5. Considering a more than 70% reduction in feature set size, this relatively small degradation is tolerable.

Compared to SyTK, the performances are lower for all MT metrics, though not statistically significantly. It is worth noting that we observed the opposite behaviour on the development set, where the hand-crafted features largely outperform tree kernels. This suggests that the tree kernels are more generalisable, so that the performance drop is smaller on unseen data.

We also combine these features with the WMT 17 baseline features (B-WMT). The combined system is B+SyHC in Table 4.5. This combination also successfully improves over both the syntax-based and baseline systems, confirming the usefulness of syntactic information in combination with surface features.

Table 4.4: Features in the reduced feature set

Constituency
1  constituency-tree-root
2  constituency-tree-height
3  constituency-tree-size
4  constituency-tree-size-ratio
5  constituency-parse-probability
6  constituency-pseudo-reference-score
7  constituency-root-cfg-rule-rhs
8  constituency-lexical-cfg-rules
9  constituency-nonlexical-cfg-rules
10 average-constituency-tree-arity
11 phrase-tag-count
12 universal-pos-5gram-lm-perplexity
13 universal-pos-count
14 first-verb-location

Dependency
1  dependency-top-node-universal-pos-tag
2  dependency-top-node-dependent-count
3  dependency-top-node-dependent-ratio
4  dependency-top-node-dependency-relations-sequence
5  dependency-top-node-dependents-universal-pos-tag-sequence
6  dependency-average-dependent-count
7  dependency-average-dependent-count-ratio
8  dependency-tree-height
9  dependency-tree-height-ratio
10 dependency-relation-3grams
11 dependency-relation-count
12 dependency-relation-token-pairs
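To make a few of the dependency feature templates concrete, here is a minimal sketch computing the top-node POS tag, the top-node dependent count and the dependency tree height from a head-indexed parse; the token representation and the height convention are illustrative assumptions rather than the exact implementation used in this work.

```python
# Each token: (position, pos_tag, head_position, relation); head 0 is the dummy root.
def top_node(tokens):
    """The top node is the dependent of the dummy root."""
    return next(t for t in tokens if t[2] == 0)

def dependent_count(tokens, position):
    """Number of tokens whose head is the node at `position`."""
    return sum(1 for t in tokens if t[2] == position)

def tree_height(tokens, position=0):
    """Height of the subtree rooted at `position` (0 is the dummy root)."""
    children = [t[0] for t in tokens if t[2] == position]
    return 0 if not children else 1 + max(tree_height(tokens, c) for c in children)

# "The index rose 1.1% last month ."  (invented analysis)
sent = [(1, "DT", 2, "det"), (2, "NN", 3, "nsubj"), (3, "VBD", 0, "root"),
        (4, "CD", 3, "dobj"), (5, "JJ", 6, "amod"), (6, "NN", 3, "tmod"),
        (7, ".", 3, "punct")]
top = top_node(sent)
print(top[1], dependent_count(sent, top[0]), tree_height(sent) - 1)
# -> VBD 4 2  (top-node POS tag, its dependent count, tree height excluding the dummy root)
```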

4.2.4 Combined Syntax-based QE

We combine tree kernels and hand-crafted features to build a full syntax-based QE system. This system is presented in Table 4.6 (SyQE). It improves over both SyTK and SyHC. The improvements are slight for TER and Meteor prediction; the improvement for BLEU prediction is statistically significant.

The syntax-based system is combined with B-WMT in B+SyQE. All the gains obtained by the combination are statistically significant, showing again that syntax and surface features can complement each other.
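The way tree kernels and hand-crafted features are combined inside the learner is not detailed at this point; the sketch below shows one plausible realisation, summing a precomputed tree-kernel matrix with an RBF kernel over the hand-crafted feature vectors and training an SVM regressor on the combined kernel. The toolkit, the kernel weighting and the toy matrices are all illustrative assumptions, not a description of the system actually used.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel

def combined_kernel(K_tree, X_feat, gamma=0.1, alpha=1.0):
    """Sum a precomputed tree-kernel matrix with an RBF kernel over
    hand-crafted feature vectors (alpha weights the tree-kernel part)."""
    return alpha * K_tree + rbf_kernel(X_feat, X_feat, gamma=gamma)

# Toy data: K_tree would come from a tree-kernel toolkit; here it is a random
# symmetric positive semi-definite matrix, for illustration only.
rng = np.random.RandomState(0)
A = rng.rand(20, 5)
K_tree = A @ A.T                     # stand-in for a real tree-kernel matrix
X_feat = rng.rand(20, 12)            # stand-in hand-crafted feature vectors
y = rng.rand(20)                     # stand-in quality scores (e.g. BLEU)

K = combined_kernel(K_tree, X_feat)
model = SVR(kernel="precomputed", C=1.0).fit(K, y)
print(model.predict(K[:3]))          # predictions for the first three instances
```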

Table 4.5: QE performance with all hand-crafted syntactic features SyHC-all and the reduced feature set SyHC on the test set.

Test Set
            BLEU              1-TER             Meteor
            RMSE     r        RMSE     r        RMSE     r
B-WMT       0.1601   0.1766   0.1949   0.1565   0.1625   0.2047
SyTK        0.1581   0.2437   0.1888   0.2774   0.1595   0.2715
SyHC-all    0.1603   0.2108   0.1902   0.2510   0.1607   0.2493
SyHC        0.1603   0.1998   0.1913   0.2365   0.1610   0.2516
B+SyHC      0.1587   0.2418   0.1899   0.2611   0.1585   0.2964

Table 4.6: QE performance with the full syntax-based system (SyQE) and its combination with WMT baseline features on the News data set

            BLEU              1-TER             Meteor
            RMSE     r        RMSE     r        RMSE     r
B-WMT       0.1601   0.1766   0.1949   0.1565   0.1625   0.2047
SyTK        0.1581   0.2437   0.1888   0.2774   0.1595   0.2715
SyHC        0.1603   0.1998   0.1913   0.2365   0.1610   0.2516
SyQE        0.1577   0.2535   0.1887   0.2797   0.1594   0.2743
B+SyQE      0.1568   0.2802   0.1879   0.2937   0.1576   0.3127

4.3 Syntax-based QE with SymForum Data Set

We now turn our attention to our target domain data set and investigate the application of syntax-based QE to the SymForum data set introduced in Section 3.2.1 of Chapter 3. We build QE systems similar to those in the News data set experiments. Although the general settings in these experiments follow those of the News data set experiments, one difference is in the parsers used.15 For the English side, we use the same parser but train it on the whole WSJ instead of only its training section. The dependency parse trees are obtained by converting these constituency parse trees. For the French side, we again use the same constituency parser but trained on the entire FTB rather than its training section. For its dependency parsing, we use the ISBN dependency parser (Titov and Henderson, 2007), trained on the dependency conversion of the FTB using the Const2Dep converter, instead of applying the converter to the constituency parses as before. This parser requires POS-tagged sentences, which are obtained using the MElt tagger (Denis and Sagot, 2012) with its FTB-trained built-in model. We use HBLEU, HTER, Adequacy and Fluency as quality metrics. We first build the baseline systems as in Section 4.2.1 before proceeding to the syntax-based experiments.

15 We ran this set of experiments at the same time as the semantic-based QE experiments on this data set in Chapter 7. To increase the accuracy of the semantic role labelling required for those experiments, we trained these new parsers. We then decided to use the new parses for both sets of experiments, for both efficiency and consistency.

Table 4.7: Baseline system performances on the SymForum test set

            1-HTER            HBLEU             Adequacy          Fluency
            RMSE     r        RMSE     r        RMSE     r        RMSE     r
B-Mean      0.2442   -        0.2907   -        0.2501   -        0.2796   -
B-WMT17     0.2310   0.3661   0.2696   0.3806   0.2219   0.4710   0.2469   0.4769

4.3.1 Baseline QE Systems

Table 4.7 shows the performance of two baseline systems, similar to those used in Section 4.2.1, in predicting HTER, HBLEU, HMeteor, Adequacy and Fluency on the test set of the SymForum data set.16 B-Mean assigns the mean of the metric scores in the training set to all test instances. B-WMT17 uses the WMT 17 baseline features. To extract these features, we use slightly different resources from those used in the News experiments. In particular, we combine the English-French Europarl corpus (Koehn, 2005) and the Symantec translation memory described in Section 2.2 to extract the two translation features (features number 7 and 8 in Table 3.1 in Chapter 3). In addition, the source and target sides of these parallel corpora are used to train the language models needed for extracting the n-gram probability features (features number 4 and 5). Furthermore, the source side is used to extract the n-gram frequency and percentage features (features number 9 to 15).

16 As described in Section 3.2.1 of Chapter 3, Adequacy and Fluency scores are scaled to the [0,1] range to be easily comparable to the human-targeted scores.

Table 4.8: QE performance with syntactic tree kernels SyTK and their combination with WMT baseline features B+SyTK on the SymForum test set.

            1-HTER            HBLEU             Adequacy          Fluency
            RMSE     r        RMSE     r        RMSE     r        RMSE     r
B-WMT17     0.2310   0.3661   0.2696   0.3806   0.2219   0.4710   0.2469   0.4769
SyTK        0.2267   0.3693   0.2721   0.3559   0.2258   0.4306   0.2431   0.5013
B+SyTK      0.2243   0.3935   0.2655   0.4082   0.2215   0.4632   0.2403   0.5144

The much higher Pearson r for human-targeted metric prediction obtained here using B-WMT17, compared to that on the News data set, is notable in Table 4.7. Though not directly comparable, this may be because these metrics are a better indicator of translation quality when computed against human-targeted references rather than pre-translated references, and are thus easier to learn. Another interesting observation is that manual metric predictions achieve a much higher Pearson r than human-targeted metric predictions. This can partially be attributed to the score distribution, as there are only 5 different scores to learn in manual metric prediction. Nevertheless, another reason may be that manual metric scores better capture the translation quality, and are thus easier to learn and predict. Note that in subsequent mentions of the baseline system, we will be referring to the B-WMT17 system.
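All comparisons in this chapter rest on two numbers, RMSE and Pearson r between predicted and reference scores; as a reminder of what they measure, here is a minimal sketch with invented scores.

```python
import numpy as np
from scipy.stats import pearsonr

def rmse(pred, gold):
    """Root mean squared error between predicted and reference scores."""
    pred, gold = np.asarray(pred), np.asarray(gold)
    return float(np.sqrt(np.mean((pred - gold) ** 2)))

# Invented predicted vs. reference 1-HTER scores for five segments.
pred = [0.62, 0.48, 0.55, 0.70, 0.51]
gold = [0.60, 0.40, 0.58, 0.75, 0.47]
r, _ = pearsonr(pred, gold)
print(f"RMSE = {rmse(pred, gold):.4f}, Pearson r = {r:.4f}")
```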

4.3.2 Syntax-based QE with Tree Kernels

Using the same setting as in Section 4.2.2, we build tree kernel QE systems to predict all five metrics for the SymForum data set. The performances of these systems are shown in Table 4.8.

Unlike with the News data set, the syntactic tree kernels (SyTK) do not seem to be better than the B-WMT17 baseline. They outperform the baseline in predicting HTER and Fluency but the gaps are not statistically significant. On the other hand, in predicting BLEU and Adequacy, they perform worse than the baseline, though only the latter difference is statistically significant.

Table 4.9: QE performance with hand-crafted features SyHC and their combination with WMT baseline features B+SyHC on the SymForum test set

            1-HTER            HBLEU             Adequacy          Fluency
            RMSE     r        RMSE     r        RMSE     r        RMSE     r
B-WMT17     0.2310   0.3661   0.2696   0.3806   0.2219   0.4710   0.2469   0.4769
SyTK        0.2267   0.3693   0.2721   0.3559   0.2258   0.4306   0.2431   0.5013
SyHC        0.2435   0.2572   0.2797   0.3080   0.2334   0.3961   0.2479   0.4696
B+SyHC      0.2265   0.4159   0.2689   0.4080   0.2221   0.4795   0.2387   0.5269

When these tree kernels are combined with the baseline features (B+SyTK in Table 4.8), improvements are seen over the better of the two component systems, except for Adequacy prediction. The only statistically significant improvement, however, is the one for HTER prediction. Again, this is not on a par with the results we saw with the News data set, where the combination led to significant improvements across the board.

4.3.3 Syntax-based QE with Hand-crafted Features

In Section 4.2.3, we designed a set of hand-crafted features extracted from the constituency and dependency trees of the source and target sides of the News data set. We now extract the same set of features from the SymForum data set and build a QE system using the same SVM setting. It should be noted that, despite using the same feature templates, the actual features in the feature set (and thus the feature set size) are slightly different from the News data set, due to different frequency threshold cutoff values for the binarized nominal features tuned for this data set. Specifically, there are 127 features in the feature set, compared to 144 features in the case of the News data set.

Table 4.9 displays the performance of the system built using this feature set (SyHC). As can be seen, the scores are substantially lower compared to the tree kernel system (SyTK in the same table). Despite the large gaps, the differences are not statistically significant for the Pearson r of HBLEU and Adequacy prediction and the RMSE of Fluency prediction. The differences are particularly pronounced in the case of HTER prediction, at more than 11 Pearson r points. Similarly, the system is outperformed by the baseline, where all the differences except for Fluency prediction are large and statistically significant.

These features can nevertheless be used to boost the performance of the baseline features. This can be seen in the performance of B+SyHC in Table 4.9, which is the combination of these two systems. All the scores are better than the baseline scores, with the differences being statistically significant for HTER and Fluency prediction.

Table 4.10: QE performance with the full syntax-based system (SyQE) and its combination with WMT baseline features on the SymForum data set

            1-HTER            HBLEU             Adequacy          Fluency
            RMSE     r        RMSE     r        RMSE     r        RMSE     r
B-WMT17     0.2310   0.3661   0.2696   0.3806   0.2219   0.4710   0.2469   0.4769
SyTK        0.2267   0.3693   0.2721   0.3559   0.2258   0.4306   0.2431   0.5013
SyHC        0.2435   0.2572   0.2797   0.3080   0.2334   0.3961   0.2479   0.4696
SyQE        0.2255   0.3824   0.2711   0.3650   0.2248   0.4393   0.2419   0.5087
B+SyQE      0.2236   0.4017   0.2686   0.3852   0.2219   0.4632   0.2391   0.5255

4.3.4 Combined Syntax-based QE

Combining tree kernels and hand-crafted features in Section 4.2.4 led to a slight improvement for the News data set. We similarly combine SyTK and SyHC to build our full syntax-based QE system, SyQE, for the SymForum data set. The performance of this system is shown in Table 4.10.

Compared to SyTK, the better of the two component systems, all the scores have increased, with the biggest improvement obtained for HTER prediction. Interestingly, despite being small, these differences are statistically significant. Moreover, the combination seems to be more useful for this data set than for the News data set. Compared to the baseline, we see mixed results: the HTER and Fluency prediction scores are higher than the baseline but the BLEU and Adequacy prediction scores are lower. None of these differences are statistically significant. This is again unlike what we observed with the News data set, suggesting that the performance of syntax-based QE is dependent on the data set. One of the differences between the News and the SymForum data sets is that the syntactic information for the latter is expected to be noisier, due to out-of-domain parsing as well as parsing erroneous text.

This may consequently affect the performance of the downstream QE task. In the next chapter, we further investigate this hypothesis by examining the effect of parse quality on syntax-based QE.

Finally, we combine the full syntax-based system with the baseline features.

The resulting system is B+SyQE, presented in Table 4.10. According to the results, the combination further improves HTER and Fluency prediction statistically significantly. It is also able to close the gap between the syntax-based system and the baseline in BLEU and Adequacy prediction, but no improvement over the baseline is achieved. An interesting point to note so far is that HTER and Fluency prediction tend to behave similarly, and so do BLEU and Adequacy prediction. Additionally, the former pair tend to benefit more from syntactic information. This is expected in the case of the fluency measure, as it is obviously concerned with the grammar of the MT output. In the case of HTER, this suggests that this metric also captures the grammaticality of the translation.

4.4 Summary and Conclusion

We built a set of quality estimation systems which relied solely on syntactic information derived from the machine translation source and target. The syntactic information we used was derived from both constituency and dependency parses, each capturing a different aspect of language syntax. We evaluated our approaches using two different data sets: 1) the News data set, a set of edited sentences in the newswire domain which come from the WMT13 News development data set, the purpose of which was to prevent the extra noise from out-of-domain parsing from affecting the conclusions, and 2) the SymForum data set, a set of user-generated sentences in the security software domain which come from the Symantec Norton forums, the target domain text of this thesis.

The quality estimation systems were based upon two approaches to incorporating syntactic information in a machine learning framework for quality estimation: 1) tree kernels and 2) hand-crafted features. These approaches were experimented with both separately and in combination with each other. The results show that tree kernels are more effective than the hand-crafted features in learning translation quality scores from parse trees, since they are exposed to the entire tree structure, from which they can extract the most useful information for this purpose.

We compared the performance of the syntax-based QE systems with a baseline system built using the well-known surface features used as a reasonable baseline in the quality estimation shared tasks of the WMT workshop. According to the results, the syntax-based systems perform substantially better than the WMT baseline system when applied to the News data set. In addition, these systems were successfully combined, the resulting system being significantly better than both components. However, inconsistent results were observed with the SymForum data set, where the syntax-based systems were outperformed by the baseline in BLEU and Adequacy prediction, and combining the systems did not improve the Adequacy prediction. Although these differences were not statistically significant, they suggest that QE performance using syntactic parses is dependent on the data and score distribution. Since one of the differences between these two data sets is the lower quality of the parsing of the SymForum data due to out-of-domain parsing, this behaviour can be attributed to the parse quality. We investigate this hypothesis in the next chapter.

In sum, syntactic information is valuable for quality estimation both when used alone and when combined with other sources of information. In the next chapter we elaborate further on the role of syntax in estimating the quality of machine translation.

Chapter 5

In-depth Analysis of Syntax-based Quality Estimation

As explained in Section 3.1 of Chapter 3, syntactic information has mainly been used in conjunction with other types of information with the aim of building strong quality estimation systems, but none of the previous work has examined in detail the use of syntax in QE for MT. In this chapter, we conduct an in-depth study of the role of syntax in QE from various perspectives, based on the experiments from the previous chapter.

The syntactic parses obtained using automatic parsers are not free of errors, as state-of-the-art parsing performance is still far from perfect. Moreover, this state of the art is for parsing the edited newswire text to which the parsers are tailored. These parsers are applied to the output of machine translation and to the user-generated content of the Norton forum, both of which can be grammatically ill-formed. The parsers furthermore face new vocabulary, syntactic constructions and writing styles in parsing the forum text. These situations are known to impede parsing performance, leading to noisy parses (Foster et al., 2011a). This raises an interesting question: To what extent is QE for MT influenced by the quality of the syntactic information provided to it, i.e. does the accuracy of the underlying parsing system used to provide the syntactic features influence the accuracy of the QE system?

We seek to answer this question by comparing QE systems which use syntactic parses of varying quality. The parsing quality is varied by reducing the size of the parser training data. The results can shed more light on the problem observed in the previous chapter, namely that syntax-based QE was more useful for the News data set than for the SymForum data set.

As stated in the previous chapter, syntactic information can be useful in quality estimation by capturing the structural complexity of the source sentence and the grammaticality of its machine translation. In order to identify the contribution of each of these factors, we decompose the QE systems built in the previous chapter into their source and target components. Moreover, the performance of the target component may be affected by the fact that it is built upon parses of machine translation output, which is generally expected to contain grammatical problems. To better understand the effect of parsing machine-translated sentences in syntax-based QE, we build the same QE systems in the opposite translation direction. We then decompose the new QE systems into their source and target components again and study the resulting behaviours. Based on our findings from these experiments, we design a set of heuristics to modify the French parse trees by adding more structure to them, so that the new parse trees become more useful in the syntax-based QE systems.

In the rest of this chapter, we first review the related work in Section 5.1. In Section 5.2, we examine the impact of parser accuracy on the performance of the syntax-based QE systems using both the News and SymForum data sets. In Section 5.3, the role of source and target syntax in QE is examined. Section 5.4 describes the heuristics designed to modify the French parse trees and the experiments replicated using the modified trees.1

1 For easy comparison, all the experimental results in this chapter are also presented, alongside other QE results, in Tables A.1 and A.2 in Appendix A for the SymForum and News data sets respectively.

5.1 Related Work

There have been relatively few attempts to investigate the impact of parser accuracy on downstream applications. Quirk and Corston-Oliver (2006) report that a syntax-enhanced MT system is sensitive to a decrease in parser quality obtained by training the parser on smaller training sets. However, the BLEU score improvement they achieve seems small compared to the underlying parser accuracy difference. On the other hand, Zhang et al. (2010) experiment with a different syntax-enhanced MT system and do not observe a difference between the performance of MT systems using parses of different quality.

Johansson and Nugues (2007) introduce a constituency-to-dependency converter and find that syntactic dependency trees produced using this converter yield more accurate semantic role labels than syntactic dependency trees produced using a less sophisticated converter, despite the fact that trees produced using the older converter tend to have higher attachment scores than the trees produced using the new converter. Mollá and Hutchinson (2003) find significant differences between two dependency parsers in a task-based evaluation involving an answer extraction system, but bigger differences between the two parsers when evaluated intrinsically.

Miyao et al. (2008) and Goto et al. (2011) evaluate a suite of state-of-the-art English statistical parsers on the tasks of protein-pair interaction extraction and patent translation respectively, and find only small (albeit sometimes statistically significant) differences between the parsing systems in those tasks. Perhaps this is not surprising, as the parsing systems also perform similarly when evaluated intrinsically.

Our investigation of the impact of parser accuracy is closest to that of Quirk and Corston-Oliver (2006), since we take one type of parsing model and compare different instantiations of this model which differ substantially with respect to their intrinsic evaluation scores. However, like Miyao et al. (2008) and Goto et al. (2011), we also compare the performance of models which have similar intrinsic evaluation scores but produce different output.

Table 5.1: Parser F1s for various training set sizes: the sizes in bold are selected for the experiments.

                 English                              French
Training size    100     1K      10K     20K     40K      100     500     2.5K    5K      10K
F1               51.06   72.53   87.69   88.47   89.55    52.85   66.51   78.55   81.85   83.40

5.2 Parser Accuracy in Syntax-based QE

In order to investigate the effect of parsing accuracy on syntax-based quality estimation built using those parses, we train two parsing models – one "higher-accuracy" model and one "lower-accuracy" model – for each language. We use training set size to control the accuracy. The higher-accuracy models are the ones used so far for parsing our data. For the lower-accuracy models, we first select four random subsets of varying sizes from the larger training sets for each language2 and measure the performance of the resulting models on the standard parsing test sets3 using Parseval F1,4 as shown in Table 5.1.

The worst-performing models for each language are those trained on 100 training sentences. However, these models fail to parse about 10 and 2 percent of our English and French data respectively. Since the failed sentences are not necessarily parallel on the source and target sides, this could affect the downstream QE performance. Therefore, we opt to employ as our "lower-accuracy" models the second smallest training set sizes, which are 1K sentences for English and 500 for French. For both languages, the difference in F1 between the lower-accuracy and higher-accuracy models is about 17 points. In order to measure how different the parses produced by these models are on our QE data, we compute their F1 relative to each other.

The F1 for the English model pair is 71.50 and for the French pair 63.19. In the following sections, we apply these parsing models to the News and SymForum data sets and rebuild the QE systems.

2 Each smaller subset is contained in all the larger subsets.
3 WSJ Section 23 and the FTB test set.
4 All parsing models are trained with 5 split/merge cycles.
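A minimal sketch of the kind of labelled bracket comparison underlying these F1 figures is given below; it treats one parser's spans as the reference and ignores the usual Parseval details such as punctuation handling, so it only approximates the evaluation actually used.

```python
from collections import Counter

def bracket_f1(gold_brackets, pred_brackets):
    """gold/pred are lists of (label, start, end) constituent spans for one sentence."""
    gold, pred = Counter(gold_brackets), Counter(pred_brackets)
    matched = sum((gold & pred).values())
    precision = matched / max(sum(pred.values()), 1)
    recall = matched / max(sum(gold.values()), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented spans for one sentence, in (label, start, end) form.
gold = [("S", 0, 6), ("NP", 0, 2), ("VP", 2, 6), ("NP", 3, 6)]
pred = [("S", 0, 6), ("NP", 0, 2), ("VP", 2, 6), ("PP", 3, 6)]
print(f"{bracket_f1(gold, pred):.2f}")   # 0.75
```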

Table 5.2: QE performance with systems built using parses of higher- (H subscripts) and lower-accuracy (L subscripts) parsing models

          BLEU              1-TER             Meteor
          RMSE     r        RMSE     r        RMSE     r
SyQEH     0.1577   0.2535   0.1887   0.2797   0.1594   0.2743
SyQEL     0.1583   0.2341   0.1887   0.2796   0.1600   0.2606
SyTKH     0.1581   0.2437   0.1888   0.2774   0.1595   0.2715
SyTKL     0.1583   0.2350   0.1888   0.2792   0.1600   0.2620
SyHCH     0.1603   0.1998   0.1913   0.2365   0.1610   0.2516
SyHCL     0.1609   0.1750   0.1914   0.2262   0.1613   0.2336

5.2.1 The News Data Set

We first investigate the impact of the intrinsic quality of the parse trees on the QE system using the News data set. We build a QE system similar to SyQE in Section 4.2.4 of the previous chapter, but with the parse trees of the lower-accuracy parsing models. Table 5.2 shows the performances of these two QE systems (SyQEH and SyQEL).

To distinguish between the systems, the subscripts H and L are added to the system names to represent the higher- and lower-accuracy parsing models respectively. Surprisingly, no statistically significant difference is seen between the performance of the systems built with the two different parse qualities. On TER prediction, both systems perform the same.

To better understand the behaviour of these systems, we break them down into their components: tree kernels vs. hand-crafted features, constituency vs. dependency features, source side vs. target side features. SyTKH and SyTKL in Table 5.2 are the tree kernel components and SyHCH and SyHCL the hand-crafted components. As the results suggest, the tree kernel and hand-crafted components behave similarly with respect to the parser accuracy difference, since none of the differences are large or statistically significant. This is particularly visible for the tree kernels.

We now study the tree kernel and hand-crafted systems separately in terms of their components. Tables 5.3 and 5.4 present the performances of these components built using the higher- and lower-accuracy parses. C, D, S and T denote the constituency, dependency, source side and target side components respectively.

Table 5.3: Tree kernel QE systems with higher- and lower-accuracy trees (C: constituency, D: dependency, ST: Source and Translation, H: Higher-accuracy parsing model, L: Lower-accuracy parsing model)

              BLEU              1-TER             Meteor
              RMSE     r        RMSE     r        RMSE     r
SyTK/C-STH    0.1584   0.2307   0.1896   0.2641   0.1594   0.2748
SyTK/C-STL    0.1582   0.2348   0.1890   0.2733   0.1596   0.2698
SyTK/D-STH    0.1591   0.2103   0.1907   0.2412   0.1616   0.2213
SyTK/D-STL    0.1597   0.1902   0.1913   0.2279   0.1623   0.2025
SyTK/C-SH     0.1583   0.2312   0.1904   0.2521   0.1590   0.2824
SyTK/C-SL     0.1582   0.2335   0.1901   0.2554   0.1599   0.2638
SyTK/C-TH     0.1608   0.1479   0.1925   0.2018   0.1620   0.2124
SyTK/C-TL     0.1616   0.1204   0.1934   0.1800   0.1632   0.1773
SyTK/D-SH     0.1598   0.1869   0.1925   0.2004   0.1630   0.1832
SyTK/D-SL     0.1601   0.1780   0.1933   0.1816   0.1630   0.1835
SyTK/D-TH     0.1598   0.2102   0.1916   0.2204   0.1622   0.2051
SyTK/D-TL     0.1604   0.1679   0.1924   0.2037   0.1628   0.1867

SyTK/C-STH and SyTK/C-STL in Table 5.3 are the tree kernel systems using the constituency trees of both the source and the translation, with the higher- and lower-accuracy parsing models respectively. As can be seen, SyTK/C-STL achieves even better scores in BLEU and TER prediction. In Meteor prediction, the difference is in the opposite direction but not significant. On the other hand, the dependency component is negatively affected by the lower quality of the parse trees, but the gaps are neither large nor statistically significant (SyTK/D-STH vs. SyTK/D-STL).

We now further split these systems into source and translation sides. SyTK/C-SH and SyTK/C-SL use the higher- and lower-accuracy constituency trees of only the source text. The behaviour is similar to when the constituency trees of both sides are used (SyTK/C-STH and SyTK/C-STL). However, the system using the higher-accuracy constituency trees of the target (SyTK/C-TH) achieves better scores than the one using the lower-accuracy ones (SyTK/C-TL), but, again, this difference is not statistically significant.

SyTK/D-SH and SyTK/D-SL are the systems using the dependency trees of only the source text. The higher-accuracy system better predicts BLEU and TER but not Meteor. The gaps are, however, not statistically significant. On the other hand, on the target side, the higher-accuracy dependency trees in SyTK/D-TH perform consistently better than their lower-accuracy counterpart (SyTK/D-TL), with the difference especially pronounced in BLEU prediction. Although the gap in BLEU prediction here is the only large difference observed among all settings, it is surprisingly not statistically significant.5

Putting all the above observations together, the large performance gap between the two parsing models does not affect the performance of the tree-kernel-based QE system. This is especially visible for the constituency tree kernels and for the parse trees of the source (English) side. On the other hand, a small effect is seen with the parse trees of the target (French) side, especially with dependency trees. However, since the source-side constituency trees contribute the major part of the performance, this effect fades out in the full system.

For the hand-crafted system, the performances of the components are shown in Table 5.4. Compared to the tree kernel components, the gaps are larger, as can also be seen in the three statistically significant differences marked in bold.

Table 5.4: Hand-crafted QE systems with higher- and lower-accuracy trees (C: constituency, D: dependency, ST: Source and Translation, H: Higher-accuracy parsing model, L: Lower-accuracy parsing model)

              BLEU              1-TER             Meteor
              RMSE     r        RMSE     r        RMSE     r
SyHC/C-STH    0.1604   0.2046   0.1906   0.2443   0.1604   0.2538
SyHC/C-STL    0.1613   0.1739   0.1894   0.2663   0.1599   0.2643
SyHC/D-STH    0.1617   0.1593   0.1956   0.1609   0.1634   0.1813
SyHC/D-STL    0.1125   0.1633   0.1944   0.1491   0.1638   0.1535
SyHC/C-SH     0.1613   0.1748   0.1930   0.1874   0.1621   0.2115
SyHC/C-SL     0.1616   0.1652   0.1933   0.1803   0.1620   0.2113
SyHC/C-TH     0.1622   0.1585   0.1936   0.1801   0.1622   0.2116
SyHC/C-TL     0.1624   0.1409   0.1935   0.1747   0.1630   0.1861
SyHC/D-SH     0.1385   0.1624   0.1939   0.1616   0.1626   0.1932
SyHC/D-SL     0.1381   0.1625   0.1946   0.1466   0.1636   0.1597
SyHC/D-TH     0.1643   0.0583   0.1983   0.0247   0.1659   0.0979
SyHC/D-TL     0.1720   -0.0282  0.1978   -0.0032  0.1655   0.0567

5 The high scores of SyTK/D-TH seem to occur by chance, because on the development set, on which the parameters are tuned, the scores are much lower: the RMSE is 0.1633 and the Pearson r is 0.1355. The same applies to SyTK/D-TL, where the RMSE is 0.1621 and the Pearson r is 0.1175. Not only is the Pearson r gap not statistically significant, the RMSE of SyTK/D-TL is even better.

Similar to the tree kernels, most of the parse quality impact seen on the performance of QE with hand-crafted features (SyHC) is due to the dependency trees and the target (French) side trees.6 In addition, the quality of the source (English) side trees has a greater effect on the performance of the hand-crafted systems. In the next section, we further validate these results using low-accuracy parses obtained in a different way.

5.2.1.1 Analysing the Results

One may argue that the way the parser accuracy is varied here could impact the results – a parser with a similar F1 but different output may lead to a different conclusion. It is possible to test this by using the parsing model from a lower split/merge (SM) cycle of our iterative PCFG-LA parser. For example, the models from the first SM cycle with a 10K training set size for English and a 2.5K training set size for French score 73.04 and 70.22 F1 points on their respective test sets. While these scores are close to those of the lower-accuracy models used above (72.53 and 66.51), their outputs are different: the parses of the two lower-accuracy English models achieve only 66.46 F1 against each other, and those of the two French ones 66.51 F1. We use the parse trees of these alternative lower-accuracy parsing models to build a new tree kernel QE system. The RMSE and Pearson r for BLEU prediction are 0.1585 and 0.2316. These scores are not statistically significantly different from those of SyTKH in Table 5.2 (0.1581 and 0.2437), strengthening our conclusion that intrinsic parse accuracy is not crucial for QE of MT.

Another question is to what extent we require a linguistically realistic syntactic structure, as opposed to any structure that retains some form of regularity, no matter how accurate. To answer this question, we build random tree structures for the source and translation segments. The random tree for a segment is generated by recursively splitting the sentence into random phrases and randomly assigning them a syntactic label.7

6 Note however that the systems built with target side dependency trees (SyHC/D-T) perform very poorly, especially in predicting BLEU and TER. This can be the reason for the performance degradation when constituency and dependency trees are combined in SyHC.
7 The English random model achieves an F1 of around 0.5 and the French model an F1 of 0.2.

We apply this procedure to the source and translation segments and build a tree kernel QE system with the resulting trees. The RMSE and Pearson r are 0.1631 and -0.0588 respectively. This shows that tree kernels still require the regularity encoded in the lower- and higher-accuracy trees, and perhaps this regularity exists in the output of low-accuracy parsers regardless of their intrinsic quality.
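A minimal sketch of one way to generate such random trees is shown below; the label inventory and the binary splitting policy are illustrative assumptions rather than the exact procedure used here, but the output is a bracketed tree that a tree kernel can consume.

```python
import random

LABELS = ["S", "NP", "VP", "PP", "ADJP"]   # illustrative label inventory

def random_tree(tokens, rng):
    """Recursively split the token span at a random point and give each
    resulting phrase a random syntactic label (bracketed-string output)."""
    if len(tokens) == 1:
        return tokens[0]
    split = rng.randint(1, len(tokens) - 1)
    left = random_tree(tokens[:split], rng)
    right = random_tree(tokens[split:], rng)
    return f"({rng.choice(LABELS)} {left} {right})"

rng = random.Random(42)
sentence = "the index rose sharply last month".split()
print(f"(TOP {random_tree(sentence, rng)})")
```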

5.2.2 The SymForum Data Set

The experiments on the News data set in the previous section showed that the intrinsic accuracy of the parse trees did not affect the accuracy of the syntax-based quality estimation built with those trees. It is interesting to know how that finding applies to another data set with different characteristics. We therefore rebuild the SyQE system of Section 4.3.4 of the previous chapter using parsers with lower accuracy. For the English side and the constituency parsing of the French side, we use the same low-accuracy parsers as in the previous section. For the dependency parsing of the French side, however, we need to train a new parser, as the parser used with this data set is different. Using a similar strategy, we build such a parser by training ISBN on the same fraction of the dependency conversion of the FTB as used in Section 5.2.1, i.e. 500 parse trees. Table 5.5 compares the performance of this parsing model, evaluated on the test section of the FTB, with the performance of a higher-accuracy model, which is trained on the training section of the FTB. The performance is measured by unlabelled and labelled attachment scores (UAS and LAS). As shown in the table, there is a difference of more than 9 LAS points and 7.5 UAS points between the parsers. Note that this higher-accuracy parser is not exactly the one used for parsing the SymForum data, as we needed to leave out the test section for evaluation.

Using the output of these lower-accuracy parsers, we build a full syntax-based system named SyQEL as a counterpart to SyQE in Table 4.10 in Chapter 4. Table 5.6 shows the performance of both systems. According to the results, SyQEH, which is the same system as SyQE in Table 4.10, outperforms SyQEL in predicting HTER and HBLEU by around 2 Pearson r points.
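UAS and LAS simply compare the predicted heads, and head-relation pairs, against the gold ones; here is a minimal sketch, assuming each token is represented by a (head, relation) pair (the example analyses are invented).

```python
def attachment_scores(gold, pred):
    """gold/pred: lists of (head_index, relation) per token, same length."""
    assert len(gold) == len(pred)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las

# Invented example: 5 tokens, one wrong head, one wrong label.
gold = [(2, "det"), (3, "suj"), (0, "root"), (3, "obj"), (3, "mod")]
pred = [(2, "det"), (3, "suj"), (0, "root"), (3, "mod"), (5, "mod")]
print(attachment_scores(gold, pred))   # (0.8, 0.6)
```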

Table 5.5: UAS and LAS of lower- and higher-accuracy French dependency parsers

         UAS      LAS
ISBNH    0.8892   0.8646
ISBNL    0.8132   0.7753

Table 5.6: QE performance with systems built using parses of higher- (H subscripts) and lower-accuracy (L subscripts) parsing models

          1-HTER            HBLEU             Adequacy          Fluency
          RMSE     r        RMSE     r        RMSE     r        RMSE     r
SyQEH     0.2255   0.3824   0.2711   0.3650   0.2248   0.4393   0.2419   0.5087
SyQEL     0.2273   0.3647   0.2731   0.3455   0.2249   0.4386   0.2415   0.5097

These differences are, however, not statistically significant. On the other hand, the two systems achieve a similar performance in predicting the manual metric scores. The results suggest that, similar to what we observed with the News data set in Section 5.2.1, there is no correlation between the intrinsic accuracy of the parses and the accuracy of the QE systems built with them. This finding also rules out the hypothesis made in Section 4.4 of the previous chapter that syntax is less useful in quality estimation of forum text translation due to the additional noise in its parsing. However, it can still be related to aspects of parsing quality which are not captured by the PARSEVAL metric. For example, inconsistency of the parsing within the document, caused by the structural inconsistency of the forum text, could be a factor affecting the performance of the syntax-based QE system on this data set.

5.3 Source and Target Syntax in QE

Although we did not find a significant difference between QE systems built using high- and low-accuracy parses in the previous sections, the results of the tree kernel system breakdown in Table 5.3 reveal a performance gap from a different angle. It can be seen that the source side (English) parse trees perform substantially better than the target (French) parse trees (e.g. SyTK/C-SH vs. SyTK/C-TH, SyTK/C-SL vs. SyTK/C-TL).

Table 5.7: Tree kernel QE performances on the News data set with only source (English) or translation (French) side trees (S: source, T: translation)

              BLEU              1-TER             Meteor
              RMSE     r        RMSE     r        RMSE     r
SyTK/CD-S     0.1584   0.2294   0.1899   0.2573   0.1596   0.2690
SyTK/CD-T     0.1597   0.2101   0.1913   0.2270   0.1613   0.2299
SyTK/C-S      0.1583   0.2312   0.1904   0.2521   0.1590   0.2824
SyTK/C-T      0.1608   0.1479   0.1925   0.2018   0.1620   0.2124
SyTK/D-S      0.1598   0.1869   0.1925   0.2004   0.1630   0.1832
SyTK/D-T      0.1598   0.2102   0.1916   0.2204   0.1622   0.2051

In order to investigate this difference, we take a closer look at the effect of parse trees of different languages, as well as parse trees of the machine translation input (well-formed text) versus the machine translation output (possibly ill-formed text), in syntax-based quality estimation using tree kernels on the News data set. We decompose the tree kernel system into the source and target trees.

Table 5.7 depicts the performance of these systems. SyTK/CD-S is the system with the constituency and dependency trees of the source and SyTK/CD-T the system with those of the target. At a glance, it can be seen that the tree kernels of the source perform better than the tree kernels of the target. However, when the systems are further broken down into constituency and dependency tree kernels (SyTK/C-S, SyTK/C-T, SyTK/D-S, SyTK/D-T), this is only true for the constituency tree kernels, albeit with much bigger gaps. On the other hand, the dependency trees of the target are more useful than the source dependency trees.

The large difference between SyTK/C-S and SyTK/C-T could be attributed to the (presumably) lower quality of the parse trees of the target text (see Figure 5.1 for an example). Although this low quality would be expected to affect the dependency parse trees in the same way, as they are directly derived from the constituency trees, this is not the case according to the results. Perhaps the problematic aspects of the MT parses are only reflected in the constituency trees and are abstracted away from the dependency trees.

[Figure 5.1: Parse tree of the machine translation of "Dark Matter Affects Flight of Space Probes" to French ("Sombre Matter Affects de vol Probes spatiale"), in which most tokens receive the foreign-word tag ET under a flat NP]

The second hypothesis is that this large gap is rather due to idiosyncrasies of the underlying treebanks which are not carried over to the dependency structure via the conversion tools. To separate out the roles of the treebank and the machine translation in the behaviour observed above, we switch the translation direction to French-to-English. Therefore, we now parse the well-formed French input sentences and the machine-translated English segments.8 If the first hypothesis were true, the target side parse trees in this direction would still underperform the source side ones. Otherwise, if the second hypothesis were correct, the French parse trees would still be less useful, even on the source side.

Table 5.8 shows the results for all tree kernel QE systems for the French-English translation direction. As the magnitude of the scores is very similar to the English-French direction, comparison is easier.9 According to the results, all the systems using target trees outperform those using source trees.

The differences between SyTK-FE/CD-T and SyTK-FE/CD-S and between SyTK-FE/C-T and SyTK-FE/C-S are especially substantial and statistically significant. Therefore, it is apparent that the suspected lower quality of the constituency parse trees of the MT output is not the reason for the lower performance of the quality estimation using kernels derived from them. Rather, parse trees of French sentences seem to be less useful than the English ones in such a framework.

8 Note that the segments are identical to the English-French direction and are translated and parsed using the same systems described earlier.
9 We do not predict Meteor in these settings, as scoring with Meteor in the French-English direction requires parameter weight tuning.

Table 5.8: Tree kernel QE performances for the French-English direction on the News data set (FE: French to English, C: constituency, D: dependency, S: source, T: translation)

              BLEU              1-TER
              RMSE     r        RMSE     r
SyTK-FE       0.1561   0.2334   0.1955   0.2897
SyTK-FE/CD-S  0.1574   0.1830   0.1986   0.2339
SyTK-FE/CD-T  0.1559   0.2423   0.1961   0.2803
SyTK-FE/C-S   0.1581   0.1578   0.2008   0.1903
SyTK-FE/C-T   0.1556   0.2336   0.1979   0.2486
SyTK-FE/D-S   0.1577   0.1655   0.1981   0.2453
SyTK-FE/D-T   0.1579   0.1886   0.1984   0.2387

One reason could be the relatively lower performance of French parsing compared to English parsing (see Table 5.1). However, we showed in Section 5.2.1 that parser accuracy as measured by Parseval F-score does not appear to affect the quality of downstream QE. Therefore, we seek the answer in the difference between the annotation schemes of the English Penn Treebank (PTB) and the French Treebank (FTB). The FTB is known to have a relatively flatter structure (Schluter and van Genabith, 2007). For example, it lacks a verb phrase (VP) node, and phrases modifying the verb are siblings of the verb nucleus. In the next section, we examine the effect of this characteristic of French parse trees on the quality estimation systems built using those trees.

5.4 Modifying French Parse Trees

In order to check whether the annotation strategy is a reason for the lower performance of the French constituency tree kernels, we apply a set of three heuristics which introduce more structure into the French parse trees (Heuristics 1 and 2) or simply make them more PTB-like (Heuristic 3):

• Heuristic 1 automatically adds a VP node above the verb node (VN) and at most 3 of its immediately adjacent nodes if they are noun or prepositional phrases (NP or PP) (see the sketch below).

• Heuristic 2 stratifies some of the production rules in the tree by grouping together every two identical adjacent POS tags under a new node whose tag is the POS tag suffixed with St.

• Heuristic 3 moves coordinated nodes (the immediate left sibling of the COORD node) under COORD.

[Figure 5.2: Application of the tree modification heuristics to example French translation parse trees]

Figure 5.2 shows examples of the application of each of these methods. These heuristics are applied to the parses of the News and SymForum data in the following sections, and their effect on the resulting QE systems is examined.
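The following is a minimal sketch of Heuristic 1 using nltk.tree.Tree; the attachment conditions (exactly which neighbouring NP/PP nodes are absorbed) are only approximated here, and the example tree is invented rather than taken from the data.

```python
from nltk.tree import Tree

def add_vp(tree, max_args=3):
    """Heuristic 1 (sketch): wrap each VN node and up to `max_args` of the
    NP/PP nodes immediately following it under a new VP node (in place)."""
    for child in tree:
        if isinstance(child, Tree):
            add_vp(child, max_args)              # transform deeper levels first
    i = 0
    while i < len(tree):
        child = tree[i]
        if isinstance(child, Tree) and child.label() == "VN":
            j = i + 1
            while (j < len(tree) and j - i - 1 < max_args
                   and isinstance(tree[j], Tree)
                   and tree[j].label() in ("NP", "PP")):
                j += 1
            tree[i:j] = [Tree("VP", [tree[k] for k in range(i, j)])]
        i += 1
    return tree

sent = Tree.fromstring(
    "(SENT (NP (D le) (N chat)) (VN (V mange)) (NP (D la) (N souris)) (PONCT .))")
print(add_vp(sent))
# (SENT (NP (D le) (N chat)) (VP (VN (V mange)) (NP (D la) (N souris))) (PONCT .))
```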

5.4.1 The News Data Set

We first apply these heuristics to the parsed MT output of the News data set in the English-French translation direction and rebuild the tree kernel system with target constituency trees (SyTK/C-T) and the full tree kernel system (SyTK/CD-ST) with the modified trees. The results are presented in Table 5.9.

Despite the possibility of introducing linguistic errors, these heuristics yield a statistically significant improvement in QE performance for all settings except for TER prediction with SyTK/C-T, which degrades considerably.10 Unsurprisingly, the changes are bigger for the system with only target constituency trees, as three other tree types are involved in the full system (SyTK/CD-ST). It is nonetheless surprising that TER prediction degrades with the modified French constituency trees when they are used alone but improves when they are combined with other trees. These results suggest that the structure of the French constituency trees can be a factor in the lower performance of their tree kernels in QE.11

The gain achieved by applying these heuristics is related to the fact that more similar fragments, which are useful for the tree kernel system, can be extracted from the modified structure. For example, in the original top left tree in Figure 5.2, there is no chance that a fragment consisting only of VN and NP – a very common structure and thus useful in calculating tree similarity – will be extracted by the subset tree kernel. The reason is that this kernel type does not allow a production rule to be split (in this case the rule expanding the S node). However, after applying Heuristic 1, the fragment equivalent to the VP -> VN NP production rule can easily be extracted.

Among the three heuristics, the first one contributes the largest part of the improvement; the other two have a very slight effect according to the results of their individual application, though they contribute to the overall performance when all three are combined. There are 12,060 VP nodes and 2,059 St nodes in the whole data set after the application of Heuristics 1 and 2 respectively. There are also 3,276 COORD nodes in the data set. This difference could explain the performance gap between the first and the other two heuristics.

In analysing the degradation observed in TER prediction, we found only Heuristic 1 responsible.

10 Despite being large, this degradation is not statistically significant.
11 We also see a slightly smaller improvement for the hand-crafted features using the modified French trees. The combination of tree kernels and hand-crafted features with the modified trees leads to a statistically significant improvement over the combination with the original trees.

Table 5.9: Tree kernel QE performance with modified French trees on the News data set (m: modified trees)

               BLEU              1-TER             Meteor
               RMSE     r        RMSE     r        RMSE     r
SyTK/C-T       0.1608   0.1479   0.1925   0.2018   0.1620   0.2124
SyTK/C-Tm      0.1591   0.2143   0.1940   0.1700   0.1602   0.2580
SyTK/CD-ST     0.1581   0.2437   0.1888   0.2774   0.1595   0.2715
SyTK/CD-STm    0.1574   0.2609   0.1880   0.2918   0.1588   0.2862

Table 5.10: Tree kernel QE performance with modified French trees on the SymForum data set (m: modified trees)

               1-HTER            HBLEU             Adequacy          Fluency
               RMSE     r        RMSE     r        RMSE     r        RMSE     r
SyTK/C-T       0.2324   0.3068   0.2787   0.2982   0.2277   0.4127   0.2458   0.4821
SyTK/C-Tm      0.2302   0.3331   0.2778   0.3114   0.2265   0.4236   0.2444   0.4912
SyTK/CD-ST     0.2267   0.3693   0.2721   0.3559   0.2258   0.4306   0.2431   0.5013
SyTK/CD-STm    0.2257   0.3800   0.2715   0.3622   0.2253   0.4359   0.2425   0.5056

The other two heuristics are helpful in this setting. Nevertheless, and setting aside the fact that this performance drop is not statistically significant, this suggests that the success of using modified French trees in improving tree kernel performance depends on the data set, score distribution, problem setting and even the task at hand, and may not be generalisable. We explore this question by applying the modification to QE with the SymForum data set as a different data set, and to parser accuracy prediction as a different task (and data), in the next two sections.

5.4.2 The SymForum Data Set

We investigate the effect of the French tree modification heuristics on the parse trees of the SymForum data set here. Table 5.10 compares the performance of the tree kernel QE systems using original and modified target side (French) trees.

When only the target side constituency trees are used (SyTK/C-Tm), all scores improve with the modified trees, but only the score differences for HTER prediction are statistically significant. Compared to the changes observed using the News data set, 1) the improvements are smaller and 2) the modified trees are useful for HTER prediction (though this is not directly comparable to TER prediction in that case), confirming that the effectiveness of the modification depends on the data set and settings.

Interestingly, when the modified trees are used in the full tree kernel system SyTK/CD-STm, we see more statistically significant changes than when they are used alone in SyTK/C-Tm, in spite of the fact that three other types of trees are also involved in the full system; all Pearson r differences are statistically significant, plus the RMSE difference of HTER prediction, despite their small size. This indicates a synergy between the modified French constituency trees and the other three types of trees, i.e. the English constituency and dependency trees and the French dependency trees.

Table 5.11: Parser Accuracy Prediction (PAP) performance with tree kernels using original and modified French trees (m)

        RMSE     r
PAP     0.1239   0.4035
PAPm    0.1233   0.4197

5.4.3 Parser Accuracy Prediction

To further validate the effectiveness of the French tree modification heuristics, we choose a different task, parser accuracy prediction, the aim of which is to predict the accuracy of a parse tree without a reference (QE for parsing). The task was previously explored for English by Ravi et al. (2008). We build a tree kernel model to predict the accuracy of French parses. To train the system, we parse the training section of the FTB with our French parser and score the resulting trees using F1. We use the FTB development set to tune the SVM C parameter and test the model on the FTB test set. Two parser accuracy prediction models are then built using this setting, one with the original parse trees and the second with the modified parse trees produced using the three heuristics listed above. The results are presented in Table 5.11. PAP is the system with the original parse trees and PAPm the one with their modified version.

Both RMSE and Pearson r improve with the modified trees, and the r improvement is statistically significant. Although the improvement is not as large as the one we observed for the QE for MT task, the results add weight to our claim that the structure of the FTB trees should be optimised for use in tree kernel learning.

5.5 Summary and Conclusion

We conducted an in-depth analysis of syntax-based quality estimation based on the systems built in Chapter 4. We investigated the effect of parser accuracy on the performance of the syntax-based quality estimation systems by varying the accuracy of the parses used by these systems. The low-accuracy parsers were generated by using only a fraction of the training data used by the original parsers, as well as by using the output of a lower split/merge cycle of our iterative PCFG-LA parsers. We showed that parser accuracy measured by intrinsic metrics does not predict the accuracy of quality estimation built on these parses. This was confirmed on both data sets and using various parsers. This led to a rejection of the hypothesis put forth in the previous chapter that noisy forum text parsing was the reason behind syntax-based QE being less effective with the SymForum data set. Instead, we conjecture that this can be attributed to an aspect of parsing quality different from that measured by the classic PARSEVAL metric, such as inconsistency of the parsing within the document due to structural inconsistencies of the forum text.

By teasing apart the roles played by the source and target syntax in QE, we observed that the French constituency trees on the target side were far less useful than the English trees on the source side. Building the same QE systems in the opposite direction using the same data, we found that the poor-quality parses of the machine translation output, resulting from its potentially ill-formed nature, were not responsible for this performance gap. Instead, the structure of the French Treebank, on which our parsers were built, was found to be the culprit. We introduced a set of heuristics which added more structure to the French parse trees.

The resulting constituency trees proved to be more useful for the QE systems using them, especially via tree kernels and on the News data set.

Chapter 6

Semantic Role Labelling of French

One of the objectives of this thesis is to investigate the utility of semantic information in the quality estimation of machine translation. Such information, when extracted from the source and its translation, can help estimate how much of the meaning of the source is retained in the translation, i.e. translation adequacy. Semantic role labelling (SRL) (Gildea and Jurafsky, 2002) provides a shallow level of semantic analysis in the form of who did what to whom, how, where, when, etc. in a sentence. Although SRL is not a complete representation of the semantics of the sentence, it can be useful in measuring the adequacy of the translation.

SRL is the task of identifying the predicates in a sentence and their semantic arguments, along with the role each of those arguments takes. The outcome of the process is the predicate-argument structure of the sentence. Figure 6.1 shows an example of a sentence for which the predicate and its arguments are marked. The last decade has seen considerable attention paid to SRL (Màrquez et al., 2008), thanks to the existence of two major hand-crafted resources for English, namely FrameNet (Baker et al., 1998) and PropBank (Palmer et al., 2005). However, apart from English, only a few languages have SRL resources, and these resources tend to be of limited size compared to the English ones. These languages include German, Japanese, Spanish, Catalan, Czech and Chinese, mostly introduced in the CoNLL-2009 shared task (Hajič et al., 2009) on syntactic and semantic dependencies in multiple languages.

[The index]Arg1 [rose]predicate [1.1%]Arg2 [last month]Arg-TMP.

Figure 6.1: Semantic role labelling of the WSJ sentence: The index rose 1.1% last month.

French is one of those languages which suffer from a scarcity of hand-crafted SRL resources. The only available gold-standard resource is a small set of 1000 sentences from Europarl (Koehn, 2005), manually labelled with PropBank verb predicates (van der Plas et al., 2010b). This dataset, however, is small, and its main purpose is to evaluate various approaches to projecting the semantic role annotation of English sentences onto their French translations (van der Plas et al., 2011).

The semantic-based quality estimation approaches taken in this work are essentially based on the predicate-argument structure correspondence between the source text and its translation. Therefore, it is important that these structures are accurately extracted from both the source and its translation. Consequently, if, for example, a predicate is identified by the semantic role labeller of the source text but its translation is missed by the semantic role labeller of the target text although it has been correctly translated, the quality estimation system will be misled. This requires the SRL of the source and target sides not only to be of high quality but also of balanced quality. However, the lack of French SRL resources can be an obstacle to the fulfilment of these requirements. Therefore the focus in this chapter is on addressing this problem.

We build on the work of van der Plas et al. (2010b), who tackle the lack of a reasonably-sized SRL data set for French by building a large, "artificial" or automatically labelled dataset of approximately 1M Europarl sentences, created by projecting the semantic role labelling from English sentences onto their French counterparts, and using it for training an SRL system. In particular, we attempt to answer the following questions:

• How much artificial data is needed to train an SRL system?

• Is it better to use direct translations than indirect translations, i.e. is it better to use for projection a source-target pair where the source is the original sentence and the target is its direct translation, as opposed to a source-target pair where the source and target are both translations of an original sentence in a third language?

• Is it better to use coarse-grained syntactic information (in the form of universal part-of-speech tags and universal syntactic dependencies) than to use fine-grained syntactic information?

• What type of word alignments are more useful for projection?

• Is a large set of this artificial data better than a small set of hand-annotated data?

The answers to these questions will help us find the best method to obtain the semantic role labels of our French data. In the rest of this chapter, we review the work carried out on semantic role labelling of French in Section 6.1. We then introduce the data we use as well as the SRL system and evaluation setting in Section 6.2. Section 6.3 then presents the experiments.

6.1 Related Work

There have been relatively few works addressing semantic role labelling of French. While one research direction tries to avoid relying on hand-crafted data through unsupervised training of SRL models, the existing work mostly focuses on automatically or semi-automatically generating the resources for this language. Lorenzo and Cerisara (2012) propose a clustering approach for verb predicate and argument labelling (but not identification). They cluster verbs into 300 and 60 clusters and argument roles into 40 and 10 clusters for English and French respectively. Since the approach is unsupervised, it can be applied to any language and labelling scheme. They choose VerbNet-style roles (Schuler, 2006) and manually annotate a set of sentences with them for evaluation, achieving an F1 of 78.5.

Gardent and Cerisara (2010) propose a method for semi-automatically annotating the French dependency treebank (Candito et al., 2010) with PropBank core roles (no adjuncts). They first manually augment TreeLex (Kupść and Abeillé, 2008), a syntactic lexicon of French, with semantic roles for the syntactic arguments of verbs (i.e. verb subcategorization). They then project this annotation to verb instances in the dependency trees. They evaluate their approach by performing error analysis on a small sample and suggest directions for improvement. The annotation work is however at a preliminary stage and no data has been published. As mentioned earlier, van der Plas et al. (2011) use word alignments to project the semantic role labelling of the English side of Europarl to its French side, resulting in a large artificial dataset. This idea is based on the Direct Semantic Transfer hypothesis, adapted from the Direct Correspondence Assumption hypothesis of Hwa et al. (2005) for syntactic dependency trees. It assumes that a semantic relationship between two words in a sentence can be transferred to any two words in the translation which are aligned to these source-side words. Due to language variations and translation shifts, this assumption is strong and will not always hold. Therefore, they additionally filter out sentences with incomplete projections. According to the results, the projected annotations suffer mainly from low recall, whereas the precision is acceptable. Evaluation on their 1K manually-annotated dataset shows that a joint syntactic-semantic dependency parser trained on this artificial data set performs significantly better than directly projecting the labelling from the English side. This is promising because in a real-world scenario, the English translations of the French data to be annotated do not necessarily exist. Padó and Lapata (2009) also make use of word alignments to project semantic role labelling from English to German. The word alignments are used to compute the alignment between syntactic constituents. In order to determine the extent of semantic correspondence between English and German, they manually annotate a set of parallel sentences and find that about 72% of the frames and 92% of the argument roles exist in both sides, ignoring their lexical correspondence.

They consider these numbers reassuring as such disagreements are expected even within a single language between the annotators. The projection is carried out at two levels: words and syntactic constituents. The best projection performance is achieved via constituent-based projection, which is 56 F1 points. Similar to the projection performance of van der Plas et al. (2011), recall is considerably lower than the precision. The results show that constituent-based projections are more robust to word alignment errors but further suffer from the SRL errors in the source and the parsing errors in the target.

6.2 Experimental Setup

6.2.1 Data

We use the two datasets described by van der Plas et al. (2011) and the delivery report of the Classic project (van der Plas et al., 2010a): the gold standard set of 1K sentences and the synthetic data set consisting of about 980K sentences.1 The gold standard data set (henceforth known as Classic 1K) was built by manually identifying each verb predicate, finding its equivalent English frameset in PropBank and identifying and labelling its arguments based on the description of the frameset. The synthetic dataset, on the other hand, was created as follows: the English side of an English-French parallel corpus (Europarl) was parsed using the joint syntactic-semantic parser described by Titov et al. (2009); the English and French sentences were then word-aligned using GIZA++ (Och and Ney, 2003).

The dependency parses of the French side were produced using the ISBN parser (Titov and Henderson, 2007) described in Section 4.3 of Chapter 4, and the French SRLs by projecting the English SRLs via the word alignments. This dataset will henceforth be known as Classic 980K.

1 We obtained the data by contacting the authors.
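To make the projection step concrete, the following is a minimal sketch of annotation transfer via word alignments. The representation used here (propositions as (predicate index, {argument index: role}) pairs and alignments as index pairs) is assumed purely for illustration; the actual Classic pipeline operates on CoNLL-style files and applies additional filtering.

```python
def project_srl(en_props, alignment):
    """Project English predicate-argument structures to French via word alignment.

    en_props: list of (pred_idx, {arg_idx: role}) for the English sentence.
    alignment: set of (en_idx, fr_idx) word-alignment pairs.
    Returns French propositions in the same format; unaligned predicates
    or arguments are simply dropped (one source of the low projection recall).
    """
    # Map each English token index to the French tokens aligned to it.
    en2fr = {}
    for en_idx, fr_idx in alignment:
        en2fr.setdefault(en_idx, []).append(fr_idx)

    fr_props = []
    for pred_idx, args in en_props:
        fr_preds = en2fr.get(pred_idx, [])
        if not fr_preds:
            continue  # predicate lost in projection
        fr_args = {}
        for arg_idx, role in args.items():
            for fr_idx in en2fr.get(arg_idx, []):
                fr_args[fr_idx] = role
        fr_props.append((fr_preds[0], fr_args))
    return fr_props

# Toy example: "anyone can help" ~ "quelqu'un peut aider"
en_props = [(2, {0: "A0", 1: "AM-MOD"})]   # help(pred): anyone=A0, can=AM-MOD
alignment = {(0, 0), (1, 1), (2, 2)}
print(project_srl(en_props, alignment))    # [(2, {0: 'A0', 1: 'AM-MOD'})]
```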

6.2.2 SRL

We use LTH (Björkelund et al., 2009), a dependency-based SRL system, in all of our experiments. This system was among the best-performing systems in the CoNLL-2009 shared task (Hajič et al., 2009) and is straightforward to use. It comes with a set of features tuned for each shared task language (English, German, Japanese, Spanish, Catalan, Czech, Chinese). Although French is not included among those languages, since the features are language independent,2 we can borrow feature sets tuned for other languages.3 We compared the performance of the English and Spanish feature sets on French and chose the former due to its higher performance

(by 1 F1 point).4 To evaluate SRL performance, we use the CoNLL-2009 shared task scoring script.5 This scoring scheme assumes a semantic dependency between each argument and its predicate, and between each predicate and a dummy root node. It then calculates the unlabelled and labelled precision (P), recall (R) and F1 of these dependencies. We report unlabelled and labelled scores for each experimental setting.
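As an illustration of this scoring scheme, the sketch below computes unlabelled and labelled precision, recall and F1 over such semantic dependencies. It is a simplification of the official eval09.pl script, which handles further details, and the data representation is assumed for illustration only.

```python
def prf(gold, pred):
    """Precision, recall and F1 of a predicted dependency set against gold."""
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def semantic_dependencies(props, labelled=True):
    """Turn propositions into a set of semantic dependencies.

    props: list of (pred_idx, pred_sense, {arg_idx: role}).
    Each argument yields an argument-to-predicate dependency and each predicate
    a predicate-to-root dependency, roughly following the CoNLL-2009 scheme.
    """
    deps = set()
    for pred_idx, sense, args in props:
        deps.add((pred_idx, "ROOT", sense) if labelled else (pred_idx, "ROOT"))
        for arg_idx, role in args.items():
            deps.add((arg_idx, pred_idx, role) if labelled else (arg_idx, pred_idx))
    return deps

gold = [(2, "help.01", {0: "AM-MOD", 1: "A0"})]
pred = [(2, "help.01", {0: "AM-MOD", 1: "A1"})]   # one mislabelled argument
for labelled in (False, True):
    scores = prf(semantic_dependencies(gold, labelled),
                 semantic_dependencies(pred, labelled))
    print("labelled" if labelled else "unlabelled",
          ["%.2f" % s for s in scores])
```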

6.3 Experiments

6.3.1 Learning Curve

The ultimate goal of SRL projection is to build a training set which partially compensates for the lack of hand-crafted resources. van der Plas et al. (2011) report encouraging results showing that training on their projected data is preferable to directly obtaining the annotation via projection, which is not always possible. Although the quality of such automatically generated training data may not be

2 The system has a reranker which has a language-specific feature. We however do not use the reranker for labelling French.
3 The features for each language are selected from a larger repository of features, via a feature tuning procedure for those languages.
4 Note that the Classic data is annotated for only verb predicates. Therefore, we remove features for nominal predicates from all LTH feature sets.
5 https://ufal.mff.cuni.cz/conll2009-st/eval09.pl

comparable to the manual one, the possibility of building much bigger data sets may provide some advantage. Our first experiment investigates the extent to which the size of the synthetic training set can improve the performance.

We randomly select 100K sentences from Classic 980K, shuffle them and split them into 20 subsets of 5K sentences. We then split the first of these 5K subsets into 10 sets of 500 sentences. We train SRL models on incrementally larger training sets built from these splits and evaluate the resulting 29 models on the Classic 1K using LTH. The learning curve is presented in Figure 6.2. Surprisingly, the best F1 (58.71) is achieved with only 4K sentences, and after that point, though the precision shows a positive trend, the recall and subsequently the F1 tend to drop. This suggests that the additional sentences do not bring more information.

[Figure 6.2: Learning curve with 100K training data of projected annotations; the plot shows precision, recall and F1 against the number of training sentences (0 to 100,000)]

The large gap between precision and recall is also interesting, showing that the complete spectrum of semantic roles cannot be covered by the projected data. This can be observed in Table 6.1, where the counts of the ten most frequent arguments in the source side and in the target side (obtained by projection) of the best-performing 5K sample are presented together with their ratios. According to these figures, less than half of the arguments in the source side appear in the target side, even for A1 as the most frequent argument. This ratio is even smaller for most of the other arguments, especially for A2 as the third most frequent argument.

Table 6.1: The counts of ten most frequent arguments in the source and target side of a 5K projected sample and their ratios

Role      Source #   Target #   Ratio
A1        13,845     6,209      44.8%
A0        8,917      5,609      62.9%
A2        3,375      751        22.3%
AM-TMP    1,684      419        24.9%
AM-MOD    1,666      767        46.0%
AM-MNR    1,323      208        15.7%
AM-LOC    1,094      367        33.5%
AM-ADV    989        253        25.6%
AM-DIS    689        405        58.8%
AM-NEG    486        313        64.4%
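The counts and ratios in Table 6.1 can be derived with a simple count over the source-side and projected target-side annotations, as in the sketch below; the proposition representation is the same illustrative one assumed earlier and is not the format of the released data.

```python
from collections import Counter

def argument_ratios(source_props, target_props, top_n=10):
    """Count argument role labels on the source and projected target side and,
    for the most frequent source roles, report the target/source ratio."""
    src = Counter(role for sent in source_props for _, args in sent
                  for role in args.values())
    tgt = Counter(role for sent in target_props for _, args in sent
                  for role in args.values())
    rows = []
    for role, n_src in src.most_common(top_n):
        n_tgt = tgt.get(role, 0)
        rows.append((role, n_src, n_tgt, 100.0 * n_tgt / n_src))
    return rows

# Toy corpora of one sentence each, in (pred_idx, {arg_idx: role}) format.
source = [[(2, {0: "A0", 4: "A1", 6: "AM-TMP"})]]
target = [[(1, {0: "A0"})]]   # A1 and AM-TMP lost during projection
for role, n_src, n_tgt, ratio in argument_ratios(source, target):
    print(f"{role}\t{n_src}\t{n_tgt}\t{ratio:.1f}%")
```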

6.3.2 Direct Translations

Each sentence in the Europarl corpus was written in one of the official languages of the European Parliament and translated to all of the other languages. Therefore, both sides of a parallel sentence pair can be indirect translations of each other. van der Plas et al. (2011) suggest that translation divergence may affect automatic projection of semantic roles. They therefore select for their experiments only those 276K sentences from the 980K which are direct translations between English and French.6 Motivated by this idea, we replicate the learning curve presented in Section 6.3.1 with another set of 100K sentences randomly selected from only the direct translations. The curve is shown in Figure 6.3. There is no noticeable difference between this and the learning curves in Figure 6.2, suggesting that the projections obtained via direct translations are not of higher quality.

6 The Classic data is only provided as the 980K set and we did not have access to the 276K subset they used. However, we utilized the source language information available in the source release of Europarl to extract the direct translations.
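A minimal sketch of this filtering step is shown below. It assumes that the original language of each sentence pair has already been recovered from the language attributes in the Europarl source release; the field names and values are illustrative assumptions.

```python
def direct_translation_pairs(pairs):
    """Keep only English-French pairs where one side is the original.

    pairs: iterable of (en_sentence, fr_sentence, original_language) tuples,
    where original_language is the language the sentence was first written in
    (recoverable from the source release of Europarl).
    """
    return [(en, fr) for en, fr, orig in pairs if orig in ("EN", "FR")]

corpus = [
    ("I agree with the rapporteur.", "Je suis d'accord avec le rapporteur.", "EN"),
    ("The debate is closed.", "Le débat est clos.", "DE"),  # indirect: original is German
]
print(direct_translation_pairs(corpus))   # keeps only the first pair
```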

[Figure 6.3: Learning curve with 100K training data of projected annotations on only direct translations; the plot shows precision, recall and F1 against the number of training sentences (0 to 100,000)]

6.3.3 Impact of Syntactic Annotation

Being a dependency-based semantic role labeller, LTH employs a large set of features based on the syntactic dependency structure. This inspires a comparison of the impact of different types of syntactic annotation on the performance of this system.

Based on the observations from the learning curves generated in the previous sections, we choose two different sizes of training sets. The first contains 5K sentences from the original 100K, as we saw that adding more than this amount tends to diminish the performance. The second contains the first 50K of the original 100K, the purpose of which is to check whether changing the parses affects the usefulness of adding more data. We will call these data sets Classic 5K and Classic 50K respectively. The following sections each describe the syntactic annotation used and present the performance of SRL systems trained on the two data sets annotated with it.

6.3.3.1 Universal POS Tags

Petrov et al. (2012) create a set of 12 universal part-of-speech (POS) tags which should in theory be applicable to any natural language. It is interesting to know whether these POS tags are more useful for semantic role labelling than the original, more fine-grained set of 29 French Treebank POS tags we have used so far. To this end, we convert the original POS tags of the data to the universal tags and retrain and evaluate the SRL models. The results are given in Table 6.2 (OrgDep+UniPOS). The first row of the table (Original) shows the

Classic 5K
                     Unlabelled                Labelled
                     P       R       F1        P       R       F1
Original             85.95   59.64   70.42     71.34   49.50   58.45
OrgDep+UniPOS        86.71   60.46   71.24     71.11   49.58   58.43
StdUniDep+UniPOS     86.14   59.76   70.57     70.60   48.98   57.84
CHUniDep+UniPOS      85.98   59.21   70.13     70.66   48.66   57.63

Classic 50K
                     Unlabelled                Labelled
                     P       R       F1        P       R       F1
Original             86.67   58.07   69.54     72.44   48.54   58.13
OrgDep+UniPOS        86.82   58.71   70.05     72.30   48.90   58.34
StdUniDep+UniPOS     86.38   58.90   70.04     71.61   48.83   58.07
CHUniDep+UniPOS      86.47   58.26   69.61     71.74   48.34   57.76

performance using the original annotation. Even though the scores increase in most cases compared to those of the Original setting – due mostly to a rise in recall – the changes are small. It is worth noting that the identification task (unlabelled scores) seems to benefit more from the universal POS tags.
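Concretely, the conversion amounts to a table lookup from fine-grained to universal tags. The sketch below shows the idea with a small, illustrative subset of a French Treebank mapping; the full mapping tables are those distributed by Petrov et al. (2012), and the entries below are examples rather than the complete set.

```python
# A few illustrative French Treebank tags mapped to the 12 universal POS tags
# of Petrov et al. (2012); this is a partial, example mapping only.
FTB_TO_UNIVERSAL = {
    "NC": "NOUN", "NPP": "NOUN",
    "V": "VERB", "VINF": "VERB", "VPP": "VERB",
    "ADJ": "ADJ", "ADV": "ADV",
    "DET": "DET", "P": "ADP", "CLS": "PRON",
    "CC": "CONJ", "CS": "CONJ", "PONCT": ".",
}

def to_universal(tagged_sentence):
    """Map a list of (token, fine_grained_tag) pairs to universal POS tags."""
    return [(tok, FTB_TO_UNIVERSAL.get(tag, "X")) for tok, tag in tagged_sentence]

print(to_universal([("quelqu'un", "CLS"), ("peut", "V"), ("aider", "VINF")]))
```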

6.3.3.2 Universal Dependencies and POS Tags

Similar to universal POS tags, McDonald et al. (2013) introduce a set of 40 universal dependency types which generalize over the dependency structures specific to several languages. For French, they provide a new treebank, called uni-dep-tb, in which 16,422 sentences from various domains are manually annotated with these dependencies. We now explore the utility of this new dependency scheme in semantic role labelling. The French universal dependency treebank comes in two versions. The first version uses the standard dependency structure based on basic Stanford dependencies (de Marneffe and Manning, 2008), where content words are the heads except in copula and adposition constructions. The second version treats content words as the heads for all constructions without exception. We use both schemes in order to verify their effect on SRL.

In order to obtain universal dependencies for our data, we train parsing models with MaltParser (Nivre et al., 2006) using the entire uni-dep-tb.7 There are two types of POS tag sets in the French uni-dep-tb. One is coarse-grained, containing the universal POS tags described earlier, and the other is fine-grained, containing two additional tags on top of the universal ones: AUX for auxiliaries and PNOUN for pronouns. Since we do not have a tagger to obtain these fine tags for our data, we only experiment with the universal POS tags.8 We then parse our data using these MaltParser models. The input POS tags to the parser are the universal POS tags used in OrgDep+UniPOS. We train and evaluate new SRL models on this data. The results are shown in Table 6.2. StdUniDep+UniPOS is the setting using standard dependencies and CHUniDep+UniPOS the one using content-head dependencies. When comparing the performance of the original parses with that of the universal dependency parses, it should be noted that, in addition to the difference in dependency schemes, the parsers as well as their training data are different. Since we are using the same universal POS tags in training the SRL models, the effect of the universal dependencies can be directly compared to that of the original ones by comparing these results to OrgDep+UniPOS. According to the results, the use of universal dependencies has only a modest (negative) effect. It however appears that content-head dependencies are slightly less useful than standard dependencies. Overall, we observe that the universal annotations can be reliably used when the fine-grained annotation is not available. This can be especially useful for languages which lack such resources and require techniques such as cross-lingual transfer to replace them.

7 Based on our preliminary experiments on the parsing performance, we use LIBSVM as the learning algorithm, nivreeager as the parsing algorithm for the standard dependency models and stackproj for the content-head ones.
8 This is possible either by replacing the fine-grained tags with universal ones in the data or by changing the MaltParser feature files to make use of the CPOSTAG column instead of the POSTAG one.
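The first option mentioned in footnote 8 can be implemented as a one-pass rewrite of the CoNLL-X files, copying the coarse CPOSTAG column over the fine-grained POSTAG column, as in the following sketch (column positions follow the standard CoNLL-X layout).

```python
def coarsen_postags(in_path, out_path):
    """Rewrite a CoNLL-X file so that the fine-grained POSTAG column (5) is
    replaced by the coarse CPOSTAG column (4), i.e. only universal POS tags
    remain available to the parser."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in src:
            if not line.strip():          # blank line = sentence boundary
                out.write(line)
                continue
            cols = line.rstrip("\n").split("\t")
            cols[4] = cols[3]             # 0-based: index 3 is CPOSTAG, 4 is POSTAG
            out.write("\t".join(cols) + "\n")
```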

Table 6.3: Projecting English SRL from source side of Classic 1K data to the target side using various alignments

                     Unlabelled                Labelled
                     P       R       F1        P       R       F1
Intersection         79.32   25.45   38.53     56.76   18.21   27.57
Source-to-target     71.17   42.58   53.29     47.33   28.32   35.43
Union                55.76   50.56   53.04     36.04   32.68   34.28

6.3.4 The Impact of Word Alignment

We observed that projection via the intersection of the word alignments in the two translation directions was too constrained, so that a significant amount of labelling could not be transferred to the target. We build a system with intersection alignments as the baseline and compare it to the projections via two other alignment heuristics: source-to-target and union. The source-to-target method loosens the restriction of the intersection method by selecting all the alignments in this direction, regardless of whether they match those in the opposite direction. The union method lifts all the restrictions and merges the alignments of both directions. We evaluate the projections using these word alignments on the Classic 1K data set. The source side is labelled with LTH trained on the whole CoNLL-2009 shared task data set and the alignments are obtained using GIZA++ via the Moses toolkit (Hoang et al., 2009). The results are shown in Table 6.3. As expected, union alignments compensate for the very low recall of the intersection ones, but at the cost of precision. However, the cost is smaller than the gain, so the resulting F1 is considerably higher with union alignments. On the other hand, source-to-target alignments lie in between in terms of precision and recall.
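The three symmetrisation heuristics compared here can be summarised as simple set operations over the two directional alignment sets, as the following sketch illustrates (the alignment representation is assumed for illustration; in practice the alignments come from GIZA++ via Moses).

```python
def symmetrise(src2tgt, tgt2src, method):
    """Combine two directional word alignments.

    src2tgt: set of (src_idx, tgt_idx) links from the source-to-target model.
    tgt2src: set of (src_idx, tgt_idx) links from the target-to-source model
             (already flipped into source-target order).
    """
    if method == "intersection":
        return src2tgt & tgt2src
    if method == "union":
        return src2tgt | tgt2src
    if method == "source-to-target":
        return set(src2tgt)
    raise ValueError(method)

s2t = {(0, 0), (1, 1), (2, 2), (2, 3)}
t2s = {(0, 0), (1, 1), (3, 3)}
for m in ("intersection", "source-to-target", "union"):
    print(m, sorted(symmetrise(s2t, t2s, m)))
```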

Nevertheless, the F1 scores of the source-to-target alignments are the highest, though only slightly higher than those of the union alignments. Overall, both of the new alignments seem to suit projection better than the intersection ones. This is in contrast to other projection works (Padó and Lapata, 2009; van der Plas et al., 2011) which have used intersection alignments with the hope

Table 6.4: Training French SRL on projected English SRL from source side of Classic 5K data to the target side using various alignments (Intsc: intersection, S2T: source-to-target)

                     Unlabelled                Labelled
                     P       R       F1        P       R       F1
Intersection         84.54   43.69   57.61     69.76   36.05   47.53
Source-to-target     82.92   51.76   63.73     67.07   41.86   51.55
Union                82.09   59.49   68.99     64.19   46.52   53.95

that less noisy alignments will lead to higher projection performance.9 Between union and source-to-target alignments, one can choose based on the requirements of the downstream task in terms of the priority of precision or recall. Union alignments seem to be more balanced in this respect. In addition to evaluating the effect of word alignment algorithms on directly projecting the annotation to the target data, we also evaluate the performance of an SRL model trained on the projected annotations and applied to this target data. We project the original SRL annotation of the source side of the Classic 5K to its target side using the three alignment heuristics we used above. We then train an SRL model using each of these three projected annotations and evaluate it on the Classic 1K data set. The performances are given in Table 6.4.10 As expected, the model trained on the projections of intersection alignments achieves a higher precision. However, the precision gaps between the three methods are not as large as those seen in the direct projection results in Table 6.3. This probably occurs because some of the erroneous labels resulting from noisy alignments are not given a high importance during the training of the SRL system due to their inconsistency. Recall results also follow the same pattern as in direct projection; lifting the restrictions increases recall. The recall gaps are bigger than the precision gaps so

9 We tried all the other alignment heuristics implemented in Moses. They all ranked higher than the intersection but lower than the union and source-to-target alignments.
10 Note that the 5K model performance using intersection alignments for projection in this table is not comparable to the 5K model built on the original projection presented in Table 6.2 (first row), which also used intersection alignments, because we were not able to replicate their projections.

that the resulting F1 scores are significantly higher with the new word alignments. However, the gains are not as large as those observed in the direct projection results. One difference with direct projection is that union alignments now lead to the highest F1 rather than source-to-target alignments. Again, the choice of alignments can be based on the priority of precision or recall in the downstream application. When there is no difference, union alignments seem to be the best option. Another notable difference is that the performance of semantic role labelling using a model trained on projected annotations is substantially higher than that of directly projecting the annotations. This is on a par with the results of van der Plas et al. (2011).

6.3.5 Quality vs. Quantity

In Section 6.3.1, we saw that adding more data annotated through projection did not improve the performance of semantic role labelling. In other words, the same performance was achieved using only a small amount of data. This is contrary to the motivation for creating synthetic training data, especially when hand-annotated data already exist, albeit small in size. In this section, we compare the performance of SRL models trained using manually annotated data with SRL models trained using 5K sentences of artificial (synthetic) training data. We use the original syntactic annotations for both data sets. To this end, we carry out a 5-fold cross-validation on the Classic 1K. We then evaluate the Classic 5K model on each of the 5 test sets generated in the cross-validation. The average scores of the two evaluation setups are compared. The results are shown in Table 6.5.

While the 5K model achieves higher precision, its recall is far lower, resulting in a dramatically lower F1. This high precision and low recall can be attributed to the low confidence of the model trained on projected data due to a considerable amount of information not being transferred during the projection. A possible reason is that the Classic projection uses the intersection of the alignments in the two translation directions, which is the most restrictive setting and leaves many source predicates

Table 6.5: Average scores of 5-fold cross-validation with Classic 1K, 5K, 1K plus 5K and self-training with 1K seed and 5K unlabelled data (SelfT)

            Unlabelled                Labelled
            P       R       F1        P       R       F1
1K          83.76   83.00   83.37     68.40   67.78   68.09
5K          85.94   59.62   70.39     71.30   49.47   58.40
1K+5K       85.74   66.53   74.92     71.48   55.46   62.46
SelfT       83.82   83.66   83.73     67.91   67.79   67.85

and arguments unaligned, as seen in the previous section. In addition to comparing the performance of the two data sets, we verify the effect of utilizing both data sets at the same time in two different ways. First, we add the Classic 5K data set to the training section of the Classic 1K data in each fold of another cross-validation setting and evaluate the resulting models on the same test sets. The results are reported in the third row of Table 6.5 (1K+5K). As can be seen, the low quality of the projected data significantly degrades the performance compared to when only manually annotated data is used for training. Second, based on the observation that the quality of labelling using manually annotated data is higher than using the automatically projected data, we replicate

1K+5K with the 5K data labelled using the model trained on the training subset of the 1K at each cross-validation fold. In other words, we perform a one-round self-training with this model. The performance of the resulting model, evaluated in the same cross-validation setting, is given in the last row of Table 6.5 (SelfT). As expected, the labelling obtained by models trained on manual annotation is more useful for training than the labelling obtained by projection. It is worth noting that, unlike with the 1K+5K setting, the balance between precision and recall follows that of the 1K model. In addition, some of the scores are the highest among all results, although the differences are small.
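Schematically, the SelfT setting of one cross-validation fold can be described as below. The train and label functions are placeholders for the underlying SRL toolkit (LTH in the actual experiments), so the sketch only shows the control flow, not the real interface.

```python
def self_training_fold(gold_train, gold_test, unlabelled_5k, train, label):
    """One-round self-training as in the SelfT setting.

    gold_train/gold_test: labelled sentences of one cross-validation fold.
    unlabelled_5k: the 5K sentences with their SRL annotation stripped.
    train(data) -> model and label(model, sentences) -> labelled sentences
    are placeholders for the underlying SRL system.
    """
    seed_model = train(gold_train)                  # 1) train on the gold seed
    auto_5k = label(seed_model, unlabelled_5k)      # 2) label the 5K with it
    final_model = train(gold_train + auto_5k)       # 3) retrain on the union
    return label(final_model, gold_test)            # 4) apply to held-out gold

# Dummy stand-ins just to show the control flow.
toy_train = lambda data: {"seen": len(data)}
toy_label = lambda model, sents: [(s, "auto") for s in sents]
print(self_training_fold(["g1", "g2"], ["t1"], ["u1", "u2"], toy_train, toy_label))
```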

Table 6.6: Average scores of 5-fold cross-validation with Classic 1K and 5K using 200 sentences for training and 800 for testing at each fold

          Unlabelled                Labelled
          P       R       F1        P       R       F1
1K        82.34   79.61   80.95     64.14   62.01   63.06
5K        85.95   59.64   70.42     71.34   49.50   58.45

6.3.6 How little is too little?

In the previous section we saw that using a manually annotated data set as small as 800 sentences resulted in significantly better SRL performance than using projected annotations as large as 5K sentences. This unfortunately indicates the need for human labour in creating such resources. It is however interesting to know the lower bound of this requirement. To this end, we reverse our cross-validation setting and train on 200 and test on 800 sentences. We then compare to the 5K models evaluated on the same 800-sentence sets at each fold. The results are presented in Table 6.6. According to the results, even with only 200 manually annotated sentences, the performance is considerably higher than with 5K sentences of the projected annotations. However, as one might expect, compared to when 800 sentences are used for training, this small model performs significantly worse.

6.4 Summary and Conclusion

We explored the projection-based approach to semantic role labelling by carrying out experiments with a large set of French sentences annotated automatically by transferring the labelling from English. We found that increasing the number of these artificial projections that are used in training an SRL system does not improve performance as might have been expected when creating such a resource. Instead it is better to train directly on what little gold standard data is available, even if this dataset contains only 200 sentences. Using a 5-fold cross-validation on the dataset of

1K gold standard annotations, the SRL performance was substantially higher (>10

F1 points in terms of both unlabelled and labelled scores) than when a much larger set of projected annotations is used for training. Moreover, the experiments showed that less restrictive alignment extraction strategies, including extracting the union of the two directional sets or only the source-to-target alignments, lead to better recall and consequently F1, both when used for direct projection to the test data and when used indirectly for creating the training data and then applying the model to the test data. The union alignments result in a lower precision, due to introducing more noise, but in a higher recall, whereas the source-to-target alignments lead to a higher precision but a lower recall. The resulting F1 however is around the same for both approaches. We also compared the use of the universal part-of-speech tags and dependencies to the original, more fine-grained sets and showed that they can be used in SRL with only a small difference in performance. Based on the observations from the experiments in this chapter, the best model for semantic role labelling of the French translations is the one trained on the small available hand-annotated data set of 1000 sentences. We will use such a model in semantic role labelling of the French side of our data set for semantic-based quality estimation in the next chapter.

Chapter 7

Semantic-based Quality Estimation

An important criterion in assessing the quality of a translation is its adequacy. While the fluency of a translation concerns its syntax, the adequacy of a translation is related to its semantics.1 In general, for a translation to be adequate, its semantic analysis should comply with that of its source. Therefore, to automatically judge the adequacy of a translation, both the source and the translation must be semantically analysed. The comparison of these analyses will expose the extent to which the meaning of the source is retained in the translation. Although this procedure may appear to be simple, there are certain challenges hindering the process, since all of these steps must be carried out automatically. Representing the meaning of two sentences, each in a different language, in a unified framework is never an easy task. Moreover, state-of-the-art automatic semantic analysers are far from perfect. This problem is further exacerbated when they are applied to user-generated content and machine translation output. As explained in the previous chapter, semantic role labelling is a well-established shallow semantic representation framework, in which a sentence is decomposed into

1 It is worth noting that the notion of adequacy used here is the one adopted by the SMT community (e.g. NIST, Euromatrix, available at http://www.euromatrix.net/deliverables/Euromatrix_D1.3_Revised.pdf), while translation studies scholars such as Al-Qinai (2000) use the term adequacy in a more generic manner.

its propositions, each represented by a predicate and its arguments, i.e. a predicate-argument structure. Although SRL does not deeply represent the meaning of a sentence, its source/target correspondence can provide useful information about the adequacy of the translation. The previous chapter described the method we use to acquire semantic role labelling for French. Semantic role labelling of English, meanwhile, is well studied and reasonable resources are available for it. In this chapter, we attempt to utilize this information in estimating the quality of machine translation. We use the same methods as those we used for syntax-based QE in Chapter 4, i.e. tree kernels and hand-crafted features. We additionally design a QE metric which is directly based on the Predicate-Argument Structure Match (PAM) between the source and its translation. The metric is used in two ways: 1) as a measure of translation quality by itself, 2) as an indicator of translation quality incorporated as a feature into our hand-crafted QE system. The ultimate outcome of the experiments in this chapter is the QE system built by combining the syntax-based QE system of Chapter 4 with the semantic-based QE system presented in this chapter. This system is also combined with the baseline features introduced in Section 4.2.1. In the rest of this chapter, we first discuss the related work on using semantics in QE in Section 7.1. In Section 7.2, we describe the data and its semantic role labelling. We then move on to the experiments, which start with the tree kernel approach to semantic-based QE (Section 7.3) followed by the hand-crafted approach

(Section 7.4). Next, we introduce PAM, our QE metric, and various ways of using it in quality estimation. Finally, we combine the QE systems we have developed in this work.2

7.1 Related Work

Pighin and Màrquez (2011) propose a sophisticated method for ranking two translation hypotheses that exploits the projection of SRL from the source to its translation

2 For easy comparison, all the experimental results in this chapter are also presented in Table A.1 in Appendix A alongside the other QE results.

using word alignments produced by a constrained translation. They first project the SRL of a source corpus to its parallel reference corpus and then build two translation models using it: 1) translations of proposition labelling sequences in the source to their projections in the target (e.g. A0 verb A1 to A0 A1 verb) and 2) translations of argument role fillers in the source to their projected counterparts in the target. They then project the source SRL to its machine translation and force the above models to translate the source proposition labelling sequences to the projected ones. They finally use the confidence scores of these translations and their reachability (whether the forced translation was possible) to train a classifier which selects the better of the two translation hypotheses. The constrained model they use to generate the alignments fails to produce the translation in 35% of the cases. They highlight this problem as the main limitation of their method. They additionally blame the low recall of the SRL: 6% of the sentences could not be labelled due to the lack of a verb and 3% of the labellings could not be projected due to the loss of the predicate during translation. The highest accuracy achieved by their binary classifier is 64%. Semantic roles have also been used in MT evaluation where reference translations are available. Giménez and Màrquez (2007) use semantic roles in building several MT evaluation metrics which seek to compensate for the shortcomings of other popular metrics such as BLEU in accounting for linguistic aspects of the translation. These metrics measure the full or partial lexical match between the fillers of the same semantic roles in the hypothesis and the reference, or simply the role label matches between them. Their results show that such metrics can be more useful in ranking the translations of heterogeneous systems. However, they conclude that these linguistic features cannot serve as a global measure of translation quality, but only in combination with other features and metrics reflecting different aspects of the quality. Lo and Wu (2011) introduce HMEANT, a manual MT evaluation metric based on predicate-argument structure matching which involves two steps of human

engagement: 1) semantic role annotation of the reference and machine translation, 2) evaluating the translation of predicates and arguments. The metric calculates the

F1 score of the semantic frame match between the reference and machine translation based on this evaluation. To keep the costs reasonable, the first step is done by amateur annotators who were minimally trained, for only a few minutes, with a simplified list of 10 thematic roles. On a set of 40 examples, they meta-evaluate the metric in terms of correlation with human judgements of translation adequacy ranking. They report as high a correlation as that of HTER. Lo et al. (2012) propose MEANT, a variant of HMEANT, which automates its manual steps using 1) automatic SRL systems for (only) verb predicates and 2) automatic alignment of predicates and their arguments in the reference and machine translation based on their lexical similarity. Once the predicates and arguments are aligned, their similarities are measured using a variety of methods such as cosine similarity and even the Meteor and BLEU metrics. When the final score is computed, the similarity scores replace the counts of correct and partial translations used in HMEANT. This metric outperforms several automatic metrics including BLEU, Meteor and TER, but it significantly underperforms HMEANT and HTER. Their investigations show that automating the second step does not affect the performance of MEANT. Therefore, it seems to be the lower accuracy of the semantic role labelling that is responsible for the performance gap with HMEANT. When computing the HMEANT and MEANT scores, matching predicates and semantic roles are each multiplied by a weight which determines the contribution of each to the meaning of the sentence and consequently to the computation of the score. These weights are set by tuning them on a development set using a grid search. Lo and Wu (2013) introduce UMEANT, a variation of MEANT, which uses an unsupervised way of estimating these weights; the weight of each semantic role is its relative frequency in the reference translation set. They report that UMEANT achieves performance competitive with MEANT. While HMEANT was originally applied to Chinese-English translations, Bojar and Wu (2012)

experiment with HMEANT on English-Czech. They use the metric for ranking and select 50 sentences, each with 13 machine translations, for their experiments. They identify a set of flaws of the metric and propose solutions for them. The most important problems concern the semantic role annotation step and stem from the superficial SRL annotation guidelines. These problems can be exacerbated in MEANT due to the automatic nature of the two steps. In addition, when there is no verb predicate in the sentence, such as in nominal constructions (e.g. titles) or with missing verbs as a result of erroneous translation, the sentence is ignored. Another problem they identify is the simple role label set of HMEANT, which falls short when annotating cases such as passive constructions or secondary objects. They suggest a metric variant where role labels are not taken into account. More recently, Lo et al. (2014) extend MEANT to ranking translations without a reference, i.e. quality estimation. This metric is called XMEANT and uses 1) phrase translation probabilities for aligning semantic role fillers of the source and its translation and 2) bracketing ITG (Inversion Transduction Grammar) constraints on the word alignment of semantic role fillers. They claim that XMEANT outperforms MEANT for two reasons. First, the machine translation output is closer to the source sentence in terms of semantic structure than a reference translation is. The second reason concerns the new method of word-aligning the tokens inside semantic arguments, which uses bracketing ITG constraints instead of the bag-of-words approach used in MEANT. In our work, unlike both QE works introduced above, we estimate quality scores instead of ranking two translations. Our semantic-based approaches to QE work in both directions, i.e. we use statistical methods like Pighin and Màrquez (2011) to predict the quality and we design a metric for measuring the quality like Lo et al. (2014). This metric is simpler to compute and has no parameters that require tuning. It has both labelled and unlabelled versions, the latter ignoring semantic role labels. Moreover, we combine various methods.

Table 7.1: CoNLL-2009 data used for training English SRL

Data set            Size (#sentences)
Training set        39,279
Development set     1,334
WSJ test set        2,399
Brown test set      425
Sum                 43,437

7.2 Data and Annotation

The experiments are carried out using the SymForum data set introduced in Section 3.2.1 of Chapter 3. For semantic role labelling of both the English and French data, we use LTH (Björkelund et al., 2009) as described in Section 6.2 of Chapter 6. For English, we train this system using all the data provided in the CoNLL-2009 shared task (Hajič et al., 2009), which includes the PropBank I training, development and test sets. The test sets include both the WSJ and Brown test sets. Note that the training and development sets also come from the WSJ corpus, so the Brown test set is from a different domain than the rest of the corpus. The sizes of the data sets are shown in Table 7.1. We use the reranking option of the tool, which performs slightly better than the original setting. For French, based on our findings in Chapter 6, we train it on the Classic 1K manually annotated data. Since the French data set only annotates verb predicates, we only use the verb predicates of the English side as well, though the CoNLL-2009 shared task provides both verb and nominal predicate annotation. Since the reranking option of LTH has a language-dependent feature, we do not use it for this data. The syntactic infrastructure of the semantic role labelling of both the English and French sides of the data set is the one introduced and used for the syntax-based QE experiments in Section 4.3 of Chapter 4. The English SRL data comes with gold standard syntactic annotation. On the other hand, for our QE data set, such annotation is not available. Our preliminary experiments show that, since the SRL system heavily relies on syntactic features, the performance considerably drops when

Table 7.2: Performances of the English SRL system on various data sets and the French SRL system using 5-fold cross-validation

                           Unlabelled                Labelled
                           P       R       F1        P       R       F1
English   WSJ test         85.76   79.96   82.76     80.59   75.14   77.77
English   Brown test       81.78   76.52   79.07     69.33   64.87   67.02
French    Classic 1K       83.69   81.62   82.62     68.54   66.84   67.66

the syntactic annotation of the test data is obtained using a different parser than that of the training data. We therefore replace the parses of the training data with those obtained automatically by first parsing the data using the Lorg parser and then converting the output to dependencies using the Stanford converter. The POS tags are also replaced with those output by the parser. For the same reason, we replace the original dependency parses of the French training data (Classic 1K) with those generated by the ISBN parser and its POS tags with those output by the MElt tagger.3 Table 7.2 shows the performances of our English and French SRL systems. When trained only on the training section of PropBank, the English SRL achieves

77.77 and 67.02 labelled F1 points on the WSJ and Brown test sets respectively. The French SRL, evaluated with a 5-fold cross-validation on the Classic 1K data set, obtains an average labelled F1 of 67.66. The large gap between the unlabelled and labelled performance of the French SRL is noticeable, suggesting that its main problem is in assigning labels to the arguments it finds. The unlabelled performance is almost the same as that of the English SRL evaluated on the in-domain WSJ test set, despite the large difference in training data size. More interestingly, the recall of the French SRL is even higher. This can also be seen in the number of predicates and arguments each of these systems finds in our QE data set, where there is little difference between them. These numbers are given in Table 7.3. However, the labelling (classification) performance of the French SRL is substan-

3 The original dependency annotation of Classic 1K is also generated using the ISBN parser. However, we found a slight difference with our replicated parses, which was probably due to a difference in training set size and/or parser version.

Table 7.3: Number of predicates and arguments extracted from the SymForum data set by English and French SRL systems

                       #predicates   #arguments
Source (English)       9,133         21,393
MT output (French)     8,795         20,024
Post-edit (French)     8,875         21,134

tially lower compared to the English one, especially in terms of precision. This can lead to a quality imbalance between the predicate-argument structures extracted from the source and target sides of our QE data set, which can subsequently interfere with the genuine imbalance caused by the quality of translation – which is in fact what we are trying to measure. Our analysis of the experimental results in the rest of this chapter will shed more light on this matter.

7.3 Semantic-based QE with Tree Kernels

In Section 4.2.2 of Chapter 4, we successfully used tree kernels to employ syntactic trees in quality estimation. In this section we use them to exploit semantic role labelling encoded as trees for semantic-based QE. Similar to the syntax-based approach, we use both dependency-based and constituency-based semantic trees. The kernels in all semantic-based QE experiments are computed in the same way as for the syntax-based QE systems, as described in Section 4.2.2 of Chapter 4.

7.3.1 Dependency Tree Kernels

Unlike syntactic parsing, semantic role labelling does not produce a tree to be directly used in the tree kernel framework. However, there are various ways to convert the output of an SRL system to a tree. We explore three different methods as follows:

• PAS (predicate-argument structure) format introduced by Moschitti et al. (2006)

• PST (proposition subtree) introduced here

• SAS (semantic-augmented syntactic trees) based on syntactic tree kernels

128 (a) PAS with POS tags (b) PAS with word forms

Figure 7.1: Two variations of dependency PAS formats for the sentence: Can anyone help?

In the following sections, we describe each method with their variations and present the experimental results for each of them.

7.3.1.1 PAS Format

In this format, a fixed number of nodes are gathered under a dummy root node as slots of one predicate and 6 arguments of a proposition4 (one tree per predicate).

Each node dominates an argument label or a dummy label for the predicate (rel), which in turn dominates the POS tag of the argument or the predicate lemma. If a proposition has more than 6 arguments, the extra ones are ignored; if it has fewer than 6 arguments, the extra slots are attached to a dummy null label. Figure 7.1a shows an example tree in this format: the first slot is filled by AM-MOD, the PropBank label for the modal adjunct role of Can, the second by rel, the label used for the predicate corresponding to the verb help, and the third by A0, the PropBank label representing the agent role of anyone. We build a tree kernel system using such PAS trees extracted from the predicate-argument structures of the source and target. There is one tree per proposition, thus the number of trees varies across instances. The performance of this system

(SeTK/D-PASPOS) is shown in Table 7.4. The performance of the B-WMT17 baseline built in Section 4.3.1 of Chapter 4 is repeated in the table (in grey) for comparison. The PAS tree kernels perform statistically significantly worse than the baseline. This is somewhat expected as there is very little structure encoded in the trees.

4We use proposition to refer to the predicate and its arguments.
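For illustration, the sketch below serialises one proposition into a bracketed PAS tree with a fixed number of slots and null padding, as described above. The exact bracketing syntax expected by the tree kernel implementation is an assumption here; only the structure of the format is the point.

```python
def pas_tree(slots, max_slots=7):
    """Serialise one proposition into the bracketed PAS format.

    slots: (label, leaf) pairs in sentence order, where the predicate is given
    as ("rel", lemma) and arguments as (role, POS-or-word), e.g.
    [("AM-MOD", "MD"), ("rel", "help"), ("A0", "NN")].  Slots beyond
    max_slots (one predicate plus 6 arguments) are dropped; missing slots
    are padded with dummy 'null' nodes.
    """
    nodes = [f"({label} ({leaf}))" for label, leaf in slots[:max_slots]]
    nodes += ["(null)"] * (max_slots - len(nodes))
    return "(PAS " + " ".join(nodes) + ")"

print(pas_tree([("AM-MOD", "MD"), ("rel", "help"), ("A0", "NN")]))
# (PAS (AM-MOD (MD)) (rel (help)) (A0 (NN)) (null) (null) (null) (null))
```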

Table 7.4: Dependency tree kernel systems

                    1-HTER            HBLEU             Adequacy          Fluency
                    RMSE     r        RMSE     r        RMSE     r        RMSE     r
B-WMT17             0.2310   0.3661   0.2696   0.3806   0.2219   0.4710   0.2469   0.4769
SeTK/D-PASPOS       0.2489   0.1774   0.2856   0.1843   0.2423   0.2770   0.2652   0.3252
SeTK/D-PASword      0.2480   0.1789   0.2926   0.1669   0.2485   0.2660   0.2654   0.3221
SeTK/D-PSTw         0.2413   0.2082   0.2832   0.2237   0.2431   0.2567   0.2663   0.3080
SeTK/D-PSTwdp       0.2409   0.2136   0.2815   0.2450   0.2383   0.3169   0.2606   0.3670
SeTK/D-SASafx       0.2270   0.3699   0.2738   0.3391   0.2291   0.4022   0.2476   0.4731
SeTK/D-SASnode      0.2271   0.3667   0.2727   0.3483   0.2275   0.4169   0.2443   0.4930
SyTK/D-ST           0.2261   0.3778   0.2722   0.3546   0.2280   0.4118   0.2455   0.4860

Additionally, this format does not make any use of the surface form of the source or translation sentences. We therefore build another system using word forms instead of part-of-speech tags in the leaves. Surprisingly, this system (SeTK/D-PASword) performs even worse than when POS tags are used as shown in Table 7.4 (except in predicting HTER), rendering the PAS format insufficient for this purpose.

7.3.1.2 PST Format

We propose another format in which all proposition subtrees (PST) of the sentence are gathered under a dummy root node. A PST is formed by the predicate label (frameset) as the root, dominating its own lemma and all of its arguments' role labels. The lemma in turn dominates the word form of the predicate, and the argument roles dominate the word forms of their role fillers. Figure 7.2a shows an example PST of a sentence with only one proposition. The performance of the system built with trees in this format (SeTK/D-PSTw) is presented in Table 7.4. It is similar to the performance of the PAS systems and far lower than the baseline. While human-targeted metric prediction improves, manual metric prediction degrades, when compared to the PAS systems.

We now add more information to the PST trees by adding the POS tag and dependency relation of the argument node to its head as siblings of the word form as shown in Figure 7.2b. The new system is named SeTK/D-PSTwdp in Table 7.4.

130 (a) PST with word forms (b) PST with word forms, dependency labels and POS tags

Figure 7.2: Two variations of dependency PST formats for the sentence: Can anyone help?

The new format achieves the highest results among the semantic tree kernel systems. The improvements are substantial in the case of manual metric prediction. However, the scores are still far below the baseline.

7.3.1.3 SAS Format

The above formats, motivated by the predicate-argument structure, do not seem to capture enough information about the quality of translation. On the other hand, our experiments using syntactic tree kernels for this purpose have shown promise. Inspired by this, we turn our attention to augmenting syntactic tree kernels with semantic information instead of building proposition-based tree kernels. We augment the dependency tree kernels introduced in Section 4.2.2 of Chapter 4 in two ways. The first method affixes the argument role label to the syntactic dependency label of the argument node. Figure 7.3a shows an example tree in this format.
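The affixing operation itself is straightforward; the following sketch applies it to a simple token-level dependency representation assumed for illustration. The second, node-based variant described next would instead insert the role label as an extra node between the dependency relation and the word.

```python
def affix_roles(sentence, roles):
    """First SAS variant: affix argument role labels to dependency relations.

    sentence: list of (form, head_idx, deprel) per token (0-based, -1 = root).
    roles: {token_idx: role_label} from the dependency-based SRL.
    Returns the same structure with e.g. 'nsubj' turned into 'nsubj-A0'
    for tokens that fill an argument role.
    """
    return [
        (form, head, f"{rel}-{roles[i]}" if i in roles else rel)
        for i, (form, head, rel) in enumerate(sentence)
    ]

# "Can anyone help ?": help is the root predicate, anyone its A0, Can its AM-MOD.
sent = [("Can", 2, "aux"), ("anyone", 2, "nsubj"), ("help", -1, "root"), ("?", 2, "punct")]
print(affix_roles(sent, {0: "AM-MOD", 1: "A0"}))
```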

Using the augmented trees of both the source and target sides, we build SeTK/D-SASafx. The performance of this system is shown in Table 7.4. As the results show, this format clearly suits quality estimation better than the previous ones. It even outperforms the baseline in predicting HTER and performs close to the baseline in Fluency prediction. The difference with the baseline on HTER prediction is small in terms of Pearson r but considerable in terms of RMSE. The second method differs from the first in that it attaches the argument role label as a node under the syntactic dependency label, which then dominates the argument itself. An example is given in Figure 7.3b.

131 (a) Argument labels as suffixes (b) Argument labels as nodes

Figure 7.3: Two variations of dependency SAS formats for the sentence: Can anyone help?

The system built with the source and target trees augmented in this way is named SeTK/D-SASnode in Table 7.4. As can be seen, integrating the argument role labels as nodes tends to be more useful than affixing them to the dependency relation nodes, perhaps due to the higher level of data sparsity caused by the latter. The only slightly degraded score is the Pearson r of HTER prediction. The gaps are larger in the case of Fluency prediction. These semantically augmented tree kernel systems can be compared to the one built with pure syntactic trees. We build such a system, named SyTK/D-ST in Table 7.4, using the syntactic dependency trees of the source and target sides used in Section 4.3.2. Apparently, none of the augmentation methods helps improve over the pure syntax-based systems in predicting human-targeted metrics. There are however slight improvements in manual metric prediction using the second method

(SeTK/D-SASnode), though none of them are statistically significant. Overall, the overhead imposed by acquiring semantic information for the semantic-based QE systems built here does not seem to be worthwhile. We explore alternative methods of utilizing such information in the next sections.

7.3.2 Constituency Tree Kernels

Due to the restriction of our SRL resources for the target side, we use dependency-based semantic role labelling. Semantic roles on a dependency tree are assigned to the dependent nodes of dependency relations, which are single word tokens. Focusing on the translation of a word token overlooks the rest of the phrase containing that word (generally as its syntactic head). Therefore, it might be helpful to convert the dependency-based semantic role labelling to a constituency-based one, so that the constituents can be involved in the semantic QE systems. While constituency-based SRL can be converted to the dependency formalism using head percolation rules (Surdeanu et al., 2008; Collins, 1997; Magerman, 1995), the other way around is not as straightforward. We therefore approximate the conversion using a heuristic we call D2C, which transfers the semantic role labelling from the dependency tree of a sentence to its constituency tree. This heuristic recursively elevates the argument role already assigned to a terminal node in the constituency tree (based on the dependency-based argument position) to the parent node as long as the following criteria are satisfied:

1. The argument node is not a root node.

2. The role is not an AM-NEG, AM-MOD or AM-DIS adjunct, since these roles are normally assigned to pre-terminals.

3. The parent node does not dominate the predicate node of the argument or another argument node of the same predicate.

Once the semantic role labelling is transferred, we extract constituency-based proposition subtrees (PST) and semantically augmented trees (SAS) from it, similarly to the dependency tree kernels, and build the tree kernel QE systems using these new formats as described in the following sections.
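A sketch of the D2C elevation is given below, over a minimal constituency-node class assumed for illustration. Starting from the pre-terminal above the argument token, the role is moved up the tree while the three criteria above hold.

```python
class Node:
    def __init__(self, label, children=None, token_idx=None):
        self.label, self.token_idx = label, token_idx
        self.children = children or []
        self.parent = None
        for c in self.children:
            c.parent = self

    def token_indices(self):
        """Set of token positions dominated by this node."""
        if self.token_idx is not None:
            return {self.token_idx}
        return set().union(*(c.token_indices() for c in self.children))

PRETERMINAL_ROLES = {"AM-NEG", "AM-MOD", "AM-DIS"}

def elevate(node, role, pred_idx, other_arg_idxs):
    """D2C: climb from the argument's pre-terminal towards the root while
    (1) a parent exists, (2) the role is not a pre-terminal adjunct, and
    (3) the parent does not dominate the predicate or another argument."""
    if role in PRETERMINAL_ROLES:
        return node
    while node.parent is not None:
        span = node.parent.token_indices()
        if pred_idx in span or span & other_arg_idxs:
            break
        node = node.parent
    return node

# Tiny illustrative tree for "Can anyone help ?" (bracketing is simplified).
nn = Node("NN", token_idx=1)
np = Node("NP", [nn])
root = Node("SQ", [Node("MD", token_idx=0), np,
                   Node("VP", [Node("VB", token_idx=2)]),
                   Node(".", token_idx=3)])
print(elevate(nn, "A0", pred_idx=2, other_arg_idxs={0}).label)   # -> 'NP'
```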

133 (a) Constituency tree (b) Constituency PST (c) Constituency SAS

Figure 7.4: Constituency tree for the sentence Can anyone help? and the PST and SAS formats extracted from it. The minimal difference between (b) and (c) is specific to this example where the proposition subtree spans over the majority of the constituency tree nodes.

7.3.2.1 PST Format

The constituency PSTs are the lowest common subtrees spanning the predicate node and its argument nodes and are gathered under a dummy root node for each sentence. The argument role labels are concatenated with the syntactic label of the argument node. Predicates are not marked. A sample PST is presented in Figure 7.4b, which is extracted from the constituency tree of the sentence shown in Figure 7.4a. We build a tree kernel system using these PSTs in the same way the dependency-based PST tree kernels were built in the previous section. Table 7.5 shows the evaluation results for this system (SeTK/C-PST). Overall, the system performs at the same level as the dependency-based PSTs in Table 7.4. Human-targeted metric prediction scores are slightly higher while the manual metric prediction scores are lower. Therefore, as in the dependency-based experiments, we switch to the augmentation method described next.

7.3.2.2 SAS Format

In the constituency SAS format, the argument role labels are affixed to the syntactic label of the argument node in the constituency tree to which the dependency-based

SRL is transferred via the D2C heuristic. Figure 7.4c shows the semantic augmenta-

Table 7.5: Constituency tree kernel systems

                 1-HTER            HBLEU             Adequacy          Fluency
                 RMSE     r        RMSE     r        RMSE     r        RMSE     r
B-WMT17          0.2310   0.3661   0.2696   0.3806   0.2219   0.4710   0.2469   0.4769
SeTK/C-PST       0.2400   0.2319   0.2809   0.2541   0.2410   0.2966   0.2615   0.3616
SeTK/C-SAS       0.2289   0.3462   0.2744   0.3359   0.2261   0.4277   0.2441   0.4940
SyTK/C-ST        0.2292   0.3446   0.2749   0.3349   0.2266   0.4241   0.2442   0.4939

tion of the constituency tree in Figure 7.4a. The system built with the augmented constituency trees is shown in Table 7.5 as SeTK/C-SAS. As expected, it performs better than the constituency PST system. The system can also be compared with

SyTK/C-ST, which is the system built with the original constituency trees of the source and target; all the scores are higher but the differences are negligible.

Compared to the augmented dependency trees (SeTK/D-SASafx), we see a slight improvement in predicting manual metrics. However, human-targeted metric prediction, especially HTER prediction, has degraded. This behaviour stems from the fact that the plain dependency-based tree kernels perform better than the plain constituency tree kernels, as can be seen by comparing SyTK/C-ST in Table 7.5 with SyTK/D-ST in Table 7.4. This is curious, as constituency trees contain more structure than dependency trees and were shown to be more useful in Section 4.2.2 of Chapter 4 when the News data set was used. Another possible reason is that the dependency-based argument roles transferred to the constituency trees are not accurate. Again, it is apparent that drawing too many conclusions from experiments on a single data set should be avoided.

7.3.3 Combined Constituency and Dependency Tree Kernels

We now combine the constituency- and dependency-based tree kernel QE systems to examine their complementarity. Two different combinations are considered: 1) combining PST-based systems, and 2) combining semantically augmented systems.

Table 7.6: Combined constituency and dependency tree kernel systems

                1-HTER            HBLEU             Adequacy          Fluency
                RMSE     r        RMSE     r        RMSE     r        RMSE     r
B-WMT17         0.2310   0.3661   0.2696   0.3806   0.2219   0.4710   0.2469   0.4769
SyTK            0.2267   0.3693   0.2721   0.3559   0.2258   0.4306   0.2431   0.5013
B+SyTK          0.2243   0.3935   0.2655   0.4082   0.2215   0.4632   0.2403   0.5144
SeTK/CD-PST     0.2394   0.2311   0.2795   0.2714   0.2373   0.3303   0.2578   0.3923
SSTK            0.2269   0.3682   0.2722   0.3537   0.2253   0.4351   0.2425   0.5046
B+SSTK          0.2227   0.4104   0.2671   0.3948   0.2174   0.4957   0.2381   0.5273

Table 7.6 shows the performance of the new systems.

The combination of PST systems (SeTK/CD-PST) outperforms its best component, except in predicting HTER, where both systems achieve almost the same scores. The improvement in Fluency prediction is statistically significant. The performances, however, remain below the baseline. On the other hand, the augmented combination, which is in fact our syntactico-semantic tree kernel system and is thus named SSTK, improves only slightly over its highest-performing component, but it does so in all settings. The improvements for Fluency prediction are the largest, albeit not statistically significant. Compared to the plain syntactic tree kernel system, also presented in the table as SyTK, manual metric prediction is marginally improved but the human-targeted metric prediction scores are slightly lower. None of these positive or negative differences are statistically significant. Moreover, this system outperforms the baseline only in Fluency prediction. Again, the gaps are not statistically significant. When combined with the baseline, the resulting system, B+SSTK, improves over both components. The improvements in Fluency prediction scores as well as the RMSE of HTER prediction are statistically significant; other gains are not statistically significant, although they are relatively large. When compared to the combination of plain syntactic tree kernels (B+SyTK), all metrics are predicted better except HBLEU and especially Adequacy. Overall, semantic tree kernels show no advantage over syntactic tree kernels

and require one more preprocessing step. However, their combination with the WMT baseline features is more fruitful than the combination of the syntax-based tree kernels with these features. It should also be noted that, compared to its use for syntax-based QE, the tree kernel approach to semantic-based QE required more engineering effort in order to find a suitable representation. Therefore, the advantage of this method over hand-crafted features should be taken with a grain of salt. In the next section we replace tree kernels with hand-crafted semantic features.

7.4 Semantic-based QE with Hand-crafted Features

In Section 4.2.3 of Chapter 4, we designed a set of features extracted from the constituency and dependency parses of the source and target text to capture various aspects of translation quality. In a similar vein, and as an alternative/complement to the semantic tree kernels of the previous section, we introduce a set of such features extracted from the semantic role labelling of the source and target text. Table 7.7 lists the feature templates in this feature set. The main idea behind this feature set is to capture the predicate-argument correspondence between the source and target, as the features are extracted from both sides. The features model various aspects of this correspondence, such as the role/predicate labels and the role fillers, as well as their syntactic annotation. Each feature template may contain one or more features. Feature templates 1 to 5 each contain two features, one extracted from the source and the other from the translation. Feature templates 6 to 8 are nominal features and need to be binarized, as we use SVM for learning the QE model. This leads to several features per feature template. Similar to the syntactic hand-crafted features in Section 4.2.3 of Chapter 4, we impose a frequency cutoff threshold to control the number of binarized features in the feature set in order to tackle sparsity. To compute argument span sizes (feature templates 4 and 5), we use the constituency conversion of the SRL

Table 7.7: Original semantic feature set to capture the predicate-argument correspondence between the source and target; each feature is extracted from both source and target, except feature number 9, which is based on the word alignment between source and target.

1  Number of propositions
2  Number of arguments
3  Average number of arguments per proposition
4  Sum of span sizes of arguments
5  Ratio of sum of span sizes of arguments to sentence length
6  Proposition label sequences
7  Constituency label sequences of proposition elements
8  Dependency label sequences of proposition elements
9  Percentage of predicate/argument word alignment mapping types

obtained using the D2C heuristic introduced in Section 7.3.2. The proposition label sequence (feature template 6) is the concatenation of the argument role and predicate labels of the proposition with their order preserved (e.g. A0-go.01-A4). Similarly, the constituency and dependency label sequences of the proposition elements (feature templates 7 and 8) are extracted by replacing the argument and predicate labels with their constituency and dependency labels respectively. Feature template 9 consists of three features based on the word alignment of the source and target sentences: the percentages of non-aligned, one-to-many-aligned and many-to-one-aligned predicates and arguments. The word alignments are obtained using the grow-diag-final-and heuristic, as it performed slightly better than the other types implemented in the Moses toolkit. There are 62 individual features in the feature set. It should be noted that a number of features in addition to those presented here have been tried. For example, for all numerical feature templates, the ratio and difference of the source and target feature values were tried. Other examples are POS tag sequences of proposition elements, similar to feature templates 7 and 8, and the number of predicate/argument word alignment mapping types, similar to their percentage in feature template 9. However, through manual feature selection, we have removed features which did not appear to contribute much.
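To make the templates more concrete, the sketch below extracts templates 1 to 6 of Table 7.7 from one side of a translation pair. The proposition data structure (elements listed in surface order with their span sizes) is an assumption made for illustration and is not the format used in the original implementation.

```python
# Each proposition is assumed to list its elements (predicate and arguments)
# in surface order as (kind, label, span_size) triples, where kind is "pred"
# or "arg", e.g. {"elements": [("arg", "A0", 1), ("pred", "go.01", 1), ("arg", "A4", 2)]}.

def semantic_features(props, sent_len):
    args = [(lbl, n) for p in props for kind, lbl, n in p["elements"] if kind == "arg"]
    n_props, n_args = len(props), len(args)
    span_sum = sum(n for _, n in args)
    return {
        "num_propositions": n_props,                                  # template 1
        "num_arguments": n_args,                                      # template 2
        "args_per_prop": n_args / n_props if n_props else 0.0,        # template 3
        "arg_span_sum": span_sum,                                     # template 4
        "arg_span_ratio": span_sum / sent_len if sent_len else 0.0,   # template 5
        "prop_label_seqs": ["-".join(lbl for _, lbl, _ in p["elements"])
                            for p in props],                          # template 6, e.g. "A0-go.01-A4"
    }
```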

Table 7.8: QE system with hand-crafted semantic features

              1-HTER            HBLEU             Adequacy          Fluency
              RMSE     r        RMSE     r        RMSE     r        RMSE     r
B-WMT17       0.2310   0.3661   0.2696   0.3806   0.2219   0.4710   0.2469   0.4769
SeTK/CD-PST   0.2394   0.2311   0.2795   0.2714   0.2373   0.3303   0.2578   0.3923
SyHC          0.2435   0.2572   0.2797   0.3080   0.2334   0.3961   0.2479   0.4696
SeHC          0.2482   0.1794   0.2868   0.1636   0.2416   0.2972   0.2612   0.3577
SSHC          0.2362   0.3107   0.2787   0.3027   0.2326   0.4009   0.2471   0.4726
B+SeHC        0.2310   0.3660   0.2697   0.3792   0.2275   0.4444   0.2439   0.4873
B+SSHC        0.2271   0.4066   0.2677   0.4030   0.2252   0.4658   0.2393   0.5234

Table 7.8 shows the performance of the system (SeHC) built with these features. For comparison purposes, the baseline system (B-WMT17), the tree kernel system built with PST tree kernels in the previous section (SeTK/CD-PST), as well as the system built with syntax-based hand-crafted features (SyHC; see Section 4.3.3) are also given (in grey). The semantic features perform substantially worse than the syntactic features and thus also the baseline, especially in predicting human-targeted scores. Moreover, SeTK/CD-PST outperforms this system by large gaps on human-targeted metric prediction. Not surprisingly, relying solely on semantic-based features for estimating machine translation quality does not seem to be reasonable. As a first remedy, we mix them with the syntax-based features, i.e. we combine SeHC and SyHC to build SSHC. Table 7.8 shows the scores achieved by this system. As the results show, the combination is useful for HTER prediction, where the scores significantly improve over the syntax-based system. We also see slight improvements in the other three cases, except for the Pearson r of HBLEU prediction, which has slightly degraded. However, compared to the baseline, only Fluency prediction obtains competitive scores. These features are chosen from a comprehensive set of semantic features, so they should ideally capture adequacy better than general features. This is not the case, however, probably because of the quality of the underlying semantic analysis.

We combine the semantic features (SeHC) with the baseline features. The combined system, B+SeHC, is shown in Table 7.8. It slightly improves over the baseline in predicting Fluency, but performs the same in terms of human-targeted metric prediction. The gap with the baseline in Adequacy prediction is still statistically significant.

When the full hand-crafted system in SSHC is combined with the baseline, the resulting system (B+SSHC) obtains better scores than both components for all metrics except Adequacy prediction. None of the changes are statistically significant except for those of Fluency prediction, which is surprising given the large increases in the scores. It seems that the semantic features are harmful for Adequacy prediction, which is precisely the notion they are supposed to capture. The fact that the Fluency scores can be predicted more accurately than the Adequacy scores may be related to the nature of these scores; it is easier to judge how fluent a translation is than how adequate it is in relation to the source sentence, as only one factor is involved in the former. This is especially relevant to our data set, since the source sentence can be ambiguous because 1) it comes from a technical domain and 2) it is user-generated.

7.5 Predicate-Argument Match (PAM)

As explained earlier, translation adequacy measures how much of the source meaning is preserved in the translated text, and predicate-argument structure, or semantic role labelling, expresses a substantial part of this meaning. Therefore, the degree to which the predicate-argument structures of the source and its translation match could be an important clue to translation adequacy, independent of the language pair used. We attempt to exploit Predicate-Argument Match (PAM) to create a metric that measures translation adequacy. Obviously, the crucial part of the PAM scoring algorithm is aligning the source and target predicates and arguments to find the matches. We try three means to accomplish this goal: 1) word alignment, 2) a lexical translation table and 3) a phrase translation table. The next sections describe our approach to using each of these methods together with their evaluation.

It should be noted that there are cases in the data set where no predicate is identified by the SRL system in the source, the target or both. Inspired by the observation that source sentences with no identified proposition tend to be short and can be assumed to be easier to translate, and based on experiments on the dev set, we assign a score of 1 to such sentences. When no proposition is identified on the target side while there is a proposition in the source, we assign a score of 0.5.
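The default-score rule described above amounts to the following simple wrapper around any of the PAM scoring methods of the next sections (a minimal sketch; pam_score stands for whichever scoring function is used):

```python
def pam_with_defaults(src_props, tgt_props, pam_score):
    """Apply the default scores for sentences without identified predicates."""
    if not src_props:            # no source proposition: assumed short and easy to translate
        return 1.0
    if not tgt_props:            # source has propositions but none identified in the target
        return 0.5
    return pam_score(src_props, tgt_props)
```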

7.5.1 Word Alignment-based PAM (WAPAM)

In this approach, a source predicate/argument is considered aligned to a target predicate/argument if there is a word alignment between them. Once the aligned predicates and arguments are identified, the problem is treated as one of SRL scoring, similar to the scoring scheme used in the CoNLL-2009 shared task (Hajič et al., 2009). Taking the source-side SRL as a reference, we compute the unlabelled and labelled precision and recall of the target-side SRL with respect to it as follows:

\[
UPrec = \frac{\#\ \text{aligned preds and their args}}{\#\ \text{target side preds and args}}
\qquad
URec = \frac{\#\ \text{aligned preds and their args}}{\#\ \text{source side preds and args}}
\qquad
UF_1 = \frac{2 \times UPrec \times URec}{UPrec + URec}
\]

\[
LPrec = \frac{\#\ \text{matching preds and their args labels}}{\#\ \text{target side preds and args}}
\qquad
LRec = \frac{\#\ \text{matching preds and their args labels}}{\#\ \text{source side preds and args}}
\qquad
LF_1 = \frac{2 \times LPrec \times LRec}{LPrec + LRec}
\]

where preds and args stand for predicates and arguments, UPrec and URec are unlabelled precision and recall, and LPrec and LRec are labelled precision and recall respectively. We obtain word alignments using the Moses toolkit, which can generate alignments in both directions and combine them using a number of heuristics.
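The six scores follow directly from the counts in the formulas above. The sketch below assumes the predicates and arguments of each side are available as lists of labelled units and that the word-alignment step has already produced the list of aligned source-target unit pairs; these data structures are illustrative assumptions, not the actual implementation.

```python
def wapam_scores(src_units, tgt_units, aligned_pairs):
    """Compute WAPAM scores from source/target predicate-argument units.

    src_units / tgt_units: lists of (label, token) pairs for each side;
    aligned_pairs: list of (src_unit, tgt_unit) pairs linked by a word alignment.
    """
    def prf(n_matched):
        prec = n_matched / len(tgt_units) if tgt_units else 0.0
        rec = n_matched / len(src_units) if src_units else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        return prec, rec, f1

    uprec, urec, uf1 = prf(len(aligned_pairs))                     # unlabelled: any alignment counts
    labelled = [(s, t) for s, t in aligned_pairs if s[0] == t[0]]  # labelled: labels must also match
    lprec, lrec, lf1 = prf(len(labelled))
    return {"UPrec": uprec, "URec": urec, "UF1": uf1,
            "LPrec": lprec, "LRec": lrec, "LF1": lf1}
```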

Table 7.9: Performance of PAM metric scores using word alignments (WAPAM)

        1-HTER            HBLEU             Adequacy          Fluency
        RMSE     r        RMSE     r        RMSE     r        RMSE     r
UPrec   0.3325   0.1862   0.3851   0.1721   0.3319   0.2215   0.4334   0.2363
URec    0.3249   0.2483   0.3706   0.2338   0.3213   0.2796   0.4171   0.2927
UF1     0.3175   0.2328   0.3607   0.2179   0.3108   0.2698   0.4033   0.2865
LPrec   0.4260   0.1571   0.3978   0.1627   0.3898   0.1926   0.3737   0.2401
LRec    0.4230   0.1878   0.3903   0.1928   0.3827   0.2335   0.3614   0.2759
LF1     0.4247   0.1784   0.3903   0.1835   0.3839   0.2225   0.3586   0.2688

We try the intersection, union, source-to-target and grow-diag-final-and heuristics, but only the source-to-target results are reported here as they slightly outperform the others. Table 7.9 shows the RMSE and Pearson r of each PAM score not only against the Adequacy scores but also against the Fluency scores as well as the human-targeted metric scores on the test data set. As marked in the table, unlabelled recall achieves the best Pearson r across all settings, meaning that it best resembles the pattern in the translation quality scores. On the other hand, in terms of the proximity of the estimations, i.e. RMSE, the F1 scores are the best; unlabelled F1 is specifically preferable, as its RMSE is the best for three of the metrics, while labelled F1 has the lowest RMSE when estimating the Fluency scores. Overall, precision is the weakest measure for this purpose and unlabelled scores seem to be superior to labelled ones. The latter can be attributed to the fact that the unlabelled accuracies of the source and target (English and French) SRL are very close, establishing a balanced quality of predicate-argument structure between the source and target, which is essential to the performance of PAM. The Pearson r scores are higher than those of the hand-crafted semantic features

(SeHC in Table 7.8) for the human-targeted metrics but lower for the manual ones, especially Fluency. However, the RMSE scores are considerably larger, making the metric unsuitable for directly estimating the translation quality. We investigate the reasons behind this result in Section 7.5.4.
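The RMSE and Pearson r figures reported in these tables can be computed as follows (a generic sketch using scipy; the actual evaluation scripts used in the thesis are not shown here):

```python
import numpy as np
from scipy.stats import pearsonr

def evaluate_estimates(estimates, gold):
    """Return (RMSE, Pearson r) of the estimated scores against the gold scores."""
    estimates, gold = np.asarray(estimates, float), np.asarray(gold, float)
    rmse = float(np.sqrt(np.mean((estimates - gold) ** 2)))
    r, _ = pearsonr(estimates, gold)
    return rmse, float(r)
```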

 1: for each predicate in the source side do
 2:     Find the first predicate in the target side having the same label
 3:     if a match is found then
 4:         Add the predicates to the aligned predicates list
 5:     else
 6:         Find a translation of the predicate token in the lexical translation table which matches any target predicate token
 7:         if a match is found then
 8:             Add the predicates to the aligned predicates list
 9: for each predicate pair in the aligned predicates list do
10:     for each argument of the current source predicate do
11:         Find the first target argument having the same label
12:         Find a translation of the source argument token in the lexical translation table which matches the target argument token
13:         if a match is found then
14:             Add the arguments to the aligned arguments list of the current predicate pair
15: Calculate PAM scores using the aligned predicates list and arguments lists

Figure 7.5: LTPAM scoring algorithm

7.5.2 Lexical Translation Table-based PAM (LTPAM)

In essence, this approach considers a source predicate/argument aligned to a target predicate/argument if 1) they have the same label and 2) there is an entry in a lexical translation table for the source-side predicate/argument token, a translation of which matches the target predicate/argument token. The exact algorithm is sketched in Figure 7.5. As can be noted in steps 6 to 8 of the algorithm, it does not rely solely on the predicate labels to align the predicates; it additionally checks the translation table to see if the target predicate is a translation of the source one.5 Note also that, since the alignment process relies on the argument role label, unlabelled scores are not meaningful with the LTPAM method.
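The following sketch implements the alignment steps of Figure 7.5 under assumed data structures: each proposition carries a predicate label, a predicate token and a list of (role, token) arguments, and lex_table maps a source token to the set of its translations. It is an illustration of the algorithm, not the actual implementation used in the thesis.

```python
def ltpam_align(src_props, tgt_props, lex_table):
    """Align source and target predicates and arguments as in Figure 7.5."""
    aligned_preds, aligned_args = [], []
    for sp in src_props:
        # steps 1-4: first target predicate with the same label
        tgt = next((tp for tp in tgt_props
                    if tp["pred_label"] == sp["pred_label"]), None)
        if tgt is None:
            # steps 5-8: fall back to a lexical-translation match of the predicate tokens
            trans = lex_table.get(sp["pred_token"], set())
            tgt = next((tp for tp in tgt_props
                        if tp["pred_token"] in trans), None)
        if tgt is None:
            continue
        aligned_preds.append((sp, tgt))
    for sp, tp in aligned_preds:
        # steps 9-14: align arguments that share a role label and whose tokens
        # are translations of each other according to the lexical table
        pair_args = []
        for role, stok in sp["arguments"]:
            trans = lex_table.get(stok, set())
            match = next(((r, t) for r, t in tp["arguments"]
                          if r == role and t in trans), None)
            if match is not None:
                pair_args.append(((role, stok), match))
        aligned_args.append((sp, tp, pair_args))
    # step 15 would compute the PAM scores from aligned_preds and aligned_args
    return aligned_preds, aligned_args
```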

5Although the predicate labels in the training data of the French SRL are based on English PropBank framesets, the French SRL system sometimes creates the predicate label based on its lemma. Consequently, the match between the source and target predicate labels is lost in such cases. Therefore, in addition to checking for a label match, a translation match is also checked.

Table 7.10: Performance of PAM metric scores using lexical translation table (LTPAM)

        1-HTER            HBLEU             Adequacy          Fluency
        RMSE     r        RMSE     r        RMSE     r        RMSE     r
LPrec   0.3729   0.1674   0.3983   0.1534   0.3700   0.1523   0.4374   0.1509
LRec    0.3646   0.2245   0.3822   0.2110   0.3540   0.2281   0.4154   0.2208
LF1     0.3607   0.2089   0.3758   0.1942   0.3498   0.2045   0.4066   0.2012

The lexical translation table for this purpose is built from a parallel corpus consisting of the English-French Europarl corpus (∼2M sentence pairs), the Symantec translation memories introduced in Chapter 2 (∼860K sentence pairs), and the Symantec forum data used for SMT evaluation in that chapter plus some additional parallel sentences from the Symantec forum (∼3K sentence pairs).6 The corpus contains approximately 2,860K sentence pairs and the resulting lexical translation table includes more than 4.5M entries. Table 7.10 shows the evaluation results of the scores estimated using this approach. Consistently, recall scores best simulate the quality score patterns and F1 scores offer the closest estimates; precision scores are the weakest estimations. This is consistent with the WAPAM evaluation results. Compared to WAPAM, all the LTPAM estimations achieve higher RMSE and lower Pearson r, showing that word alignments are a better means of aligning source and target predicate-argument structures than the lexical translation table. The gaps are especially large for the manual metrics. This may imply that the quality of the lexical translation table is lower than that of the word alignments. In the next section, we verify the extent to which this hypothesis is true.

7.5.2.1 Filtering the Lexical Translation Table

In order to reduce the impact of the noise present in the lexical translation table, we filter the table using the translation probabilities assigned to each entry. We try various threshold values, below which entries are filtered out. These values include 0.9, 0.75, 0.5, 0.25, 0.1, 0.05, 0.01, 0.005 and 0.001. It appears that high thresholds such as 0.9 and 0.75 are too restrictive. Surprisingly, however, lower

6We use the output of training step 4 of the Moses toolkit to build the lexical translation table.

Table 7.11: Performance of PAM metric scores using filtered lexical translation table (LTPAM)

        1-HTER            HBLEU             Adequacy          Fluency
        RMSE     r        RMSE     r        RMSE     r        RMSE     r
LPrec   0.3886   0.1837   0.3951   0.1699   0.3756   0.1739   0.4156   0.1841
LRec    0.3830   0.2323   0.3814   0.2221   0.3633   0.2384   0.3963   0.2445
LF1     0.3804   0.2190   0.3766   0.2061   0.3603   0.2183   0.3885   0.2281

values are also not useful; 0.5, 0.25 and 0.1 do not lead to a lexical translation table which better suits our application. We resort to values below 0.1 and find 0.005 to be better than the other thresholds. The results obtained by LTPAM after filtering the lexical translation table with this threshold are presented in Table 7.11. The behaviour is consistent with what we observed with the original lexical translation table. Compared to those results, Pearson r improves at the cost of RMSE, except for the RMSE of Fluency estimation, which has also improved. Nonetheless, the new scores are still lower than the WAPAM scores seen in Table 7.9.
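Filtering the lexical translation table by probability is straightforward; a sketch is given below. The column order of the Moses lexical table file assumed here (target word, source word, probability) is an assumption and should be checked against the actual lex file being used.

```python
def load_filtered_lex_table(path, threshold=0.005):
    """Load a lexical translation table, keeping only entries whose
    probability is at or above the threshold (0.005 worked best here)."""
    table = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            tgt, src, prob = line.rstrip("\n").split(" ")
            if float(prob) >= threshold:
                table.setdefault(src, set()).add(tgt)
    return table
```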

7.5.3 Phrase Translation Table-based PAM (PTPAM)

When dependency-based semantic role labelling and the lexical translation table are used for calculating PAM scores, only the translation of the single argument-bearing token is taken into account, overlooking the rest of the phrase headed by that token. PTPAM is another PAM scoring method which tries to address this problem. This approach is similar to LTPAM, with the difference that the argument boundaries are syntactic phrases instead of words. Therefore, a phrase translation table is used instead of the lexical translation table. The semantic roles are transferred from word tokens to syntactic constituents for this purpose using the D2C heuristic, in the same way as in Section 7.3.2. To build the phrase translation table, we use the same parallel corpus used to build the lexical translation table in the previous section.7 The table constructed

7We use the output of step 6 of the same training process used for creating the lexical translation table to build the phrase translation table.

Table 7.12: Performance of PAM metric scores using phrase translation table (PTPAM)

        1-HTER            HBLEU             Adequacy          Fluency
        RMSE     r        RMSE     r        RMSE     r        RMSE     r
LPrec   0.4166   0.1384   0.4013   0.1253   0.3886   0.1513   0.3928   0.1707
LRec    0.4122   0.1910   0.3896   0.1794   0.3782   0.2184   0.3747   0.2326
LF1     0.4131   0.1736   0.3889   0.1602   0.3792   0.1952   0.3711   0.2141

in this way contains approximately 90M entries. This demands a large amount of memory, and lookups in the table can be time-consuming, while only a small fraction of it is relevant to our data set. Therefore, we filter out those entries whose source side is not seen in the source side of our QE data set.8
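This filtering step can be approximated in a few lines; the sketch below is a simplified re-implementation of what the Moses filter-model-given-input.pl script does, assuming the standard 'source ||| target ||| scores' phrase table format.

```python
def filter_phrase_table(table_path, source_sentences, out_path):
    """Keep only phrase table entries whose source phrase occurs in the source
    side of the QE data; source_sentences is a list of token lists."""
    padded = [" " + " ".join(tokens) + " " for tokens in source_sentences]
    kept = 0
    with open(table_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            src_phrase = line.split("|||")[0].strip()
            if any(" " + src_phrase + " " in sent for sent in padded):
                fout.write(line)
                kept += 1
    return kept
```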

Table 7.12 shows the results of the evaluation of the quality scores estimated using PTPAM. With this approach, recall and F1 offer estimations closer to each other than before; while the former is still preferable in terms of Pearson r, the RMSEs of their estimated scores are very close. Compared to LTPAM in Table 7.10, only the Fluency estimation scores are slightly better. This degradation can be attributed to the quality of our

D2C heuristic and/or the quality of the phrase translation table. Considering that the heuristic is conservative and does not take much risk in moving role labels in the constituency tree, the latter seems to be the main culprit; after all, it is expected that the phrase translation table contains more noise than the lexical translation table upon which it is built.

Similar to LTPAM, we attempt to reduce the noise in the phrase translation table by relying on the translation probabilities. Based on our observations from filtering the lexical translation table, we start with moderate values for the probability threshold instead of high ones. We specifically examine 0.5, 0.25, 0.1, 0.05, 0.01 and 0.005 and find that the estimation quality tends to increase as the threshold decreases. The best results are obtained with 0.01 and are shown in Table 7.13.

The score pattern we see here is consistent with that of the original PTPAM in

8We use the filter-model-given-input.pl script in the Moses toolkit.

Table 7.13: Performance of PAM metric scores using filtered phrase translation table (PTPAM)

        1-HTER            HBLEU             Adequacy          Fluency
        RMSE     r        RMSE     r        RMSE     r        RMSE     r
LPrec   0.4543   0.1465   0.4217   0.1336   0.4186   0.1636   0.3924   0.1889
LRec    0.4511   0.1886   0.4125   0.1776   0.4106   0.2185   0.3777   0.2383
LF1     0.4523   0.1745   0.4120   0.1620   0.4116   0.1999   0.3742   0.2247

Table 7.12. In terms of performance, the scores have mostly dropped, showing that the filtered phrase translation table is not useful. This suggests that phrase translation probabilities are not reliable indicators of translation quality. To conclude the experiments with PAM scoring, word alignments seem to be the best means of aligning the predicates and arguments of the source and target sides of the translations. However, the resulting PAM scores are not acceptable estimations of the translation quality measured by any of the MT evaluation metrics used here. In Section 7.5.5 we look at another way of utilizing the PAM scores. Before that, however, we analyse PAM scoring in the next section to find the reasons hindering its performance.

7.5.4 Analysing PAM

Theoretically, PAM scores should generally be able to capture the adequacy of translation with reasonable accuracy.9 As we saw in the previous section, however, this is not the case in practice in our problem setting. There are two factors involved in the PAM scoring procedure, the quality of which can affect its performance:

• predicate-argument structure of the source and its translation

• alignment of the predicate-argument structures of the source and target

The SRL systems for both English and French are trained on edited newswire. On the other hand, our data is neither from the same domain nor edited. The

9There can be exceptions to this hypothesis such as when an idiom in the source text is literally translated.

Table 7.14: Results of the manual analysis of problems hindering PAM scoring accuracy

Problem                            Count   Aggregation
Source predicate identification     16
Source predicate labelling           0         82
Source argument identification      63
Source argument labelling            3
Target predicate identification     13
Target predicate labelling           9        138
Target argument identification      89
Target argument labelling           27
Alignment                            8          8
Translation divergences              9          9

problem is exacerbated on the translation target side, where our French SRL system is trained on only a small data set and applied to machine translation output. To discover the contribution of each of these factors to the accuracy of PAM, we carry out a manual analysis. We randomly select 10% of the development set (50 sentences) and count the number of problems falling into each of these two categories. The results are presented in Table 7.14. We find only 8 cases in which a wrong word alignment misleads the PAM scoring. On the other hand, there are 219 cases of SRL problems, covering predicate and argument identification and labelling: 82 cases (37%) in the source and 138 cases (63%) in the target. As expected, there are more target-side SRL problems than source-side ones. However, 82 errors within 50 source sentences indicate significant room for improvement in our English SRL system. It can be seen that identification problems constitute more than 82% of the SRL problems. Interestingly, there are fewer target-side predicate identification problems affecting PAM scores than source-side ones. This can be related to the relatively high performance of predicate identification of the French SRL shown in Table 7.2, which is very close to that of the English SRL. From the opposite perspective, this suggests that the accuracy of the PAM scores is correlated with the quality of the semantic role labelling.

We additionally look for cases where a translation divergence causes a predicate-argument mismatch between the source and translation. For example, without sacrificing is translated into sans impact sur (without impact on), a case of transposition, where the source-side verb predicate is left unaligned, thus affecting the PAM score. We find only 9 such cases in the sample, which is similar to the proportion of word alignment problems. This suggests that predicate-argument structure match can be applied to estimating the quality of machine translation without worrying about translation shifts. As mentioned in the previous section, PAM scoring has to assign default values for cases in which there is no predicate in the source or target. This can be another source of estimation error. In order to verify its effect, we find such cases in the development set and manually categorize them based on the reason causing the sentence to be left without predicates. There are 79 (16%) source and 96 (19%) target sentences for which the SRL systems do not identify any predicate, and in 64 of these cases both sides are without any predicate. When these 96 cases are taken out of the development set, the RMSE and Pearson r improve by 7% and 22% respectively. We find that these source sentences have no predicates identified for the following reasons:

• 57 (72%) because of sentence structure (e.g. copula verbs, which are not labelled as predicates in the SRL training data, titles, etc.),

• 20 (25%) because of a predicate identification error of the SRL system

• 2 (3%) because of spelling errors misleading the SRL system

On the other hand, the reasons causing these target side sentences to have no predicate identified are as follows:

• 65 (68%) due to sentence structure

• 14 (14.5%) due to an SRL error

• 13 (13.5%) due to mistranslation

• 2 (2%) due to untranslated spelling errors

• 2 (2%) due to tokenisation errors misleading the SRL system

Only the mistranslation problems, which comprise 13.5% of the cases, can actually help PAM capture the quality of translation. The main reason for sentences having no verbal predicates is sentence structure. This problem can be alleviated by employing nominal predicates on both sides. While this is possible for the English side, there are currently no French resources in which nominal predicates have been annotated.

7.5.5 PAM Scores as Hand-crafted Features

An alternative way to employ the PAM scores in estimating machine translation quality is to use them as features in a statistical framework. Due to their higher performance when evaluated directly, the WAPAM scores are used.10 Similar to our other hand-crafted QE systems in this work, we build an SVM model using all six WAPAM scores.
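A minimal sketch of such a model is given below. The thesis uses epsilon-SVR; scikit-learn and the random placeholder data are used here purely for illustration and are not the toolkit or features of the original experiments.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_train = rng.random((200, 6))      # rows of [UPrec, URec, UF1, LPrec, LRec, LF1]
y_train = rng.random(200)           # quality scores to predict (e.g. Adequacy)
X_test = rng.random((20, 6))

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```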

The performance of this system (SeHCpam) on the test set is shown in Table 7.15. The performance is considerably higher than when the PAM scores are used directly as estimations. For comparison purposes, the hand-crafted semantic system SeHC from Section 7.4 is also given in the table. Interestingly, compared to the 62 semantic hand-crafted features of SeHC, this small feature set performs noticeably better in predicting human-targeted metrics. However, despite the large gaps, only the RMSE difference of HTER prediction is statistically significant. On the other hand, for the manual metrics, although the performance of this feature set is lower than that of SeHC, only the RMSE difference of Fluency prediction is statistically significant.

We add the new features to our set of hand-crafted features in SeHC to yield a new system, named SeHC+pam in Table 7.15. As can be seen, all scores improve compared to the stronger of the two components, except for the RMSE of HTER prediction. The gain from the combination is particularly considerable and statistically significant

10We also tried LTPAM and PTPAM scores; their performances vary in the same way they did when used directly as estimations.

Table 7.15: Performance of WAPAM scores as features, alone (SeHCpam) and combined (SeHC+pam)

              1-HTER            HBLEU             Adequacy          Fluency
              RMSE     r        RMSE     r        RMSE     r        RMSE     r
B-WMT17       0.2310   0.3661   0.2696   0.3806   0.2219   0.4710   0.2469   0.4769
SeHC          0.2482   0.1794   0.2868   0.1636   0.2416   0.2972   0.2612   0.3577
SeHCpam       0.2414   0.2292   0.2833   0.2195   0.2414   0.2787   0.2661   0.3210
SeHC+pam      0.2445   0.2387   0.2822   0.2368   0.2370   0.3571   0.2575   0.3908
B+SeHCpam     0.2274   0.3977   0.2666   0.4069   0.2198   0.4854   0.2419   0.5016
B+SeHC+pam    0.2337   0.3417   0.2701   0.3697   0.2224   0.4694   0.2439   0.4881

in the case of manual metric prediction. However, the performance is still not close to the baseline.

We combine both the PAM feature set (SeHCpam) and the semantic feature set

(SeHC+pam) with the baseline features separately. The results are also shown in Table 7.15. Interestingly, the gains from combining the smaller feature set with the baseline (B+SeHCpam) are larger. In fact, the larger combination (B+SeHC+pam) degrades, except in Fluency prediction. Therefore, PAM features can replace a larger and thus costlier semantic feature set in a real-world application where features of various types are combined to build a quality estimation system.

7.6 Combined Semantic-based QE System

In Section 4.3.4 of Chapter 4, we combined syntactic tree kernels and hand-crafted features to build SyQE, our syntax-based QE system. In this section, we build the semantic counterpart of that system by combining the semantic tree kernel system

(SSTK) and the semantic hand-crafted system including the PAM features (SeHC+pam). This system is named SeQE in Table 7.16. The individual systems, as well as the baseline and the syntax-based system (SyQE), are also given (in grey) in the table for comparison purposes.

SeQE performs better than the stronger of its components. Except for Adequacy prediction, the improvements are statistically significant. It also slightly

Table 7.16: Performance of semantic-based QE system and its combination with the baseline

             1-HTER            HBLEU             Adequacy          Fluency
             RMSE     r        RMSE     r        RMSE     r        RMSE     r
B-WMT17      0.2310   0.3661   0.2696   0.3806   0.2219   0.4710   0.2469   0.4769
SyQE         0.2255   0.3824   0.2711   0.3650   0.2248   0.4393   0.2419   0.5087
SSTK         0.2269   0.3682   0.2722   0.3537   0.2253   0.4351   0.2425   0.5046
SeHC+pam     0.2445   0.2387   0.2822   0.2368   0.2370   0.3571   0.2575   0.3908
SeQE         0.2249   0.3884   0.2710   0.3648   0.2242   0.4447   0.2404   0.5182
B+SeQE       0.2219   0.4194   0.2670   0.3975   0.2188   0.4882   0.2362   0.5427

outperforms SyQE for all metrics other than HBLEU. Compared to the baseline, we see mixed results: HTER and Fluency prediction scores are higher than the baseline but

HBLEU and Adequacy prediction scores are lower. However, among all the changes, only the Fluency prediction improvements are statistically significant. In addition, we examine the complementarity of our semantic-based system with the baseline features by combining SeQE and B-WMT17. The new system, B+SeQE, is shown in Table 7.16. It is the highest-performing system built on this data set so far in this work and outperforms both of its components; all the gaps with SeQE, as well as the HTER and Fluency prediction gaps with the baseline, are statistically significant.

7.7 Syntactico-semantic-based QE System

Finally, we build our full syntactico-semantic quality estimation system. This system is the combination of the syntax-based and semantic-based QE systems, SyQE and SeQE respectively. It should be noted that these two systems are combined without the syntactic tree kernels (SyTK in Section 4.3.2) to avoid redundancy with SSTK, the tree kernel component of SeQE, as the latter consists of the semantically augmented syntactic tree kernels. Table 7.17 shows the performance of this system, named SSQE. It can be compared to the baseline and to each of its components in the same table.

The full syntactico-semantic system (SSQE) improves over its syntactic and semantic components, though the improvements are not statistically significant. Compared

Table 7.17: Performance of syntactico-semantic QE system and its combination with the baseline

            1-HTER            HBLEU             Adequacy          Fluency
            RMSE     r        RMSE     r        RMSE     r        RMSE     r
B-WMT17     0.2310   0.3661   0.2696   0.3806   0.2219   0.4710   0.2469   0.4769
SyQE        0.2255   0.3824   0.2711   0.3650   0.2248   0.4393   0.2419   0.5087
SeQE        0.2249   0.3884   0.2710   0.3648   0.2242   0.4447   0.2404   0.5182
SSQE        0.2246   0.3920   0.2696   0.3768   0.2230   0.4538   0.2402   0.5196
B+SSQE      0.2225   0.4144   0.2673   0.3953   0.2202   0.4771   0.2379   0.5331

to the baseline, HTER and Fluency prediction perform better, the latter being statistically significant. HBLEU prediction is around the same as the baseline, but Adequacy prediction performance is lower, though not statistically significantly. This is not the first time we observe this pairing of the metrics: HTER with Fluency and HBLEU with Adequacy. The last QE system we build is the combination of our syntactico-semantic system with the baseline features. This combination improves over its components; compared to the stronger component, only the HTER and Fluency prediction improvements are statistically significant. However, compared to the combination of the semantic-based system with the baseline features in B+SeQE, the results are slightly lower. Recall that, compared to B+SeQE, this system additionally contains only the syntactic hand-crafted features

(SyHC). Therefore, these features have a diminishing effect on B+SSQE. Interestingly, the small Fluency and Adequacy prediction score gaps between these two systems are statistically significant. Regardless of this comparison, the results confirm the synergy between syntactic/semantic information and general surface-driven features that we have been observing throughout our QE experiments.

7.8 Summary and Conclusion

We investigated various approaches to using semantic information in the quality estimation of machine translation. The particular semantic representation we tried here

was semantic role labelling, or predicate-argument structure. Using tree kernels, we found that purely semantic tree kernels built using proposition subtrees did not lead to reliable systems comparable to the baseline. Alternatively, augmenting the syntactic tree kernels built in Chapter 4 with semantic role labelling proved to be a better approach. Although no significant benefit was achieved over the plain syntactic tree kernels, their combination with the baseline features was found to be more fruitful than the combination of the syntactic tree kernels with these features. We then designed a comprehensive set of hand-crafted features extracted from the semantic role labelling of the source and target text. This feature set did not capture enough information about the quality of translation; it performed considerably lower than the semantic tree kernels. In addition to these statistical methods, we defined a metric, PAM, for estimating the translation quality based on the predicate-argument structure match between source and target. Unlike similar metrics, PAM is computed by a simple formula and has no parameters to be tuned. We offered three variations of this metric, each using a different means of aligning the predicates and arguments of the source with those of the translation: 1) word alignments, 2) a lexical translation table and 3) a phrase translation table. We found word alignments more suitable for this purpose. This metric showed performance competitive with the semantic hand-crafted features in terms of correlation with the quality scores. However, the estimation errors were much higher. We found that word alignment and translation divergence have only minor effects on the performance of this metric, whereas the quality of the semantic role labelling is the main hindering factor. Another major issue affecting the performance of PAM is the unavailability of nominal predicate annotation. Using the PAM scores as features in a statistical framework led to better estimations than using them directly as quality scores. These features alone outperformed the whole set of semantic hand-crafted features in predicting human-targeted metrics. When combined, these two sets of features yielded a better system. Our syntactico-semantic QE system could outperform the baseline in some

settings. However, the best QE performance accomplished in this thesis was achieved when the hand-crafted syntactic features were removed from this system, i.e. when the baseline features were combined with the semantic hand-crafted features and the semantically augmented tree kernels. The semantic role labelling we used was based on syntactic dependency trees, in which the labels are assigned to word tokens. Since there are no constituency SRL resources available for French, we resorted to a simplistic heuristic to transfer the labels upwards in the constituency trees to phrases. Perhaps genuine constituent-based arguments would make a more accurate comparison between the source and target predicate-argument structures possible. As mentioned, the suboptimal quality of our semantic role labelling, due to various reasons including 1) the lack of resources for French SRL, 2) the use of out-of-domain parsers and semantic role labellers and 3) their application to unedited and machine-translated text, is the main culprit in the low performance of our semantic-based QE approaches. In the next chapter, we further investigate the second and third reasons by attempting to improve the parsing accuracy and measuring its effect on the semantic-based QE approaches.

Chapter 8

The Foreebank: Forum Treebank

In the previous chapter, we observed that the low quality of semantic role labelling was the main factor impeding the performance of the PAM quality estimation metric. The semantic role labelling systems we used were trained on resources built on edited financial newswire text. However, we applied them to unedited user-generated text from a security software support context and its machine translation. These variations, i.e. edited vs. unedited and financial newswire vs. security software support, are generally known as domain shifts in natural language processing and machine learning, and impose difficulties in many tasks in these fields (Daumé III and Marcu, 2006; Blitzer et al., 2006; McClosky, 2010; Banerjee, 2013). These problems mainly originate in the lack of sufficient hand-crafted resources for training machine learning models for every single domain. Most of these resources are usually created for one domain only, as they are laborious and costly to create. For example, for many years, the WSJ portion of the Penn Treebank was the only major treebank used for English parsing. It should however be noted that the notion of domain is not well-defined in this context. It has been used at a very broad level to distinguish between written and spoken language, i.e. register (Finkel and Manning, 2009), at a more specific level to discern between different genres of text such as press/reportage, press/editorial, religion, etc., or between different web text

categories such as emails, newsgroups or user reviews (Petrov and McDonald, 2012; Mott et al., 2012). Moreover, depending on the type of variation, different terms have been used in the literature to address the distinction between the variants, among which are text genre, text type (Gildea, 2001), topic and style (Alumäe and Kurimo, 2010). The difficulties caused by domain shift are due to various new phenomena appearing in the application context which were not seen in the contexts upon which the systems were built. When the shift occurs from edited to unedited text, these phenomena include writing errors such as spelling, capitalization, grammar and punctuation errors, and writing style features such as informal structures, innovative use of language and emoticons. On the other hand, when moving from financial newswire to security software support, they include vocabulary and syntactic constructions such as questions, imperatives and sentence fragments. It has been previously shown that the performance of semantic role labelling is dependent on its underlying syntactic analysis (Punyakanok et al., 2008). The semantic role labelling system we used in the previous chapter is largely built upon syntactic features, such as those derived from the POS tagging of the predicate and argument and their neighbouring tokens, and from the dependency parsing of the sentence (Björkelund et al., 2009). Therefore, an improvement in the quality of syntactic parsing of the quality estimation data is likely to be translated into an improvement in the quality of its semantic role labelling. This improvement can in turn enhance the performance of semantic-based quality estimation systems, especially the PAM metric, which uses this information directly. For example, a simple mistake in the POS tag of the verbs in the example in Figure 8.1 leads to the SRL system losing/mislabelling predicates and, in turn, to the miscalculation of the PAM scores. While the adequacy score assigned by the human evaluator is 0.75 (when scaled to the [0-1] range), the URec (unlabelled recall) and UF1 (unlabelled F1) WAPAM scores (see Section 7.5.1 of Chapter 7) are 0.2857 and 0.3077.1 Fixing these POS tags manually and redoing the SRL changes these scores to 0.8 and 0.7273, very close to the human evaluation score.

Figure 8.1: Semantic role labelling of the sentence (-WFP- profile -WFP) Delete the three files and then restart Firefox. and its machine translation by the SRL systems in CoNLL-2009 format: 1) profile is mistakenly labelled as a predicate due to a wrong POS tag, 2) Delete is missed by the SRL system due to a wrong POS tag, losing the match with supprimez despite a correct alignment.

Toward this end, we build the Foreebank, treebanks taken from the English and French Norton forum text, from which the SymForum quality estimation data is also selected.2 These treebanks enable us to evaluate the quality of syntactic parsing of the forum text. In addition, we adapt an annotation strategy which makes it possible to analyse the user errors in the forum text from different perspectives, such as their effect on parsing performance.

1 URec and UF1 are used here because they were the best performing scores according to the results in Table 7.9 in Chapter 7.
2 There is no overlap between the Foreebank and SymForum sentences.

This intrinsic evaluation will also help measure the impact of parsing improvement on semantic-based QE in this domain. The Foreebank comprises two data sets, one in English and one in French. While the entire English Foreebank is selected from the Norton forum, only half of the French Foreebank comes from the French Norton forum; the other half consists of human translations of English forum segments. By choosing the French Foreebank from a mixture of forum text and its human translation, we aim to be as close as possible to machine-translated French text, which is the use case in this thesis, while avoiding the difficulties of annotating machine translation output. In addition, the treebank built in this way can also be used for more general purposes in parsing French user-generated content not specific to the work in this thesis. In the rest of this chapter, we first review the existing syntactic treebanks as well as the literature on parser domain adaptation in Section 8.1. In Section 8.2, we describe the annotation process of the Foreebank. In Section 8.3, we analyse the Foreebank by extracting some statistics from the data set. In Section 8.4, the parsing performance on the forum text is evaluated using the Foreebank. In Section 8.5, the effect of user errors on parsing the forum text is verified. In Section 8.6, we conduct a set of experiments dedicated to improving the performance of parsing forum text. In Section 8.7, we carry out an extrinsic parser evaluation by using the new parses to obtain the semantic role labelling of the quality estimation data, which is then used to replicate some of the semantic-based quality estimation settings of Chapter 7.³

8.1 Related Work

Syntactic annotation of a corpus with the aim of building a training (and evaluation) resource for statistical parsers is a tedious task and requires not only human labour but also expertise. For this reason, only a few such corpora, which

3For easy comparison, the results of the replicated semantic-based QE experiments are also presented in Table A.1 in Appendix A beside other results.

are known as treebanks, are available for only a handful of languages. For English, the major treebank used for training parsers is the WSJ portion of the Penn Treebank (Marcus et al., 1994), which contains about 45K sentences (over 1 million words) from the Wall Street Journal, an American newspaper with a special focus on business and economics, annotated with syntactic constituents. The WSJ corpus is formed by selecting 2,499 stories from a total of about 100K stories of the 1989 Wall Street Journal. In addition to the WSJ, the Penn Treebank (PTB) includes the Brown corpus (Kučera and Francis, 1967), the Switchboard corpus of telephone conversations (Taylor, 1996) and a sample of the ATIS (Air Travel Information System) corpus (Hemphill et al., 1990), annotated with the same syntactic structure. The Brown corpus originally contains over 1 million words from 500 samples (slightly over 2000 words each) across 15 text categories (genres) selected from a variety of contemporary American English sources. These genres include press reportage, editorial and reviews, religion, skills and hobbies, lore, biographies and memories, US government, science, general, mystery, adventure, romance and science fiction, and novels. The other corpora annotated based on the Penn Treebank annotation strategy are the Penn BioMedical Treebank (Warner et al., 2004), the English Translation Treebank (Mott et al., 2009) and the English Web Treebank (Mott et al., 2012). The English Translation Treebank (ETTB) consists of two treebanks, the English-Chinese Treebank (ECTB) and the English-Arabic Treebank (EATB), which annotate Chinese and Arabic newswire sentences and their translations into English in parallel. The English Web Treebank (aka Google Web Treebank) is a corpus of over 250K words, selected from weblogs, newsgroups, emails, local business reviews and Yahoo! answers. The annotation strategy for this treebank is based on the Penn Treebank annotation guidelines as well as those of the Switchboard and the other aforementioned treebanks built upon the Penn Treebank guidelines. However, the guidelines are adapted to address the phenomena specific to this type of text, which differs from the WSJ in various aspects, including being user-generated, unedited and extracted

from Web pages (Petrov and McDonald, 2012). The Foreebank corpus is closer to the English Web Treebank in these same respects. Other English treebanks similar to the Foreebank include the small treebank described in Foster et al. (2011a), which contains annotated sentences from tweets and a sports discussion forum. The annotation of these treebanks also follows the PTB guidelines. There has also been work addressing the parsing of question structures. Judge et al. (2006) create the QuestionBank, a set of 4,000 questions annotated with their phrase structure trees. The questions come from two different sources: the question-answering (QA) data set used to evaluate QA systems at the Text Retrieval Conference4 (TREC) and the CCG question classification data set (Li and Roth, 2002). For French, the most widely used treebank for training parsers is the French Treebank (Abeillé et al., 2003). This treebank consists of over 12,000 sentences selected from the Le Monde newspaper. In addition, the French Social Media Bank developed by Seddah et al. (2012) is a treebank of noisy user-generated data comprising 1,700 sentences from various types of social media including Facebook, Twitter, video games and a medical discussion forum. To make the corpus further from the FTB, they search for sentences with some UGC-specific patterns and also add extra noise to some sentences. The annotation of this treebank is based on the FTB-UC5 (Candito and Crabbé, 2009) annotation guidelines, extending them to suit noisy user-generated content. The problem of hand-crafted resource shortage is omnipresent in all areas of natural language processing and machine learning, including parsing. Gildea (2001)

finds that the performance of a parser trained on the WSJ is 5.7 F1 points lower when evaluated on the Brown corpus than when tested on the WSJ. When the parser is trained on the Brown corpus itself, it obtains an F1 that is higher by 3.5 points. He observes that training on a small amount of data which matches the test data is

4http://trec.nist.gov/
5FTB-UC is a modified version of the FTB.

more useful than training on a large amount of different data. He suggests that some features used to build statistical parsers, such as lexical co-occurrence, contribute to the dependence of parser performance on its training data. McClosky et al. (2010) evaluate parsers trained on several different English treebanks, including the WSJ, Brown, Switchboard, ETTB and GENIA (Tateisi et al., 2005), on each other and on the corpus of Foster and van Genabith (2008). The best performance on each test set is achieved by the parser trained on the same treebank, except for the ETTB, which is best parsed by the WSJ-trained parser, probably because it also comes from newswire. The largest performance drop due to the cross-domain application of the WSJ-trained parser is on the Switchboard test set. This can be attributed to the fact that these two treebanks differ not only in their domain but also in their register; one is edited written language and the other is transcribed spoken language. Foster (2010) investigates the performance of parsing unedited user forum text by parsing the sports forum treebank mentioned above with a model trained on the WSJ. The performance drops by 19 F1 points compared to parsing the WSJ test set. This gap exists consistently across the evaluation of four different parsing systems trained on the WSJ and applied to these two data sets. She manually examines the output of the parser to find the phenomena affecting parser performance. Coordination is found to be one of the main issues, due to problems such as the omission of conjunctions or their replacement with a comma, while spelling errors have only a small effect on the performance. Foster et al. (2011a) compare four different WSJ-trained parsers, two constituency and two dependency, on the sports forum and Twitter treebanks. They find that tweets are harder to parse despite their shorter length. POS tagging errors are found to be an important contributing factor in the parsing performance on tweets. Seddah et al. (2012) evaluate an FTB-trained parser on both the FTB test set and the French Social Media Bank. The parser performs about 20 F1 points lower on the latter. Furthermore, they find that this parser performs considerably lower

on the French Social Media Bank than on the French biomedical data (Emea French test set), which also contains many unknown words and constructions. They conclude that, despite what is generally believed, POS tagging and parsing are not close to being solved problems. To tackle these issues, a considerable amount of research has been devoted to domain adaptation, the goal of which is to train models which generalize well to a new domain (Blitzer et al., 2006; Foster et al., 2008; Daumé et al., 2010; Plank, 2011). Again, the term domain is used in its broadest sense here. The simplest approaches involve adding whatever additional training data is available in the target domain, the domain to which the model will ultimately be applied, to the original training data. For instance, Gildea (2001) trains a parsing model by adding to the WSJ training data the Brown corpus, which is half the size of the WSJ corpus. Parsing the Brown test set with this model improves over the performance obtained by the model trained only on the WSJ by 3.7 F1 points. On the other hand, the combination is not very useful when its performance is compared to that of either the Brown-trained model on the Brown test set or the WSJ-trained model on the WSJ test set, indicating that adding supplementary data from a domain different from the target is not helpful regardless of its size. It should however be noted that the amount of data available for the target domain in real life may not be as large as the Brown corpus used here (over 20K sentences). Therefore, it is interesting to know how much data from the target domain is needed and to what extent it can contribute to parsing accuracy. We investigate this question as part of our experiments with the Foreebank in this chapter. Petrov and McDonald (2012) describe a shared task organized to evaluate the robustness of parsing systems to domain changes and to the noise introduced by web text. The evaluation data was the English (Google) Web Treebank described earlier. The best-performing systems used system combination methods and achieved F1 scores between 80 and 84, where the baseline was in the 75 to 80 range. They find the POS tagging of Web text particularly challenging, performing at just above 90% accuracy, and argue that improving POS tagging can improve

parsing, especially dependency parsing. They also find that better WSJ parsing leads to better Web parsing. They suggest that this behaviour may be an artefact of system combination and that domain adaptation is ultimately required. In other words, further improvement of in-domain parsing may not be the optimal solution to the improvement of out-of-domain parsing. Self-training (Yarowsky, 1995) is one of the most well-studied approaches to parser domain adaptation. McClosky et al. (2006) parse a large unlabelled corpus of news articles from the North American News Text corpus (NANC) using the two-stage reranking parsing system of Charniak and Johnson (2005) trained on the WSJ. They then select subsets of varying sizes from these parses and mix them with the WSJ training data, which is then used to retrain (i.e. self-train) the first-stage parser of the same parsing system. The parsing model self-trained in this way leads to better parses than the original model. Foster et al. (2011b) compare the use of edited sports news articles with user-generated sports forum sentences as unlabelled data in the parser self-training process. They find, contrary to their expectations, that the unlabelled forum data is more useful than the unlabelled edited data for self-training. Self-training was also used by the best-performing system submitted to the Web parsing shared task (Le Roux et al., 2012), where unlabelled web text from the same domain as the evaluation sets is used for self-training. They use a parser accuracy predictor (Ravi et al., 2008) to filter the parsed unlabelled data added to the training set of the self-trained parser. Their method trains a different parsing model for each of the five genres in the Web Treebank and uses each for parsing the test sentences from the corresponding genre. The genre of the test data is predicted using a classifier trained for this purpose. To address the problem of parsing ungrammatical text, Foster et al. (2008) introduce artificial errors into WSJ sentences and then train on the parse trees of these ungrammatical sentences. The errors include agreement errors, real-word spelling errors and verb form errors, as well as word deletion and insertion errors. One or two errors are introduced into each sentence and the gold-standard parse tree of the

sentence is minimally edited to match the new sentence, keeping in mind that “a truly robust parser should return an analysis for a mildly ungrammatical sentence that remains as similar as possible to the analysis it returns for the original grammatical sentence”. The resulting parsing model trained on this data improves over a model trained on the original WSJ by up to 3.7 F1 points when parsing an erroneous WSJ test set created in the same way. Foster (2010) applies a similar approach to parsing the sports forum data set and finds a significant improvement. Seddah et al. (2012) extend the FTB tag set to annotate the contraction and typographic diaeresis phenomena of social media (e.g. using c a dire for c'est-à-dire). The extensions are largely consistent with the English Web Treebank addendum to the Penn Treebank annotation guidelines. They treat the corpus as two parts: a less noisy and a highly noisy one. For the highly noisy part, they first try to automatically reduce the noise by, for example, merging split emoticons, URLs, etc. The corrected sentences are then POS-tagged. Finally, the corrected tokens are mapped to the original ones and the POS tags are transferred to them.6 This method significantly improves the POS tagger output. The annotation of Foreebank is based on the PTB annotation for English and the FTB annotation for French. It also borrows from the English Web Treebank annotation guidelines to address some similar issues. However, it uses a novel method to annotate user errors on the parse trees. Our preliminary experiments on improving parser performance on the Foreebank text are based on using supplementary training data from the target domain, similar to the method used by Gildea (2001).

6When one-to-many mappings are encountered, all the original tokens except the last are tagged with Y, a special tag they introduce; the last one is assigned the real POS tag. In many-to-one cases, the POS tags are merged with + and assigned to the original token.

8.2 Building the Foreebank

In order to build the English Foreebank, we randomly select 1000 segments from a large collection of English Norton forum text containing 3 million segments.7 For the French Foreebank, 500 segments are randomly selected from the 40K French Norton forum segments and 500 from a set of 3000 English Norton forum segments human-translated into French. The segmentation of the original forum text (excluding the 500 translated French segments) and the tokenisation of the resulting segments are done in the same way as for the SymForum data (explained in Section 3.2.1 of Chapter 3) for both data sets. To prepare the data for annotation, the English Foreebank segments are parsed using the Stanford parser (Klein and Manning, 2003) with its built-in lexicalized PCFG parser. We choose this parser because it uses a different algorithm from the Lorg parser, and thus their outputs are more likely to differ, which helps avoid bias in the evaluation of the parsing models used in this work. The French Foreebank segments are, however, parsed using the Berkeley parser, as the Stanford parser did not perform well when evaluated on the FTB evaluation sets.8 Once the data is parsed, it is given to human annotators to correct the parses.9 The annotation guidelines for the English Foreebank are built on the Penn Treebank annotation guidelines (Bies et al., 1995) and those for French on the French Treebank guidelines (Abeillé et al., 2003). To handle the phenomena specific to this text type, we either adopt the English Web Treebank guidelines (Mott et al., 2012) or develop our own strategy. These specifications are described in the following sections. We first describe the preprocessing steps and how we address problems specific to this type of text. We then explain our approach to annotating erroneous structures introduced by users.

7These segments come from the same source as the SymForum data (see Section 3.2.1 of Chapter 3) but do not overlap with it. 8The best performance achieved using various grammars and options of the Stanford parser was 73.59 F1 points, compared to 83.01 points for the Berkeley parser. 9Each of the English and French data sets is annotated by a computational linguist who is a native speaker of the corresponding language.

8.2.1 Handling Preprocessing Problems

Natural language ambiguity makes perfect automatic sentence segmentation and tokenisation almost impossible. User errors, in addition, can both exacerbate the ambiguity and add their own problems. Furthermore, problems are imposed by text styling, especially when the text is extracted from the Web. To account for real-world scenarios, where such problems are inevitable and cannot be automatically fixed, our approach throughout the annotation process is to avoid manually correcting the preprocessing errors caused by user errors or text styling. Instead, we consider these issues as characteristics of user-generated content and handle them in the parse trees where possible. This is one aspect where the Foreebank annotation deviates from the English Web Treebank annotation, which chooses to manually correct these errors before annotation.10 Sentence segmentation is one of the preprocessing steps which can pose challenges when run on this type of text. Since sentence boundaries are vital in parsing, these challenges can affect parsing performance. User-generated errors (e.g. punctuation errors, fusions, etc.), ambiguous punctuation (e.g. full stops and abbreviation periods), text styling (e.g. an address on multiple lines) and the style of text extracted from HTML are all examples of factors contributing to such problems. Regardless of what causes the segmentation problems, they can be categorized into merged sentences and split (broken) sentences. Merged sentences are cases where there are multiple sentences in a segment, each ending – or which must end – with a final punctuation mark, i.e. a full stop, exclamation mark or question mark. In the following example, the line should have been split at When, but the sentence segmenter has been confused by the punctuation error, i.e. the use of a comma instead of a full stop after start.

(1) 7. Combofix will start, When it is scanning don’t move the mouse cursor inside the box, can cause freezing.

10When the instructions are not sufficient or applying these methods is not possible, the segment is dismissed and replaced with another segment to retain the intended data set size.

[Figure 8.2: Annotation of Example 1 (the sentences are not fully shown due to space limit).]

Unlike the English Web Treebank, we do not split these merged sentences. Instead, the parse trees of all of them are gathered under a node with the appropriate label (e.g. S) suffixed with M. This is illustrated in Figure 8.2, where the high-level annotation of Example 1 is displayed.

Split sentences are cases where a sentence is split over multiple lines. This can occur for styling reasons, as in Example 2 below, where the questions are listed on separate lines, or because of ambiguity, as in Example 3, where the period at the end of devs has been interpreted as a full stop by the segmenter.

(2) The questions to Symantec:

(3) I’m sure the devs. can give you more details on this

While the English Web Treebank manually joins sentence segments spread over multiple lines, such as list items, before annotation, we annotate each split part separately if it independently conveys a comprehensible message. Otherwise, the sentence is dismissed from the treebank altogether. For instance, Example 2 can be parsed regardless of what the questions are. On the other hand, Example 3 is not a self-standing expression and is consequently not annotated. Another important factor in parsing is the token boundaries inside a sentence. Similar to the segmentation problems, tokenisation problems can be categorised as merged (fused) tokens (Banerjee et al., 2012) and split (broken) tokens. Merged tokens are treated as a mixture of spelling errors and deleted tokens. For instance, in Example 4 whenI is considered a spelling error for when and I a deleted

token. The annotation of spelling errors and deleted tokens is explained in the next section.

(4) whenI tried to use ...

On the other hand, split tokens are grouped into two types. The first type includes morphologically broken tokens, where a token is split into its morphological components, as in Example 5, where LiveUpdate has been mistakenly broken into Live Update:

(5) problem with Live Update

In such cases, both parts are POS tagged with their correct tag and are suffixed with B, standing for broken. All other splits are considered the second type and treated as a mixture of a spelling error and an extraneous token, which are described in the next section. For instance, In box in Example 6 below is a mistaken split of Inbox and i t in Example 7 is a split of it. In is considered a real word spelling error and box an extraneous token. In contrast, i is treated as a spelling error and t as an extraneous token.

(6) the In box had just the emails from ...

(7) i t keeps causing Norton to lock up ...

8.2.2 Annotating Erroneous Structures

User errors in writing can occur for various reasons, such as carelessness or limited non-native language skills. These errors range from simple punctuation mistakes to completely ungrammatical structures which are incomprehensible. The following are examples of both cases taken from the Foreebank:

(8) This paragraph further confuses the issue?

(9) 5. I was of cause a little bit ??!

This section explains the strategy used to address the annotation of erroneous sentences. To the best of our knowledge, this annotation strategy is novel and the

main idea behind it is to mark the errors on the parse tree. Annotating the errors not only makes various analyses of the forum text possible (see Section 8.3), but also enables us to measure the effect of these errors on parsing (see Section 8.5). Prior to correcting the parse trees, the annotators are asked to minimally correct the user errors in the sentence itself.11 In general, user errors can be categorized into the following types:

1. dropping required tokens
2. inserting extraneous tokens
3. substituting tokens with incorrect but valid ones (real word spelling errors)
4. spelling mistakes
5. arranging tokens in the wrong order

Once a sentence is corrected for these errors, the automatic parse tree of the original sentence is corrected with the aid of the corrected sentence based on the following guidelines:

• A deleted (missing) token is inserted into the tree with its correct POS tag suffixed with D.

• An extraneous token is tagged with Y X. The Y represents an unknown POS tag for such tokens.12 The tag is moved to the lowest most appropriate level in the tree.

• A real word spelling error is tagged with the POS tag of the correct word, suffixed with W (standing for wrong word). The concept of real word errors ranges from very close word forms (e.g. they instead of them) to different parts of speech (e.g. good instead of better) or to a completely different meaning or usage (e.g. desk instead of chair).

• A spelling error, when a token is misspelled into exactly one token, is tagged with the S suffix. A token is considered a case of spelling error if both of the following conditions hold: 1) it is not what it was meant to be, and 2) it is not a valid token such as a dictionary word, slang, an internet or text messaging abbreviation, a proper name, name initials, an emoticon, etc.

11Note that the word forms are kept intact in the trees. 12PTB and Web Treebank use X for unknown constituents, but do not envision any POS tag for such tokens. We introduce Y as a new POS tag for this type of token.

• A capitalization error is treated as a special case of spelling error with the C suffix. Examples are norton internet security instead of Norton Internet Security or nis instead of NIS.

• A word order error (at any distance) is treated as a combination of dropped and excess tokens at the highest possible level in the tree.

• All other errors which are not explicitly mentioned in these guidelines are tagged with the E suffix.

These suffixes are only attached to the POS tags. When a whole constituent qualifies for one of these categories, the appropriate suffix is attached to all pre-terminals under it. It should be noted that, in real word (W), spelling (S) and capitalization (C) errors, the surface form is not corrected on the tree, but only in the corrected sentence.13 An example annotated sentence is presented in Figure 8.3. The original sentence is AutoMatic Log Inns Norton 360 3.0, which is corrected as Automatic Logins Norton 360 3.0.

13The reason is to avoid losing the original sentence form in the tree.

[Figure 8.3: The Foreebank annotation of the sentence AutoMatic Log Inns Norton 360 3.0, corrected as Automatic Logins Norton 360 3.0.]

In addition to user errors, there are cases of innovative use of language occurring in this type of text. Initialisms such as idk, standing for I don't know, are examples of these cases. Such initialisms are neither corrected in the sentence nor broken down into multiple nodes in the tree, even though this can interfere with part-of-speech tagging and treebank annotation. Instead, their tag is suffixed with I.
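To make the suffix scheme concrete, the following minimal sketch shows how an error suffix could be attached to a POS tag. The suffix letters follow the guidelines above, but the function name, the separator character and the error-type labels are illustrative assumptions of this sketch, not part of the annotation toolchain used in this chapter.

```python
# Illustrative sketch of the Foreebank error-suffix scheme described above.
# The suffix letters follow the guidelines; the function name, the separator
# and the error-type labels are hypothetical choices for this example.

ERROR_SUFFIXES = {
    "merged_sentence": "M",       # multiple sentences gathered under one node
    "deleted_token": "D",         # missing token inserted into the tree
    "extraneous_token": "X",      # tagged Y_X, Y being the unknown POS tag
    "real_word_error": "W",       # wrong but valid word form
    "spelling_error": "S",        # misspelled into exactly one token
    "capitalisation_error": "C",  # special case of spelling error
    "broken_token": "B",          # morphologically split token
    "initialism": "I",            # innovative initialism such as idk
    "other": "E",                 # any error not covered above
}

def suffix_pos_tag(pos_tag: str, error_type: str, sep: str = "_") -> str:
    """Attach the Foreebank error suffix for `error_type` to a POS tag."""
    return pos_tag + sep + ERROR_SUFFIXES[error_type]

# One reading of the example in Figure 8.3: AutoMatic carries a capitalisation
# error, Log a real word error (for Logins) and Inns an extraneous token.
print(suffix_pos_tag("NNP", "capitalisation_error"))  # NNP_C
print(suffix_pos_tag("NNS", "real_word_error"))       # NNS_W
print(suffix_pos_tag("Y", "extraneous_token"))        # Y_X
```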

8.3 Analysing the Foreebank

In this section, we analyse the Foreebank text and its annotation by extracting statistics targeting various characteristics of the data set. Some of these statistics are compared to the WSJ and FTB for English and French respectively in order to better understand the difference between these data sets which can potentially affect the performance of parsers trained on these text types. We first analyse the textual specifications of the data. We then extract some statistics about the syntactic annotation of the Foreebank data sets. At the end, we measure the user error rate in the Foreebank text.

8.3.1 Characteristics of the Foreebank Text

Table 8.1 presents some characteristics of the English and French Foreebank text, such as the average sentence length, and compares them to those of the WSJ and FTB corpora. It also gives the out-of-vocabulary (OOV) rate of these data sets with respect to the WSJ and FTB. According to the table, the Foreebank sentences are on average shorter than the WSJ and FTB sentences. While the standard deviations of the English data sets are similar, the standard deviation of the FTB is higher than that of the French Foreebank due to the existence of very long sentences in the FTB. The table also shows that the OOV rates of both the English and French Foreebank with respect to their corresponding edited newswire treebanks are high. These numbers can be compared to the OOV rate of the WSJ test section with respect to its training section, which is 13.2%, and that of the FTB test section with respect to its training section, which is 21.6%. It can be seen that the OOV rate of the French Foreebank is higher than that of the English one, most probably due to the larger size of the WSJ compared to the FTB. Additionally, the OOV rate of the English Foreebank

Table 8.1: Characteristics of the English and French Foreebank corpora compared with those of the WSJ and FTB. The OOV rates are computed with respect to the WSJ and FTB for the English and French Foreebank respectively.

                              English                 French
                          Foreebank    WSJ       Foreebank    FTB
Average sentence length      15.4      23.8         19.6      28.4
Sentence length SD           11.1      11.2         12.8      16.5
Maximum sentence length      89        141          86        260
OOV rate                     33.3%     -            39.1%     -

is more than 2.5 times as big as that of the WSJ test set, while the OOV rate of the French Foreebank is less than 2 times as big as that of the FTB test set. This suggests that a bigger performance drop due to unknown words should be expected in parsing the English Foreebank than the French Foreebank. This will be further clarified in Section 8.4, where the parsers are evaluated on the Foreebanks.
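As an illustration of how the figures in Table 8.1 can be derived, the sketch below computes the average sentence length and a token-level OOV rate from tokenised corpora. The function names, the lower-casing choice and the token-level (rather than type-level) definition of the OOV rate are assumptions of this example, not specifications taken from the thesis.

```python
# Minimal sketch of the corpus statistics in Table 8.1, assuming each corpus
# is a list of tokenised sentences (lists of token strings). Function names
# and the casing/token-level choices are assumptions of this example.

from statistics import mean, pstdev

def length_stats(sentences):
    lengths = [len(s) for s in sentences]
    return mean(lengths), pstdev(lengths), max(lengths)

def oov_rate(test_sentences, reference_sentences, lowercase=True):
    """Proportion of test tokens whose form never occurs in the reference corpus."""
    norm = (lambda w: w.lower()) if lowercase else (lambda w: w)
    vocab = {norm(w) for s in reference_sentences for w in s}
    test_tokens = [norm(w) for s in test_sentences for w in s]
    unknown = sum(1 for w in test_tokens if w not in vocab)
    return unknown / len(test_tokens)

# e.g. oov_rate(foreebank_sentences, wsj_sentences) would give a figure of the
# kind reported above (33.3% for the English Foreebank), modulo tokenisation
# and casing choices.
```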

8.3.2 Characteristics of the Foreebank Annotations

Table 8.2 displays the number of occurrences of each specific tag suffix explained in Section 8.2 in the annotation of the English and French Foreebank, together with their percentage with respect to the total number of tokens.14 These suffixes represent the errors made by the user. The statistics for the French Foreebank are presented separately for the monolingual (mono) and translated (trans) sections. According to the table, capitalisation errors are the most frequently occurring error type for both languages, mainly due to the existence of many product names in the corpora. While deleted tokens are as frequent as capitalisation errors in the English data set, they are not as frequent in the French data sets, especially in the translated section. Spelling errors are the next most frequent problem, albeit mainly in the monolingual section in the case of the French Foreebank. Real word errors occur as often as spelling errors in the English Foreebank. They are also frequent in the French Foreebank, where they are

14The percentage of the merged sentence suffix ( M) is computed with respect to the number of sentences instead of tokens.

Table 8.2: Number and percentage of tag suffixes in the English and French Foreebank annotation

                                    English         French mono     French trans
Suffix  Explanation                  #      %        #      %        #      %
M       Merged sentences             3    0.3%       3    0.6%      24    4.8%
D       Deleted token              120    0.16%     32    0.07%      7    0.01%
X       Extraneous token            33    0.04%      4    0.01%      4    0.01%
W       Real word error             70    0.09%     34    0.08%      3    0.01%
S       Misspelled token            76    0.1%     101    0.25%      9    0.02%
C       Capitalisation error       124    0.16%    110    0.26%     72    0.12%
B       Broken token                 1    0%        13    0.03%      6    0.01%
I       Innovative initialism        1    0%         8    0.02%      0    0%
E       Other errors                 1    0%         1    0%         0    0%

the next most common errors after spelling mistakes in the monolingual section of this data set. It can also be seen that there are more cases of extraneous tokens in the English than in the French Foreebank. Finally, merged sentences are found to be frequent in the translated section of the French Foreebank, mainly due to the original sentence segmentation of this data. We additionally find that the most frequent POS tags carrying user errors in the English Foreebank are NNP (proper noun) with capitalization errors and the punctuation tag , (used for , as well as - and /) for deleted tokens. For French, they are NC (common noun) and NPP (proper noun) with capitalization errors. In sum, it seems that capitalization of proper nouns is the major error in the Foreebank data sets, especially in the French one, mainly due to product names. Deleted tokens are also a major source of problems in the English Foreebank. Overall, errors occur on only a small fraction of the tokens in both data sets. The next section provides further analysis of these errors.

8.3.3 User Error Rate

In order to find the level of user error in the forum text, we calculate the edit distance between the original sentences and their edited versions using the special suffix tags

Table 8.3: Number of user errors and the edit distance between the original and edited Foreebank sentences; Ins: inserted (extraneous), Del: deleted (missing), Sub: substituted, Total: Ins+Del+Sub, anED: average normalised edit distance

                          Ins    Del    Sub    Total    anED
English Foreebank          33    120    270     423     0.03
French Foreebank mono       4     32    245     281     0.04
French Foreebank trans      4      7     84      95     0.01

on the POS tags of the tokens in the Foreebank annotation. We consider three error categories: 1) inserted (extraneous) tokens, identified by the X tag suffix, 2) deleted (missing) tokens, identified by the D tag suffix, and 3) substituted tokens, including spelling errors (S tag suffix), real word errors (W tag suffix) and capitalization errors (C tag suffix). The number of these suffixes is counted for each sentence; the edit distance is computed by summing them, and is normalised by dividing the sum by the maximum of the lengths of the original sentence and its edited version. Table 8.3 shows the results for both the English and French Foreebank, with the latter broken down into its monolingual (mono) and translated (trans) sections. The total numbers of insertions (Ins), deletions (Del) and substitutions (Sub), as well as their sum and the average normalised edit distance at the document level, are given in the table. The results reveal that, despite the existence of some near-incomprehensible erroneous sentences, the overall error level is very low. As also observed in the previous section, there are more extraneous and missing words in the English Foreebank than in the French one. However, the numbers of substitutions are closer (270 vs. 245+84) and larger for French. Not surprisingly, the translated sentences of the French Foreebank contain fewer errors.15 In Section 8.5, we will investigate the part these errors play in the parsing performance of the forum text.

15The reason for these errors existing in the translated sentences in the first place is that the translation guidelines emphasise the minimal transformation of the source sentences during the translation, noting that most of the substitution errors (72 out of 84) are the capitalization prob- lems.
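The edit-distance computation described above can be sketched as follows; the input format (per-sentence suffix counts together with the original and edited sentence lengths) is an assumption made for this illustration.

```python
# Sketch of the normalised edit distance described above, computed directly
# from the Foreebank suffix annotation of each sentence. The input format is
# our own assumption for illustration.

SUBSTITUTION_SUFFIXES = {"S", "W", "C"}   # spelling, real word, capitalisation

def sentence_edit_distance(suffix_counts):
    ins = suffix_counts.get("X", 0)       # extraneous tokens
    dele = suffix_counts.get("D", 0)      # missing tokens
    sub = sum(suffix_counts.get(s, 0) for s in SUBSTITUTION_SUFFIXES)
    return ins + dele + sub

def normalised_edit_distance(suffix_counts, len_original, len_edited):
    return sentence_edit_distance(suffix_counts) / max(len_original, len_edited)

def average_normalised_edit_distance(sentences):
    """`sentences` is an iterable of (suffix_counts, len_original, len_edited)."""
    scores = [normalised_edit_distance(c, lo, le) for c, lo, le in sentences]
    return sum(scores) / len(scores)

# Applied to the English Foreebank annotation, this procedure yields the
# document-level anED of 0.03 reported in Table 8.3.
```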

8.4 Parser Performance on the Foreebank

With the availability of Foreebank, it is now possible to evaluate the performance of the parsers applied to the forum text in the previous experiments. Comparing these performances to those on the in-domain evaluation sets will help us understand the amount of loss due to the domain shift and text type change from newswire to Norton user forums. To this end, we parse the English and French Foreebanks using the Lorg parsing models used to parse the SymForum data set for the QE experiments. The English model is trained on the entire WSJ and the French model on the entire FTB. However, we also need to evaluate the parser performance on the in-domain test data of the parsing models. Therefore, we additionally parse the Foreebank as well as the test sections of the WSJ and FTB using Lorg parsing models trained only on the training sections of the corresponding treebanks. As explained in the previous section, the Foreebank trees contain annotations for the user errors, with special suffixes on the labels and insertion of missing words (D-suffixed nodes). Since these suffixes are not present in the WSJ and FTB, we remove them for these experiments. We also remove the D-suffixed nodes and their corresponding tokens, since we parse the original sentences. Table 8.4 and Table 8.5 display the evaluation results of these parsing models on the English and French Foreebank sentences as well as the test sets of the WSJ and FTB respectively. The results for the French Foreebank are presented separately for each section. There is a large F1 gap of about 15 points between the parsing accuracies of the English Foreebank and the WSJ test set. The gap is substantially bigger than what has been reported between the WSJ and other corpora such as the Brown corpus and the British National Corpus (Foster and van Genabith, 2008). This is probably due to a greater distance between the WSJ and the Foreebank than between the WSJ and those corpora in terms of vocabulary and grammatical constructions, given that we found in Section 8.3.3 that only a small amount of editing was required to correct the user errors in the Foreebank. In addition, this gap is also bigger than that between

Table 8.4: Comparison of parsing WSJ and forum text using a WSJ-trained parser

Test set     Training set    P       R       F1
Foreebank    WSJ Whole       76.10   77.00   76.55
Foreebank    WSJ Train       74.21   75.61   74.90
WSJ test     WSJ Train       89.95   89.15   89.55

Table 8.5: Comparison of parsing FTB and forum text using a FTB-trained parser

Test set          Training set    P       R       F1
Foreebank mono    FTB Whole       73.84   75.15   74.49
Foreebank mono    FTB Train       73.83   74.61   74.22
Foreebank trans   FTB Whole       78.65   78.90   78.77
Foreebank trans   FTB Train       78.56   78.20   78.38
FTB test          FTB Train       83.53   83.28   83.40

the WSJ and English Web Treebank parsing reported by Petrov and McDonald (2012) as well as parsing WSJ and the sports discussion forum reported by Foster et al.(2011a) (both by about 4 points). However, it is smaller than the 19 F 1 points difference between parsing WSJ and the tweets observed by Foster et al.(2011a), which is expected due to a farther distance between the WSJ and Twitter in terms of sentence structure and also vocabulary. In Section 8.5, we will try to shed more light on this matter. The performances of parsing the English Foreebank and the monolingual section of the French Foreebank are similar when the English parser is trained on the training section of the WSJ (74.90 vs. 74.49 and 74.22 F1 points). However, compared to parsing the English Foreebank, the performance drop in parsing French Foreebank is relatively smaller: the former drops from 89.55 F1 points to 74.90 and the latter from 83.40 to 74.22 and 78.77 for the monolingual and translation sections respectively. This suggests that the French parsing model is better generalisable to the forum text, or alternatively, the FTB test set is more distant from its training set than the WSJ one, which confirms our anticipation based on the OOV rate in Section

8.3.1. The difference with parsing the FTB test set is just above 9 F1 points for the

monolingual section and 5 points – even smaller – for the translated section. This 4-point difference may be due to the higher level of user error in the monolingual section. This will be further clarified in Section 8.5. The effect of using the combined training and evaluation sections of the WSJ and FTB instead of only their training sections is also worth noting. While adding the WSJ development and test sets (about 5,500 sentences) increases the F1 score of Foreebank parsing by 1.5 points, the 2,500 FTB development and test sentences have little effect on parsing the French Foreebank. In the former case, the training size increases by approximately 14%, whereas in the latter case the increase is 25%. Contrary to this result, therefore, the supplementary training data would normally be expected to be more useful for the FTB, which is also much smaller than the WSJ and should benefit more from additional training data. This suggests that the new FTB sentences are still not enough or do not bring additional information to the parsing model. Compared to the results reported by Gildea (2001), where adding training data from a domain different from the test set had no benefit, we can see that the additional WSJ evaluation sections, which are even smaller than the Brown corpus, improve parsing of the Foreebank. On the other hand, similar to the result of Gildea (2001), the additional in-domain training data is not effective for the French parsing, although the amounts of additional data are not comparable.
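The tree clean-up applied before the evaluations in this section – stripping the error suffixes from the labels and removing the D-suffixed nodes together with their tokens – can be sketched as below. The use of NLTK's Tree class, the underscore separator and the example tree are assumptions of this sketch, not the actual tooling used in the thesis.

```python
# Sketch of the tree clean-up described above: error suffixes are stripped from
# node labels and D-suffixed nodes (inserted missing tokens) are dropped with
# their terminals before evaluation. Assumes suffixes are attached with "_".

from nltk import Tree

SUFFIXES = {"M", "D", "X", "W", "S", "C", "B", "I", "E"}

def strip_suffix(label):
    base, _, suffix = label.rpartition("_")
    return base if base and suffix in SUFFIXES else label

def clean_tree(tree):
    """Return a copy of `tree` with suffixes removed and D-nodes dropped."""
    if not isinstance(tree, Tree):        # a terminal token
        return tree
    if tree.label().endswith("_D"):       # inserted missing token: drop it
        return None
    children = [c for c in (clean_tree(child) for child in tree) if c is not None]
    if not children:                      # node left empty after deletions
        return None
    return Tree(strip_suffix(tree.label()), children)

# Invented example tree: a capitalisation error, an inserted missing token and
# a spelling error.
example = Tree.fromstring("(NP (NNP_C norton) (NNP_D Internet) (NNP_S securety))")
print(clean_tree(example))   # (NP (NNP norton) (NNP securety))
```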

8.5 The Effect of User Error on Parsing

As explained in Section 8.2.2, as part of the Foreebank annotation, we ask the annotator to minimally correct the errors made by the forum users. This provides an opportunity to examine the effect of these errors on the parsing accuracy. We therefore parse the corrected versions of the sentences in both English and French Foreebank and compare their accuracy measured by the PARSEVAL metric with the accuracy of their original version. For parsing the sentences, we use the parsing models described in Section 4.3 of Chapter4 used to parse the SymForum data set.

Table 8.6: Comparison of parsing original and edited forum sentences

            English                   French mono               French trans
            P       R       F1        P       R       F1        P       R       F1
Original    76.10   77.00   76.55     73.84   75.15   74.49     78.65   78.90   78.77
Edited      77.62   78.61   78.11     74.69   76.03   75.35     78.84   79.14   78.99

For English, it is the Lorg parser trained on the entire WSJ, and for French the same parser trained on the whole FTB. The performances of the parsers are shown in Table 8.6. For French, the evaluations on the monolingual (mono) and translated (trans) sections of the Foreebank are presented separately. According to the results, correcting the user errors before parsing leads to improved parsing accuracy for the English Foreebank, where the new parses achieve more than 1.5 points higher F1, with precision and recall increasing similarly. Considering the amount of editing on each data set shown in Table 8.3, this improvement is noticeable. On the other hand, despite a (slightly) higher edit distance for the monolingual French Foreebank (0.04 vs. 0.03), a smaller impact is observed on the monolingual section of the French Foreebank (about 1 F1 point). Parsing of the translated section is unchanged. A look at the detailed error corrections in Table 8.3 suggests that the inserted and deleted tokens may have a larger effect on parser errors than the substituted tokens, as their number is higher for English. This can be explained by considering that the correction of substitution errors normally has a small effect on the phrase structure. For example, correcting the capitalisation errors mainly changes NN tags to NNP, which is not expected to affect the tree structure as much as the insertion or deletion of a node. This is in contrast to the results obtained by Foster et al. (2008), where the real word spelling errors had the largest effect on the parsing of the artificially generated noisy WSJ. However, it is in line with the observation by Foster (2010), where the effect of spelling errors was found to be small on parsing the sports discussion forum, a text more similar to the Foreebank than the WSJ is.

Table 8.7: Using English Foreebank as training data, both alone and as a supplement to the WSJ, and also supplemented by the English Web Treebank (EWT), evaluated in a 5-fold cross validation setting on the Foreebank

                        Lorg Constituency          Stanford Dependency
Training set            P       R       F1         LAS      UAS
WSJ (Whole)             76.06   76.95   76.50      74.26    79.41
Foreebank               69.61   69.27   69.44      64.55    71.63
WSJ+Foreebank           78.01   78.71   78.36      76.51    81.14
WSJ+Foreebank+EWT       77.96   79.01   78.48      76.67    81.34

8.6 Improving Parsing of Forum Text

The simplest method to improve the accuracy of parsing forum text by exploiting the Foreebank is to use it as a supplementary training data set to the WSJ and FTB. In this section, we combine the two treebanks and retrain the English and French parsers used to parse the SymForum data for syntax- and semantic-based quality estimation in the previous chapters. These parsers include the English and French Lorg constituency parsers which are already evaluated on the Foreebank in Section 8.4 and additionally the dependency conversions of the English parses by the Stanford converter and the French dependency parses by the ISBN parser.16 In order to test the parsers, we run a 5-fold cross validation, in which the Foreebank is randomly split into five parts, each used for the evaluation of the parsers trained at each fold on either of the WSJ or FTB plus the other four parts. The entire WSJ and FTB are used. Additionally, we evaluate the parsing models built using the English and French Foreebank alone as well as the WSJ and FTB alone within the same cross validation setting. Table 8.7 and Table 8.8 show the results for English and French respectively. Note that, to have bigger test sets at each fold, the French Foreebank is used as a whole instead of splitting it into its two subsections. According to Table 8.7, the parser trained on the Foreebank is substantially outperformed by the parser

16The ISBN model is trained on the dependency conversion of the French Foreebank using the Const2Dep tool.

Table 8.8: Using French Foreebank as training data, both alone and as a supplement to the FTB, evaluated in a 5-fold cross validation setting on the Foreebank

                  Lorg Constituency          ISBN Dependency
Training set      P       R       F1         LAS      UAS
FTB (Whole)       76.62   77.33   76.97      72.29    79.42
Foreebank         72.46   72.35   72.40      76.36    80.10
FTB+Foreebank     79.59   80.36   79.98      76.04    82.31

trained on the WSJ, showing that a small amount of in-domain training data is not preferable to a much larger amount of out-of-domain data. This applies to both constituency and dependency parsing. Combining the WSJ and the Foreebank im- proves the F1 of the constituency parsing by about 2 points over when only the WSJ is used for training. The dependency parsing also benefits from this combination to a similar extent (above 2 points of LAS). Considering that Foreebank is orders of magnitude smaller than the WSJ, the gain by its addition to the WSJ is noticeable. On the French side, as seen in Table 8.8, we observe slightly bigger improvements by combining the FTB and Foreebank, where the F1 of constituency parsing is increased by 3 points and the LAS of dependency parsing by approximately 4 points. However, we observe a different behaviour using only Foreebank for training. First, there is a smaller gap of 3.5 F1 points between the performance of the constituency parsing model trained in this way and the one trained on the FTB, compared to 7 points for English. This is probably due to the fact that the proportional size of the French Foreebank with regard to the FTB is bigger than that of the English Foreebank with respect to the WSJ. Second, the dependency parses of the ISBN model trained on the Foreebank achieve higher scores than the one trained on the FTB, especially in terms of LAS which is even slightly higher than that of the combined model. This is in contrast with the results of English dependency parsing in Table 8.7. This difference may be related to the way the French and the English dependency parses are obtained. The French dependency parses are the results of parsing with a model trained on the conversions of the gold-standard Foreebank,

unlike the English ones, which are converted from the automatic constituency parses. Therefore, the English dependency parses are probably affected by the low quality of their source constituency parses, while the French dependency parser has been able to better exploit the less noisy dependency structures.17 The size of the supplementary Foreebank data used above is very small. It is expected that a larger amount of such data would further improve the parsing performance. Since additional hand-annotated parse trees from the Norton forum text are not available, it is interesting to know whether other data which share some characteristics with this data can be used as a replacement. For example, although not in the same domain, the English Web Treebank is similar to our data in that it is user-generated, extracted from the Web and contains similar text types such as newsgroups, Yahoo! answers and emails. We therefore experiment with adding this data set to the combination of the WSJ and Foreebank and retraining the English parsers. The evaluation results are shown in the last row of Table 8.7. According to the results, despite its relatively large size (about 15K sentences), there are only tiny increases in the scores when using the English Web Treebank as additional training data. This suggests that similarity in vocabulary is more important than similarity in the style of the text. It is worth noting that the precision of the constituency parsing is slightly lower with this parsing model, a measure which is perhaps more important for the quality of downstream semantic role labelling. Another way to compensate for the small proportion of the Foreebank used to supplement the parser training data is to increase the weights of the parsing rules extracted from its trees in the grammar. To accomplish this, we simply replicate the Foreebank trees in the training sets of both the English and French parsers. For the English parser, we start by experimenting with two settings: one containing 5 replications of the Foreebank and the other 10. For the

17Although the dependency trees on which the ISBN is trained are the conversions of the gold- standard Foreebank, the conversion process involves a statistical process which automatically labels the trees with function tags required by the converter. This may introduce a little noise to the resulting dependency trees.

Table 8.9: Replicating the English Foreebank in the training data of the parsers (highest scores in bold)

                        Lorg Constituency          Stanford Dependency
Training set            P       R       F1         LAS      UAS
WSJ+Foreebank           78.01   78.71   78.36      76.51    81.14
WSJ+Foreebank×5         78.89   79.27   79.08      76.51    80.78
WSJ+Foreebank×10        77.93   78.95   78.44      76.69    81.08

Table 8.10: Replicating the French Foreebank in the training data of the parsers (highest scores in bold)

                       Lorg Constituency          ISBN Dependency
Training set           P       R       F1         LAS      UAS
FTB+Foreebank          79.59   80.36   79.98      76.04    82.31
FTB+Foreebank×3        80.02   80.74   80.38      78.16    82.97
FTB+Foreebank×5        79.95   80.65   80.30      78.01    82.77

French, the in-domain portion of which is smaller, we also experiment with two settings, one containing 3 and the other 5 replications of the Foreebank. The plan is to increase the number of replications further if the higher number of replicates performs better in these settings. Table 8.9 displays the results of the replications for English and Table 8.10 for French. For comparison, the scores for a single copy are repeated from Table 8.7 and Table 8.8 respectively (in grey). Looking at the results for English first, it can be seen that replicating the Foreebank five times (WSJ+Foreebank×5) improves the constituency parses over using only one copy in the combination, although the gain is small. However, the quality of the dependency conversions of these parses does not improve. On the other hand, further replication of the Foreebank seems to adversely affect the parsing rules extracted from its trees, as the evaluation scores of the constituency parsing model trained on WSJ+Foreebank×10 suggest. The dependency conversions of these parses are of a similar quality to those of the other models, as indicated by their scores.

The replication of the Foreebank is also useful for parsing the French Foreebank, as shown in Table 8.10. However, in contrast to English, the dependency parses benefit more than the constituency parses. On the other hand, the effect of further replication is similar to what was observed for English, i.e. slightly detrimental. In sum, adding in-domain data to the parser training set improves the parsing performance, even though only a small amount compared to the existing out-of-domain data is added. However, the larger out-of-domain training set cannot be replaced completely. Replicating this small amount in the combination can help further, albeit not significantly. In the next section, we apply the new parsers to the SymForum data and use the resulting parses in semantic role labelling of the data, which is subsequently used to rebuild the semantic-based QE systems of the previous chapter.
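A minimal sketch of the training-set construction used in this section is given below: each cross-validation fold combines the full out-of-domain treebank with a number of copies of the remaining Foreebank folds. The representation of treebanks as plain lists of trees and all function names are assumptions of this illustration.

```python
# Sketch of the cross-validation training-set construction described above:
# each fold trains on the out-of-domain treebank plus k copies of the other
# Foreebank folds and is evaluated on the held-out fold. Names are assumptions.

import random

def five_fold_splits(foreebank, seed=0, k_folds=5):
    trees = list(foreebank)
    random.Random(seed).shuffle(trees)
    return [trees[i::k_folds] for i in range(k_folds)]

def build_training_sets(out_of_domain, foreebank, replications=1, seed=0):
    """Yield (train, test) pairs for a 5-fold cross validation."""
    folds = five_fold_splits(foreebank, seed)
    for i, test_fold in enumerate(folds):
        in_domain = [t for j, fold in enumerate(folds) if j != i for t in fold]
        train = list(out_of_domain) + in_domain * replications
        yield train, test_fold

# e.g. the WSJ+Foreebank×5 setting corresponds to
#   build_training_sets(wsj_trees, foreebank_trees, replications=5)
```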

8.7 Semantic-based QE with Improved Parses

The objective of improving the performance of parsing of the forum text was to improve the accuracy of the quality estimation of machine translation of this type of text. In Chapter5, we found that the parsing performance did not affect the accuracy of syntax-based quality estimation. However, as we discussed earlier in this chapter, the performance of parsing can influence the accuracy of semantic- based QE through its impact on the quality of underlying semantic role labelling. Inspired by this idea, we acquire new semantic role labellings of the SymForum data using the improved parses from the previous section. Specifically, the English side of the SymForum is parsed using the Lorg parser trained on the entire WSJ plus five replications of the Foreebank and converted to dependencies (the model in the second row of Table 8.9). The new constituency and dependency parsing models achieve 2.5

F1 and 2.3 LAS points above the original models respectively according to Tables 8.7 and 8.9. The French side of the data set is parsed using the Lorg parser trained on

the entire FTB plus three replications of the Foreebank to obtain the constituency parses, and using the ISBN parser trained on the dependency conversions of these parses. The new French parsing models perform 3.5 F1 and 6 LAS points higher for constituency and dependency parsing respectively, as can be seen in Tables 8.8 and 8.10. When measured in terms of LAS, the new English dependency parses are 80% and the French ones 73% similar to those used in Chapter 7 for SRL. Once the data is parsed, it is semantic role labelled using the same models used in Chapter 7. When measured in terms of F1, the new English semantic role labelling is 85% and the French one 76% similar to the labelling obtained in Chapter 7. The semantic-based QE systems are then built using these labellings. From among the semantic-based QE systems built in Chapter 7, we replicate the word alignment-based PAM

(WAPAM) and the combined semantic-based QE system (SeQE) as presented in the next sections.

8.7.1 The Word Alignment-based PAM

In Section 7.5.1 of Chapter7, we introduced the WAPAM metric to estimate the translation quality by matching the semantic predicate-argument structure of the source and target using the word alignments between them. However, we found in Section 7.5.4 of the same chapter that the low quality of the predicate-argument structure hindered the performance of this metric. We now examine this metric using the new semantic role labelling built upon the improved parses. From among the six different WAPAM score types, we choose the two best performing ones, namely unlabelled recall (URec) and F1 (UF1) instead of all scores. Table 8.11 displays the results of the estimation with WAPAM using the new se- mantic role labelling of the SymForum data as well as the results of the old WAPAM from Chapter7 (in grey) for comparison. The new metric scores are identified by the new subscript. According to the table, the estimations of all evaluation metrics improve using the new semantic role labelling except the HTER in terms of both measures and the Adequacy in terms of RMSE. The only statistically significant

Table 8.11: Performance of word alignment-based PAM (WAPAM) using the old (in grey) and new SRLs

           1-HTER             HBLEU              Adequacy           Fluency
           RMSE      r        RMSE      r        RMSE      r        RMSE      r
URec       0.3249    0.2483   0.3706    0.2338   0.3213    0.2796   0.4171    0.2927
URecnew    0.3321    0.2436   0.3675    0.2433   0.3198    0.2981   0.4041    0.3257
UF1        0.3175    0.2328   0.3607    0.2179   0.3108    0.2698   0.4033    0.2865
UF1new     0.3237    0.2289   0.3597    0.2278   0.3118    0.2806   0.3970    0.3082

change is the improvement of the Fluency estimation using unlabelled recall (URec). Considering the amount of improvement expected for the semantic role labelling of the data, given the improvement we observed in the previous section for its parsing quality, the gains in the WAPAM scores seem encouraging. However, this metric relies on the balance between the quality of the semantic role labelling of the source and target in addition to their absolute quality, since it can interpret a predicate or argument missed by the SRL system on either side as a missing or spurious translation. The unexpected degradations can therefore be attributed to this balance being neglected here.18
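For illustration, the sketch below computes an unlabelled WAPAM-style score by projecting source predicate-argument spans through the word alignments and checking for overlap with the target spans. The exact definitions of the six WAPAM score types are given in Chapter 7, so this is only an approximation, and the direction in which precision and recall are computed is a choice made for this example rather than taken from the thesis.

```python
# Illustrative approximation of an unlabelled WAPAM-style score. Spans are sets
# of token indices; alignments are (source_index, target_index) pairs. Which
# side defines precision and which recall is an assumption of this sketch.

def project(span, alignments):
    """Map a set of source token indices through source-to-target alignments."""
    return {t for s, t in alignments if s in span}

def unlabelled_scores(source_spans, target_spans, alignments):
    matched_src = sum(
        1 for span in source_spans
        if any(project(span, alignments) & tgt for tgt in target_spans)
    )
    matched_tgt = sum(
        1 for tgt in target_spans
        if any(project(span, alignments) & tgt for span in source_spans)
    )
    recall = matched_src / len(source_spans) if source_spans else 0.0
    precision = matched_tgt / len(target_spans) if target_spans else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1
```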

8.7.2 The Semantic-based QE System

In addition to the PAM metric, in Section 7.7, we built SeQE, a statistical system which exploits the semantic role labelling of the source and target via tree kernels and hand-crafted features. This system is rebuilt here using the new semantic role labelling of the SymForum data.

Table 8.12 shows the performance of this system (SeQEnew), as well as that of the SeQE replicated from the previous chapter (in grey). All the scores have improved using the new SRL. The changes are especially large for the human-targeted metrics. The HTER prediction achieves the largest improvement and the Fluency prediction

18It should be noted that the WAPAM scores used here are the unlabelled scores. As we observed in Section 7.2 of Chapter7, English and French SRL achieve very close unlabelled scores, when evaluated on the in-domain data. However, the SRL models are applied to the MT output in the French side, which can also result in further noise. Moreover, these in-domain evaluation data are different for the two languages and the comparison is not completely meaningful.

Table 8.12: Semantic-based QE systems using the old (SeQE) and new SRLs (SeQEnew); underlined scores are statistically significantly better.

            1-HTER             HBLEU              Adequacy           Fluency
            RMSE      r        RMSE      r        RMSE      r        RMSE      r
SeQE        0.2249    0.3884   0.2710    0.3648   0.2242    0.4447   0.2404    0.5182
SeQEnew     0.2214    0.4223   0.2680    0.3892   0.2224    0.4584   0.2390    0.5233

the lowest. However, only the HTER prediction changes are statistically significant. When the system is broken down into its tree kernel and hand-crafted component, we find that the improvements are contributed by the tree kernels. This can be seen in Table 8.13, where the performance of the tree kernel component using the new

SRL (SSTKnew) is higher than that of the one built in Section 7.3 of Chapter 7 (SSTK), but the hand-crafted system using the new SRL (SeHCnew) performs significantly worse than the system built using the old SRL (SeHC19). This suggests that the tree kernels are able to make use of the improved semantic role labelling, as they have a better means of exploiting it through access to the entire annotated tree. While the improvements of the tree kernel system are only statistically significant for the prediction of the human-targeted metrics, the hand-crafted system using the new SRL degrades statistically significantly for manual metric prediction. Moreover, the magnitude of the degradations of the hand-crafted system tends to be bigger than the extent of the improvements seen for the tree kernel systems. The combination of the two components is nevertheless successful: the improved component seems to outweigh the degraded one, so that the combined system (SeQEnew) also improves using the new SRL and is better than SSTKnew. All in all, it seems that a more substantial improvement in the performance of the SRL of the SymForum data is required for it to be sufficiently effective in the downstream semantic-based QE. Additionally, we conjecture that a balanced semantic role labelling quality between the source and target – in our case requiring the French SRL, especially when applied to the MT output, to be improved to the

19 This system was named SeHC+pam in Section 7.5.5 of Chapter7.

Table 8.13: Semantic tree kernel and hand-crafted QE systems using the old (in grey) and new SRLs (subscripted with new); underlined scores are statistically significantly better.

            1-HTER             HBLEU              Adequacy           Fluency
            RMSE      r        RMSE      r        RMSE      r        RMSE      r
SSTK        0.2269    0.3682   0.2722    0.3537   0.2253    0.4351   0.2425    0.5046
SSTKnew     0.2224    0.4152   0.2692    0.3788   0.2246    0.4409   0.2404    0.5158
SeHC        0.2445    0.2387   0.2822    0.2368   0.2370    0.3571   0.2575    0.3908
SeHCnew     0.2449    0.2278   0.2873    0.2083   0.2459    0.3077   0.2651    0.3550

same level as the English SRL – could boost the performance of semantic-based quality estimation, both with the PAM metric and with the statistical systems.

8.8 Summary and Conclusion

In this chapter, we developed two constituency syntax treebanks, one for English and one for French, from Symantec Norton forum text. The aim of creating such resources was to use them in the evaluation and improvement of parsing Norton forum text, with a view to improving QE. These parses have been used in the quality estimation systems built throughout this thesis, both directly in the syntax-based QE systems and indirectly in the semantic-based QE systems via the semantic role labelling systems built upon them. We devised new annotation strategies to handle phenomena specific to user-generated content. Instead of correcting user errors prior to annotation, we addressed them during the annotation of the parse trees themselves via a set of suffixes attached to the POS tags of the erroneous tokens. This not only helps account for real-world scenarios where human intervention to correct such errors prior to parsing is not possible, but also provides a means of annotating user errors and measuring the user error rate in the data. Using these data sets, we found that only a small fraction of the tokens contain any kind of user error, most of which are spelling and capitalization errors. Correcting these errors could improve the parsing performance to a relatively noticeable

degree considering their small quantity. We also used these data sets to supplement the training data of the original parsers and found that they could boost the performance of parsing the Norton forum text, despite the small sizes of these supplementary data sets. Replicating these data sets in the training sets to increase their weights in the resulting parsing models proved to be slightly useful. The improved parsers were finally applied to the SymForum data set used in the quality estimation experiments in the previous chapters, with the aim of improving the semantic role labelling of this data. The PAM QE metric and the semantic-based QE system were then rebuilt using the new labellings. While the accuracy of the PAM and the semantic-based system in estimating most of the evaluation metrics improved, only some of the changes were statistically significant. We conclude that further enhancement of the parsing of the forum text, for which there is considerable room, can increase the performance of the semantic-based quality estimation by improving the underlying semantic role labelling. However, the lower quality of the French semantic role labelling, especially on the MT output, which causes the imbalance between the source and target labelling, still remains a crucial stumbling block.

Chapter 9

Conclusion and Future Work

In this thesis, we explored the use of syntactic information in machine translation of user-generated content and both syntactic and semantic information in its quality estimation. The user-generated content used in this thesis is from Symantec’s online Norton forums which contain discussions of Norton security products posted by Norton users and employees. Since the majority of the content is in English, the use of machine translation is needed to disseminate this content in other languages, so that a wider range of users can avail of the information and knowledge contained therein. However, a measure of confidence is required to assure that only high quality and legitimate translations are published. Therefore, a reliable estimation of translation quality is as crucial as a high quality translation. Much research has been dedicated to incorporating syntactic knowledge in the statistical machine translation process to account for problems such as long distance word order which cannot be tackled by methods merely modelling the translation of words or sequences of words called (ad-hoc) phrases. We analysed the output of these two spectra of machine translation methods, i.e. syntax-based and phrase-based, to find out whether the currently used syntax-based methods are able to better handle such problems. We also compared these methods to discover a systematic difference between them to be utilized in a combination framework. The quality of translation can be judged by its fluency and adequacy. While flu-

ency is related to its grammaticality, adequacy is concerned with the match between the semantics and pragmatics of the translation and its source. We investigated methods of using syntactic and semantic knowledge in estimating the translation quality via in-depth analyses. The semantic-based quality estimation involved attempts to improve the semantic role labelling of French due to the lack of sufficient SRL resources for this language. The syntactic and semantic analyses used in the quality estimation experiments in this thesis were obtained using tools built upon resources created from edited newswire text. To evaluate the performance drop due to this shift in domain and text type, we built two treebanks, one for English and one for French, taken from the Norton forum text and annotated using an approach tuned to address the phenomena specific to such unedited text. This annotation strategy enabled us to analyse the user errors present in this type of text. We additionally used these treebanks to supplement the training data of our parsers. The improved parses were ultimately used to replicate our semantic-based QE systems. In the next section, we summarise the experiments carried out in each chapter to answer our research questions and discuss the findings. Section 9.2 lists the contributions of the thesis. Finally, the last section discusses directions for future work.

9.1 Summary and Findings

In Chapter 2, we compared phrase-based, hierarchical phrase-based and syntax-based methods in translating Symantec translation memory content as well as user-generated Norton forum text. The first research question we addressed is as follows:

How different are the outputs generated by each of these methods?

To answer this question, we automatically compared the output of the translation systems built using these methods with a variety of widely used MT evaluation metrics, at the sentence and document level and on both 1-best and n-best results.

We found that different systems tend to generate different outputs for the same sentences, especially in more difficult translation tasks such as out-of-domain translation. Based on these differences, we asked the following research question:

Can the outputs be beneficially combined in theory?

Using sentence-level oracle combination, we found that a system selection over both the 1-best and 500-best outputs produced by these systems can lead to significant improvement. The gains from oracle combination were especially large for the 500-best outputs. Based on these results, we posed the third research question as follows:

Are any differences between the two types of systems systematic enough to be exploited in system combination?

In order to answer this question, we manually compared the hierarchical phrase- based and the string-to-tree syntax-based methods in terms of various lexical and grammatical translation phenomena. We found that the syntax-based methods did not perform particularly better in handling grammatical problems such as long- distance reordering despite what is generally assumed. The lack of a systematic pattern in the differences between these methods and their output renders it diffi- cult to develop a framework in which such differences can be exploited. We conjec- ture that the relaxation method used to loosen the syntactic constraints imposed during translation rule extraction to broaden the rule coverage (without which the performance of the syntax-based methods is considerably low) blurs the boundaries between ad-hoc and syntactic phrases, and, consequently, between the phrase-based and syntax-based methods. In Chapter3, we introduced SymForum, a data set we built for the experiments on quality estimation of machine translation of Norton forum text, containing the machine translation of English forum sentences to French, their human post-edits as well as adequacy and fluency scores. Analysing this data set showed that there is a high correlation between adequacy and fluency scores and even higher correlation between the human-targeted metric scores. We additionally created another data

set with sentences selected from the same domain as the parsers' training data, i.e. newswire. The purpose of this data set was to further validate the effectiveness of our syntax-based QE methods in the absence of the noise resulting from out-of-domain application of the parsers to Norton forum text, which could influence the conclusions. Using these two data sets, we built and experimented with syntax-based QE systems in Chapter 4. The first research question we tackled is as follows:

How effective is syntactic information in quality estimation of machine translation both in comparison and in combination with other surface-driven features?

To find the answer, we experimented with two different methods of encoding syntactic information derived from both constituency and dependency parses of the source and target, namely tree kernels and hand-crafted features. The tree kernel systems performed better than the hand-crafted features, eliminating the time required for engineering such features. We built a full syntax-based quality estimation system by combining the tree kernels and hand-crafted features. This system was able to outperform the baseline used in the recent WMT QE shared tasks, and was also successfully combined with the baseline. The syntax-based systems showed a better performance on the News data set than on the SymForum data set, however – possibly because this data is well-formed and more homogeneous, leading to more consistent parses across the data set. We analysed in detail the role of syntax in quality estimation in Chapter 5. The first research question we asked is as follows:

Does parsing accuracy affect the performance of syntax-based QE?
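Parsing accuracy in the experiments below is measured with PARSEVAL F1. As a reminder of what that measures, here is a minimal sketch of labelled bracketing precision, recall and F1 over constituent spans; it assumes trees have already been flattened into labelled spans and is a simplified stand-in for standard PARSEVAL evaluation tools rather than the tool used in the thesis.

    from collections import Counter
    from typing import Iterable, Tuple

    Span = Tuple[str, int, int]     # (constituent label, start index, end index)

    def parseval_scores(gold: Iterable[Span], predicted: Iterable[Span]):
        """Labelled bracketing precision, recall and F1 over constituent spans.
        Multisets are used so that repeated identical brackets (e.g. unary
        chains with the same label) are counted once per occurrence."""
        gold_c, pred_c = Counter(gold), Counter(predicted)
        correct = sum((gold_c & pred_c).values())
        precision = correct / sum(pred_c.values()) if pred_c else 0.0
        recall = correct / sum(gold_c.values()) if gold_c else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1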

To answer this question, we rebuilt the syntax-based QE systems using parse trees produced by parsers trained on only a fraction of the data used by the original parsers, leading to much lower parsing accuracies as measured by PARSEVAL F1. Interestingly, the new QE systems performed at the same level as the systems built

using the original high-accuracy parses, showing that parsing accuracy does not affect syntax-based QE performance. In addition, by teasing apart the roles played by the source and target syntax, we sought the answer to the following research question:

To what extent do the source and target syntax each contribute to the syntax-based QE performance?

We found that the French constituency parses of the target were considerably less useful than the English ones on the source side. This observation raised the following research question:

Does parsing of noisy machine translation output affect the performance of syntax-based QE?

To find the answer, we reversed the translation direction under a similar setting and replicated the experiments. The results showed that the inferiority of the French parse trees was not due to the fact that they were parses of potentially ill-formed machine translation output. We hypothesised that the flatter structure of the French Treebank trees used to train the French parsers was responsible for this performance gap. We verified this hypothesis by introducing a set of heuristics which added more structure to the French parse trees. The parse trees modified by these heuristics proved able to significantly boost the performance of the QE system built upon them. However, their effect proved to be dependent on the data set used, as we observed only a small improvement with the SymForum data set.

In Chapter 6, we explored various ways to reach an optimal solution for semantic role labelling of French, to compensate for the shortage of appropriate resources for French SRL. We first used a large artificially generated corpus of French SRL annotation, obtained via projection of English SRL over the Europarl parallel corpus through word alignments. We first tried to answer the following research question:

How much artificial data is needed to train an SRL system?
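This question is answered below with learning curves, i.e. training on growing fractions of the data and tracking the resulting score. A minimal sketch of the procedure, in which train_fn and eval_fn are hypothetical hooks around whatever SRL system is being trained rather than the actual training pipeline used in the thesis:

    import random
    from typing import Callable, List, Sequence, Tuple

    def learning_curve(train_data: Sequence, dev_data: Sequence,
                       train_fn: Callable, eval_fn: Callable,
                       fractions: Sequence[float] = (0.01, 0.05, 0.1, 0.25, 0.5, 1.0),
                       seed: int = 0) -> List[Tuple[int, float]]:
        """Train on growing fractions of the (projected) annotations and record
        the dev-set score at each point."""
        shuffled = list(train_data)
        random.Random(seed).shuffle(shuffled)
        curve = []
        for frac in fractions:
            subset = shuffled[:max(1, int(frac * len(shuffled)))]
            model = train_fn(subset)
            curve.append((len(subset), eval_fn(model, dev_data)))
        return curve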

The learning curves showed that only a small fraction of this data was as effective as a much larger set in training an SRL system. We then set out to find a better projected annotation by answering the following research question:

Is there a way to improve the projected annotation?

We first used only the direct translations of the Europarl corpus for projection, to reduce the adverse effect of translation shifts. However, the resulting projected annotations did not prove to be any more useful for training. Similarly, replacing the syntactic annotation of the projected SRL with universal POS tags and dependency labels did not significantly affect the results. This, nevertheless, suggested that such annotations are interchangeable in the context of semantic role labelling. Moreover, we observed that the intersection of the word alignments of the two translation directions, which is commonly used in annotation projection to reduce noise, was too restrictive. We tried the union of these alignments as well as only the source-to-target alignments, leading to a significant increase in recall (and consequently F-score), both when used to create training data and when used in direct projection.

Despite these improvements, the resulting performance did not seem to be sufficient for the purpose of being used in semantic-based QE. We therefore put forth another research question:

Is a large set of this artificial data better than a small set of hand-annotated data for training an SRL system?

To answer this question, we used the available hand-annotated data set of 1K sentences to train an SRL system and compared it to the best-performing system trained on the projected annotations. The resulting system significantly outperformed the one built using the large set of synthetically generated data. We observed the same behaviour even when only a fraction of this small data set was used for training, showing that hand-annotated data better suits this purpose, regardless of its quantity.
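As context for the projection experiments summarised above, the following minimal sketch shows how role labels might be carried from a source sentence to its translation through word alignments, and how the intersection, union and source-to-target alignments referred to earlier differ. The data structures and function names are illustrative, not those of the projection pipeline used to create the projected annotations.

    from typing import Dict, Set, Tuple

    Link = Tuple[int, int]          # (source token index, target token index)

    def combine_alignments(src_to_tgt: Set[Link], tgt_to_src: Set[Link],
                           mode: str = "union") -> Set[Link]:
        """Symmetrise the two directional word alignments.  `tgt_to_src` links
        are given as (target, source) pairs and are flipped before combining."""
        flipped = {(s, t) for (t, s) in tgt_to_src}
        if mode == "intersection":
            return src_to_tgt & flipped
        if mode == "union":
            return src_to_tgt | flipped
        if mode == "src_to_tgt":
            return set(src_to_tgt)
        raise ValueError(f"unknown mode: {mode}")

    def project_roles(source_roles: Dict[int, str],
                      alignment: Set[Link]) -> Dict[int, str]:
        """Copy each source token's role label onto the target tokens it is
        aligned to.  Real projection schemes add many filters (span consistency,
        predicate checks, ...); only the core label transfer is shown here."""
        projected: Dict[int, str] = {}
        for src, tgt in sorted(alignment):
            if src in source_roles:
                projected[tgt] = source_roles[src]
        return projected

The intersection keeps only links agreed upon by both directions, which is precise but sparse; the union and the source-to-target alignments keep more links, which is what increased the recall of the projected annotations.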

In Chapter 7, we investigated the use of semantic role labelling in translation quality estimation from various perspectives. We first addressed the following research question:

What is the most effective method of incorporating this semantic knowledge in QE?

To find an answer to this question, we first used tree kernels to build a semantic-based QE system and examined various formats for encoding the semantic predicate-argument structure in trees to be used by this learning mechanism. The semantic information was most useful when used to augment the syntactic trees. We also designed a set of hand-crafted features extracted from the semantic role labelling of the source and target. As with the syntax-based QE systems, the tree kernels outperformed the hand-crafted features. Moreover, we introduced PAM, a metric for estimating translation quality which uses the predicate-argument structure match between the source and target, measured by different means including word alignments and translation tables, with word alignments proving to be more useful. Although the PAM scores showed a higher correlation with translation quality than the semantic hand-crafted features, we found them more useful when used as such features themselves. However, they did not appear to be a reliable estimator of translation quality. These observations shed light on the following research question:

To what extent does the semantic predicate-argument structure match between source and target represent the translation quality?
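To make the idea behind such a match concrete, here is a minimal sketch of a predicate-argument match score computed through word alignments. It illustrates the general idea only and is not PAM's exact formulation; for simplicity it assumes one argument per role label per sentence, and the function name is invented for this sketch.

    from typing import Dict, Set, Tuple

    def pam_like_scores(source_args: Dict[str, Set[int]],
                        target_args: Dict[str, Set[int]],
                        alignment: Set[Tuple[int, int]]):
        """Precision/recall/F1 of a predicate-argument structure match: a role
        counts as matched if the source argument, projected through the word
        alignment, overlaps the target argument carrying the same role label."""
        def matched(role: str) -> bool:
            src_tokens = source_args.get(role, set())
            tgt_tokens = target_args.get(role, set())
            projected = {t for (s, t) in alignment if s in src_tokens}
            return bool(projected & tgt_tokens)

        hits = {role for role in source_args.keys() & target_args.keys()
                if matched(role)}
        precision = len(hits) / len(target_args) if target_args else 0.0
        recall = len(hits) / len(source_args) if source_args else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1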

With various methods of encoding semantic information in QE in hand, we combined them to answer the following research question:

How effective is semantic role labelling, in general, in quality estimation of machine translation both in comparison and in combination with other surface-driven features as well as the syntactic information?

The combination of the hand-crafted features and the tree kernels, as our fully semantic-based QE system, could outperform the WMT baseline in predicting HTER

and fluency scores, but not HBLEU and adequacy. Nor could the combination of the semantic-based and syntax-based QE systems outperform the baseline for these metrics. However, the combination of the resulting system with the baseline features was successful. The best-performing QE system built in this thesis was the combination of the baseline features with the semantic-based system, which included the hand-crafted semantic features and semantically augmented syntactic tree kernels.

Focusing on the PAM metric as an estimator of translation quality, we performed a set of manual error analyses to answer the next research question:

What are the factors hindering the performance of semantic-based QE?

These analyses showed that the quality of semantic role labelling has to be improved, especially on the target side, where the French SRL is employed. The word alignments, on the other hand, did not appear to be a major problem for the performance of the PAM metric. To sum up, we believe that the low quality of semantic role labelling, as well as the imbalance between the quality of the source and target SRL (due to the significantly lower performance of the French SRL compared to English), hinders the performance of the semantic-based QE system.

In Chapter 8, we created two treebanks by annotating sentences taken from English and French Norton forum text with constituency structure. The aim of building these treebanks was to study the syntactic characteristics of Norton forum text. For this purpose, we developed an annotation strategy which marked user errors on the parse trees, which enabled us to analyse the user error rate in this text. Specifically, we were able to find the answer to the following research question:

How noisy is the user-generated content of the Norton forum text?

We found that such errors occur in only a small fraction of the data and that the majority of them are capitalisation errors due to frequent product names and other proper nouns in the data. Missing punctuation appeared to be the second most frequent category of error. Once the level of user errors in the data was revealed, we asked the following research question:

To what extent do user errors in the forum text affect its parse quality?

To answer this question, we used the error marking to extract the correct form of the sentences. We then parsed the corrected sentences and found a modest improvement in parsing performance. Although small in absolute terms, this improvement was noticeable relative to the limited extent of user error in the data, showing that user errors do play a role in the deterioration of parse quality.

The parsers used to parse the SymForum data in the QE experiments were trained on edited newswire text, which is thus out-of-domain and out-of-style with respect to Norton forum text. This raises the following research question:

How noisy is out-of-domain parsing of the Norton forum text?

To find the extent of noise in the parses of Norton forum text produced using parsers trained on newswire, we evaluated them on the Foreebanks. We observed a drastic performance drop on both the English and French Foreebank. Given the relatively small effect of user errors on parsing performance found earlier, it seems that the main factors contributing to the performance degradation when moving from newswire to Norton forum text are problems such as unknown words and out-of-domain syntactic constructions such as questions and imperatives.

The low quality of Norton forum parsing is likely to affect the quality of its semantic role labelling, which in turn can hurt the performance of semantic-based QE, as we observed in Chapter 7. The last research question we address in this thesis is concerned with this phenomenon:

How effectively can we adapt our parsers to the Norton forum text, both intrinsically and in terms of the accuracy of semantic-based QE which uses semantic role labels from the new syntactic parse trees?

To answer this question, we first used the Foreebanks to adapt the parsers to the Norton forum text by providing them as supplementary training data on top of the parsers’ original training data. The augmented training data proved to be useful, as all the adapted parsers outperformed the original ones. We then applied the

improved parsers to the SymForum data and re-ran semantic role labelling on the new parses, in order to rebuild the semantic-based QE system and the PAM metric. While the accuracy of the metric and of the system in predicting most of the evaluation metrics improved, only some of the changes were statistically significant. Considering the modest increase in parsing accuracy which led to these improvements, we anticipate that using more sophisticated adaptation approaches will result in further improvements in semantic-based quality estimation.

9.2 Contributions

In this section, we briefly describe the contributions of this thesis as follows:

• We automatically compared the output of five different machine translation methods and found that these systems generate sufficiently different, though not systematically different, outputs for the same sentence that the combination of those systems is substantially better than each individual system.

• We also manually compared the output of a hierarchical phrase-based and a string-to-tree syntax-based SMT system and found that the syntax-based translation method used by this system showed no advantage in handling syntactic phenomena in translation such as long-distance reordering.

• We created and publicly released SymForum, a data set for quality estimation of machine-translated forum text, which contains 4,500 machine-translated sentence pairs, post-edited and manually evaluated in terms of fluency and adequacy.

• We introduced a set of hand-crafted quality estimation features extracted from constituency and dependency parse trees.

• We built a syntax-based quality estimation system which could significantly outperform a well-known baseline system when used with newswire data. This system combined successfully with that baseline system across the board when

used with the newswire data and for some quality metrics when used with forum content.

• We investigated different ways of building a semantic role labelling system for French and found that a very small set of manually annotated sentences was substantially more useful than a huge set of sentences labelled synthetically using a commonly used projection method.

• We also found that the intersection of the word alignments in the two translation directions, which is commonly used in annotation projection to reduce the noise level, is too restrictive and leads to a poorer projection quality than union or source-to-target alignments.

• We introduced a set of hand-crafted features extracted from the semantic role labelling of the source and target to be used in quality estimation. We also examined various ways of encoding the semantic role labelling in constituency and dependency trees to be used in tree kernel-based quality estimation.

• We introduced PAM, a new metric for quality estimation, which uses the predicate-argument structure match between a source sentence and its translation as a measure of translation quality. Our analysis showed that semantic role labelling errors, especially in the target, are the main factor negatively affecting PAM accuracy.

• We built a semantic-based QE system which could outperform the baseline in predicting HTER and fluency scores. The combination of this system with the baseline led to our best-performing QE system.

• We built two treebanks named Foreebank, each containing 1,000 sentences selected from English and French Norton forum text and manually annotated with their phrase-structure syntax. We adopted an annotation strategy which accounts for the particularities of forum text such as user errors and text styling.

• The Foreebank annotation strategy enabled us to analyse the type and level of user error in Norton forum text and its effect on parsing, finding that such

errors occur in only a small fraction of the text, but correcting these few errors can noticeably improve the parsing performance.

• We successfully used the Foreebanks, despite their small sizes, to supplement the original training data of the English and French parsers as a way of adapting them to this text domain and type.

• We found that the improved parses of the SymForum QE data produced by the adapted parsers can improve the semantic-based QE systems and the PAM metric, but further improvement in SRL, especially on the French side in order to establish a balance with the English side, is required to observe a substantial improvement in QE.

9.3 Future Work

The work presented in this thesis could be extended in several directions. We used the syntax-based translation methods implemented in the Moses toolkit in Chapter 2. Other methods, such as forest-based translation (Mi et al., 2008) or fuzzy use of syntax (Chiang, 2010; Zhang et al., 2011), or tools such as the Joshua decoder (Post et al., 2013), could also be experimented with in comparing syntax-based and phrase-based methods. It would also be interesting to examine the effect of parsing accuracy on the quality of syntax-based translation methods, as we did for quality estimation, since Neubig and Duh (2014) find that parser accuracy is important in their tree-to-string translation setting. The availability of Foreebank can be a further help for such experiments. As to the combination of these two translation methods, one could investigate the use of quality estimation in selecting the best translation from the output of the combined systems. Of course, the success of the combination will be strongly dependent on the reliability of the QE system.

In building the SymForum data set, we used three human evaluators to judge the translations, which enabled us not only to reduce the degree of subjectivity and increase the reliability of the scores by averaging them, but also to compute

the agreement between the human evaluators on judging the translation quality. However, we only used one post-edit per translation to compute the human-targeted scores. It would be useful to know the extent of post-editor agreement if more than one post-editor were employed for this purpose.

We combined the syntax- and semantic-based quality estimation systems built in this thesis with the baseline features of the WMT QE shared task. There has been considerable progress in extracting new non-syntactic features since the introduction of these features (Rubino et al., 2013). The complementarity of such new features with the syntax- and semantic-based QE systems of this thesis could also be examined. In addition, applying these syntax- and semantic-based QE methods, as well as the FTB modification heuristics, to other data sets would shed additional light on the generalisability of the conclusions made in this thesis.

Our semantic-based quality estimation was based on semantic role labelling built on the dependency parses of the data. Therefore, only the translation of the head words of the syntactic constituents was taken into account in quality estimation. We used heuristics to transfer role labels to the constituent level, since constituency-based semantic role labelling is currently not possible due to the lack of resources for French. Therefore, using constituency-based semantic role labelling upon its availability is a reasonable next step for semantic-based QE. Furthermore, we were only able to use the verbal predicates for the same reason. Annotating the French SRL data set with nominal predicates would make it possible to use them in the semantic role labelling of both sides, which should in turn improve the performance of the QE systems. Finally, the semantic information used in this work was limited to the shallow predicate-argument representation, which accounts for only part of the meaning of the sentence. The use of other types of semantic analysis in quality estimation, such as lexical semantics, can also be investigated.

The main factor hindering the performance of semantic-based QE was shown to be the low quality of semantic role labelling, mainly on the target side, which also causes an imbalance between the two sides. The scarcity of French SRL resources is

the major obstacle to achieving a good-quality labelling. Therefore, increasing the size of the available data set would appear to be a worthwhile effort.

We used a very simple method to adapt the parsers to Norton forum text. Using more sophisticated methods such as semi-supervised learning and system combination (Le Roux et al., 2012) is an avenue for further research. One aspect of parser adaptation is to improve the POS tagging, as it has been shown to be an important issue in parsing web text (Petrov and McDonald, 2012; Foster et al., 2011a). In addition, the annotation strategy of Foreebank marks the user errors on the parse trees. These annotations could also be useful in correcting user errors in this type of text.

Bibliography

Abeillé, A., Clément, L., and Toussenel, F. (2003). Building a Treebank for French. In Treebanks: Building and Using Syntactically Annotated Corpora, pages 165–187. Kluwer Academic Publishers.

Al-Qinai, J. (2000). Translation quality assessment. Strategies, parameters and procedures. Meta, XLV(3):497–519.

Allen, J. (2003). Post-editing. In Somers, H., editor, Computers and Translation: A Translator’s Guide, pages 297–317. John Benjamins Publishing Company.

Alumäe, T. and Kurimo, M. (2010). Domain adaptation of maximum entropy language models. In Proceedings of ACL: Short Papers, pages 301–306.

Attia, M., Foster, J., Hogan, D., Roux, J. L., Tounsi, L., and van Genabith, J. (2010). Handling Unknown Words in Statistical Latent-Variable Parsing Models for Arabic, English and French. In Proceedings of the 1st Workshop on SPMRL, pages 67–75.

Auli, M., Lopez, A., Hoang, H., and Koehn, P. (2009). A Systematic Analysis of Translation Model Search Spaces. In Proceedings of WMT, pages 224–232.

Avramidis, E. (2012). Quality estimation for Machine Translation output using linguistic analysis and decoding features. In Proceedings of WMT, pages 84–90.

Baker, C. F., Fillmore, C. J., and Lowe, J. B. (1998). The Berkeley FrameNet Project. In Proceedings of ACL, pages 86–90.

Banerjee, P. (2013). Domain Adaptation for Statistical Machine Translation of Corporate and User-Generated Content. PhD thesis, Dublin City University.

Banerjee, P., Naskar, S. K., Roturier, J., Way, A., and van Genabith, J. (2012). Domain Adaptation in SMT of User-Generated Forum Content Guided by OOV Word Reduction: Normalization and/or Supplementary Data? In Proceedings of EAMT.

Bicici, E. and Way, A. (2014). Referential Translation Machines for Predicting Translation Quality. In Proceedings of WMT, pages 313–321.

Bies, A., Ferguson, M., Katz, K., MacIntyre, R., Tredinnick, V., Kim, G., Marcinkiewicz, M. A., and Schasberger, B. (1995). Bracketing Guidelines for Treebank II Style Penn Treebank Project. Technical report, University of Pennsylvania.

Björkelund, A., Hafdell, L., and Nugues, P. (2009). Multilingual Semantic Role Labeling. In Proceedings of CoNLL: Shared Task, pages 43–48.

Blatz, J., Fitzgerald, E., Foster, G., Gandrabur, S., Goutte, C., Kulesza, A., Sanchis, A., and Ueffing, N. (2004). Confidence Estimation for Machine Translation. In Proceedings of COLING.

Blitzer, J., McDonald, R., and Pereira, F. (2006). Domain Adaptation with Structural Correspondence Learning. In Proceedings of EMNLP, pages 120–128.

Bod, R. (2007). Is the End of Supervised Parsing in Sight? In Proceedings of ACL, pages 400–407.

Bojar, O., Buck, C., Callison-Burch, C., Federmann, C., Haddow, B., Koehn, P., Monz, C., Post, M., Soricut, R., and Specia, L. (2013). Findings of the 2013 workshop on statistical machine translation. In Proceedings of WMT, pages 1–44.

Bojar, O., Buck, C., Federmann, C., Haddow, B., Koehn, P., Leveling, J., Monz, C., Pecina, P., Post, M., Saint-Amand, H., Soricut, R., Specia, L., and Tamchyna, A. (2014). Findings of the 2014 Workshop on Statistical Machine Translation. In Proceedings of WMT, pages 12–58.

Bojar, O. and Wu, D. (2012). Towards a Predicate-argument Evaluation for MT. In Proceedings of SSST-6, pages 30–38.

Brants, S., Dipper, S., Hansen, S., Lezius, W., and Smith, G. (2002). The TIGER Treebank. In Proceedings of the Workshop on Treebanks and Linguistic Theories.

Brown, P., Cocke, J., Pietra, S. D., Pietra, V. D., Jelinek, F., Mercer, R., and Roossin, P. (1988). A Statistical Approach to Language Translation. In Proceedings of the 12th Conference on Computational Linguistics - Volume 1, pages 71–76.

Butt, M., Dyvik, H., King, T. H., Masuichi, H., and Rohrer, C. (2002). The parallel grammar project. In Proceedings of the 2002 workshop on Grammar engineering and evaluation-Volume 15, pages 1–7.

Callison-Burch, C., Fordyce, C., Koehn, P., Monz, C., and Schroeder, J. (2007). (Meta-) Evaluation of Machine Translation. In Proceedings of WMT, pages 136–158.

Callison-Burch, C., Koehn, P., Monz, C., Post, M., Soricut, R., and Specia, L. (2012). Findings of the 2012 Workshop on Statistical Machine Translation. In Proceedings of WMT, pages 10–51.

Candito, M., Crabbé, B., and Denis, P. (2010). Statistical French dependency parsing: treebank conversion and first results. In Proceedings of LREC’2010, pages 1840–1847.

Candito, M. and Crabbé, B. (2009). Improving Generative Statistical Parsing with Semi-supervised Word Clustering. In Proceedings of IWPT, pages 138–141.

Carreras, X. and Collins, M. (2009). Non-Projective Parsing for Statistical Machine Translation. In Proceedings of EMNLP, pages 200–209.

Charniak, E. (2001). Immediate-head Parsing for Language Models. In Proceedings of ACL, pages 124–131.

Charniak, E. and Johnson, M. (2005). Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proceedings of ACL, pages 173–180.

Chiang, D. (2007). Hierarchical Phrase-based Translation. Computational Linguistics, 33(2):201–228.

Chiang, D. (2010). Learning to Translate with Source and Target Syntax. In Proceedings of ACL, pages 1443–1452.

Collins, M. (1997). Three Generative, Lexicalised Models for Statistical Parsing. In Proceedings of ACL-EACL, pages 16–23.

Collins, M. (1999). Head-driven Statistical Models for Natural Language Parsing. PhD thesis, University of Pennsylvania.

Collins, M. and Duffy, N. (2002). New Ranking Algorithms for Parsing and Tagging: Kernels over Discrete Structures, and the Voted Perceptron. In Proceedings of ACL, pages 263–270.

Daumé III, H., Kumar, A., and Saha, A. (2010). Frustratingly Easy Semi-supervised Domain Adaptation. In Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing, pages 53–59.

Daumé III, H. and Marcu, D. (2006). Domain Adaptation for Statistical Classifiers. Journal of Artificial Intelligence Research (JAIR), 26:101–126.

De Almeida, G. and O’Brien, S. (2010). Analysing post-editing performance: correlations with years of translation experience. In Proceedings of EAMT.

de Marneffe, M.-C. and Manning, C. D. (2008). The Stanford typed dependencies representation. In Proceedings of the COLING Workshop on Cross-Framework and Cross-Domain Parser Evaluation, pages 1–8.

DeNeefe, S., Knight, K., Wang, W., and Marcu, D. (2007). What can syntax-based MT learn from phrase-based MT? In Proceedings of EMNLP-CoNLL, pages 755–763.

Denis, P. and Sagot, B. (2012). Coupling an Annotated Corpus and a Lexicon for State-of-the-art POS Tagging. Language Resources and Evaluation, 46(4):721–736.

Denkowski, M. and Lavie, A. (2011). Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems. In Proceedings of WMT, pages 85–91.

Dugast, L., Senellart, J., and Koehn, P. (2007). Statistical post-editing on SYSTRAN’s rule-based translation system. In Proceedings of WMT, pages 220–223.

Efron, B. and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman & Hall.

Finkel, J. R. and Manning, C. D. (2009). Hierarchical bayesian domain adaptation. In Proceedings of NAACL, pages 602–610.

Foster, J. (2010). cba to check the spelling investigating parser performance on discussion forum posts. In Proceedings of NAACL, pages 381–384.

Foster, J., Çetinoğlu, Ö., Wagner, J., Roux, J. L., Nivre, J., Hogan, D., and van Genabith, J. (2011a). From News to Comment: Benchmarks and Resources for Parsing the Language of Web 2.0. In Proceedings of IJCNLP, pages 893–901.

Foster, J., Çetinoğlu, Ö., Wagner, J., and van Genabith, J. (2011b). Comparing the use of edited and unedited text in parser self-training. In Proceedings of IWPT, pages 215–219.

Foster, J. and van Genabith, J. (2008). Parser Evaluation and the BNC: Evaluating 4 constituency parsers with 3 metrics. In Proceedings LREC.

Foster, J., Wagner, J., and Van Genabith, J. (2008). Adapting a WSJ-trained parser to grammatically noisy text. In Proceedings of ACL: Short Papers, pages 221–224.

Fox, H. J. (2002). Phrasal cohesion and statistical machine translation. In Proceedings of EMNLP, pages 304–311.

Galley, M., Hopkins, M., Knight, K., and Marcu, D. (2004). What’s in a translation rule? In Proceedings of HLT-NAACL.

Gamon, M., Aue, A., and Smets, M. (2005). Sentence-Level MT evaluation without reference translations: beyond language modeling. In Proceedings of EAMT, pages 103–111.

Gandrabur, S. and Foster, G. (2003). Confidence estimation for translation prediction. In Proceedings of CoNLL, pages 95–102.

Gardent, C. and Cerisara, C. (2010). Semi-Automatic Propbanking for French. In TLT9 - The Ninth International Workshop on Treebanks and Linguistic Theories, pages 67–78.

Gildea, D. (2001). Corpus Variation and Parser Performance. In Proceedings of EMNLP, pages 167–202.

Gildea, D. and Jurafsky, D. (2002). Automatic Labelling of Semantic Roles. Computational Linguistics, 28(3):245–288.

Giménez, J. and Màrquez, L. (2007). Linguistic Features for Automatic Evaluation of Heterogenous MT Systems. In Proceedings of WMT, pages 256–264.

Giménez, J. and Màrquez, L. (2010). Asiya: An Open Toolkit for Automatic Machine Translation (Meta-)Evaluation. Prague Bull. Math. Linguistics, 94:77–86.

Goto, I., Utiyama, M., Onishi, T., and Sumita, E. (2011). A Comparison Study of Parsers for Patent Machine Translation. In Proceedings of MT Summit, pages 46–54.

Hajič, J., Ciaramita, M., Johansson, R., Kawahara, D., Martí, M. A., Màrquez, L., Meyers, A., Nivre, J., Padó, S., Štěpánek, J., Straňák, P., Surdeanu, M., Xue, N., and Zhang, Y. (2009). The CoNLL-2009 Shared Task: Syntactic and Semantic Dependencies in Multiple Languages. In Proceedings of CoNLL: Shared Task, pages 1–18.

Hamon, O., Hartley, A., Popescu-Belis, A., and Choukri, K. (2007). Assessing human and automated quality judgments in the French MT evaluation campaign CESTA. In Proceedings of the MT Summit XI, pages 231–238.

Hardmeier, C., Nivre, J., and Tiedemann, J. (2012). Tree Kernels for Machine Translation Quality Estimation. In Proceedings of WMT, pages 109–113.

Heafield, K., Koehn, P., and Lavie, A. (2013). Grouping Language Model Boundary Words to Speed K–Best Extraction from Hypergraphs. In Proceedings of NAACL- HLT, pages 958–968.

Heidorn, G. (2000). Intelligent writing assistance. Handbook of Natural Language Processing, pages 181–207.

Hemphill, C. T., Godfrey, J. J., and Doddington, G. R. (1990). The ATIS Spoken Language Systems Pilot Corpus. In Proceedings of the Workshop on Speech and Natural Language, pages 96–101.

Hoang, H., Koehn, P., and Lopez, A. (2009). A unified framework for phrase-based, hierarchical, and syntax-based statistical machine translation. In Proceedings of IWSLT, pages 152–159.

Huang, F. and Papineni, K. (2007). Hierarchical System Combination for Machine Translation. In Proceedings of EMNLP-CoNLL, pages 277–286.

Huang, L. and Chiang, D. (2007). Forest Rescoring: Faster Decoding with Integrated Language Models. In Proceedings of ACL, pages 144–151.

Huang, L., Knight, K., and Joshi, A. (2006). Statistical syntax-directed translation with extended domain of locality. In Proceedings of AMTA, pages 66–73.

Huang, Z. and Harper, M. (2009). Self-training PCFG Grammars with Latent Annotations Across Languages. In Proceedings of EMNLP.

Hwa, R., Resnik, P., Weinberg, A., Cabezas, C., and Kolak, O. (2005). Bootstrapping parsers via syntactic projection across parallel texts. Natural Language Engineering, 11(03):311–325.

Johansson, R. and Nugues, P. (2007). Extended Constituent-to-dependency conversion for English. In Proceedings of NODALIDA, pages 105–112.

Judge, J., Cahill, A., and Van Genabith, J. (2006). Questionbank: Creating a corpus of parse-annotated questions. In Proceedings of COLING-ACL, pages 497–504.

Kaljahi, R. S. Z., Foster, J., and Roturier, J. (2014a). Semantic Role Labelling with minimal resources: Experiments with French. In Proceedings of *SEM, pages 87–92.

Kaljahi, R. S. Z., Foster, J., and Roturier, J. (2014b). Syntax and Semantics in Quality Estimation of Machine Translation. In Proceedings of SSST-8, pages 67–77.

Kaljahi, R. S. Z., Foster, J., Rubino, R., and Roturier, J. (2014c). Quality Estimation of English-French Machine Translation: A Detailed Study of the Role of Syntax. In Proceedings of COLING, pages 2052–2063.

Kaljahi, R. S. Z., Foster, J., Rubino, R., Roturier, J., and Hollowood, F. (2013). Parser Accuracy in Quality Estimation of Machine Translation: A Tree Kernel Approach. In Proceedings of IJCNLP, pages 1092–1096.

Kaljahi, R. S. Z., Rubino, R., Roturier, J., and Foster, J. (2012). A detailed analysis of phrase-based and syntax-based machine translation: the search for systematic differences. In Proceedings of AMTA.

Klein, D. and Manning, C. D. (2003). Accurate Unlexicalized Parsing. In Proceedings of ACL, pages 423–430.

Koehn, P. (2005). Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of MT Summit, pages 79–86.

Koehn, P., Och, F. J., and Marcu, D. (2003). Statistical Phrase-based Translation. In Proceedings of NAACL-HLT, pages 48–54.

Kučera, H. and Francis, W. N. (1967). Computational Analysis of Present-Day American English. Brown University Press.

Kupść, A. and Abeillé, A. (2008). Growing TreeLex. In Proceedings of CICLing, pages 28–39.

LDC (2002). Linguistic Data Annotation Specification: Assessment of Fluency and Adequacy in Chinese-English Translations. Technical report.

Le Roux, J., Foster, J., Wagner, J., Kaljahi, R. S. Z., and Bryl, A. (2012). DCU-Paris13 Systems for the SANCL 2012 Shared Task. In Working Notes of SANCL, pages 1–4.

Li, X. and Roth, D. (2002). Learning question classifiers. In Proceedings of COLING, pages 1–7.

Liu, D. and Gildea, D. (2005). Syntactic Features for Evaluation of Machine Translation. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 25–32.

Liu, D. and Gildea, D. (2010). Semantic role features for machine translation. In Proceedings of COLING, pages 716–724.

Lo, C.-k., Beloucif, M., Saers, M., and Wu, D. (2014). XMEANT: Better semantic MT evaluation without reference translations. In Proceedings of ACL: Short Papers, pages 765–771.

Lo, C.-k., Tumuluru, A. K., and Wu, D. (2012). Fully Automatic Semantic MT Evaluation. In Proceedings of WMT, pages 243–252.

Lo, C.-k. and Wu, D. (2011). MEANT: An Inexpensive, High-accuracy, Semi-automatic Metric for Evaluating Translation Utility via Semantic Frames. In Proceedings of ACL, pages 220–229.

Lo, C.-k. and Wu, D. (2013). MEANT at WMT 2013: A tunable, accurate yet inexpensive semantic frame based MT evaluation metric. In Proceedings of WMT, pages 422–428.

Lorenzo, A. and Cerisara, C. (2012). Unsupervised frame based Semantic Role Induction: application to French and English. In Proceedings of the ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages, pages 30–35.

Magerman, D. M. (1995). Statistical Decision-tree Models for Parsing. In Proceedings of ACL, pages 276–283.

Marcu, D., Wang, W., Echihabi, A., and Knight, K. (2006). SPMT: Statistical Machine Translation with Syntactified Target Language Phrases. In Proceedings of EMNLP, pages 44–52.

Marcus, M., Kim, G., Marcinkiewicz, M. A., MacIntyre, R., Bies, A., Ferguson, M., Katz, K., and Schasberger, B. (1994). The Penn Treebank: Annotating Predicate Argument Structure. In Proceedings of the 1994 ARPA Speech and Natural Language Workshop, pages 114–119.

Marcus, M. P., Santorini, B., and Marcinkiewicz, M. A. (1993). Building a Large Annotated Corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330.

Màrquez, L., Carreras, X., Litkowski, K. C., and Stevenson, S. (2008). Semantic Role Labeling: An Introduction to the Special Issue. Computational Linguistics, 34(2):145–159.

Maxwell, J. and Kaplan, R. (1996). Unification-based parsers that automatically take advantage of context freeness. In LFG96 Conference, Grenoble, France. Ms. Xerox PARC.

McClosky, D. (2010). Any Domain Parsing: Automatic Domain Adaptation for Natural Language Parsing. PhD thesis, Brown University.

McClosky, D., Charniak, E., and Johnson, M. (2006). Effective Self-training for Parsing. In Proceedings of HLT-NAACL, pages 152–159.

McClosky, D., Charniak, E., and Johnson, M. (2010). Automatic Domain Adaptation for Parsing. In Proceedings of HLT-NAACL, pages 28–36.

McDonald, R., Lerman, K., and Pereira, F. (2006). Multilingual dependency analysis with a two-stage discriminative parser. In Proceedings of CoNLL, pages 216–220.

McDonald, R., Nivre, J., Quirmbach-Brundage, Y., Goldberg, Y., Das, D., Ganchev, K., Hall, K., Petrov, S., Zhang, H., Täckström, O., Bedini, C., Bertomeu Castelló, N., and Lee, J. (2013). Universal Dependency Annotation for Multilingual Parsing. In Proceedings of ACL: Short Papers, pages 92–97.

Mi, H., Huang, L., and Liu, Q. (2008). Forest-Based Translation. In Proceedings of ACL-08: HLT, pages 192–199.

Miyao, Y., Saetre, R., Sagae, K., Matsuzaki, T., and Tsujii, J. (2008). Task-oriented Evaluation of Syntactic Parsers and their Representations. In Proceedings of ACL, pages 46–54.

Mollà, D. and Hutchinson, B. (2003). Intrinsic versus Extrinsic Evaluation of Parsing Systems. In Proceedings of EACL, pages 43–50.

Moschitti, A. (2006). Making Tree Kernels practical for Natural Language Learning. In Proceedings of EACL, pages 113–120.

Moschitti, A., Pighin, D., and Basili, R. (2006). Tree kernel engineering for proposition re-ranking. In Proceedings of Mining and Learning with Graphs (MLG), pages 165–172.

Mott, J., Bies, A., Laury, J., and Warner, C. (2012). Bracketing Webtext: An Addendum to Penn Treebank II Guidelines. Technical report, Linguistic Data Consortium.

Mott, J., Bies, A., Warner, C., and Taylor, A. (2009). Supplementary Guidelines for ETTB 2.0. Technical report, Linguistic Data Consortium.

Neubig, G. and Duh, K. (2014). On the Elements of an Accurate Tree-to-String Machine Translation System. In Proceedings of ACL, pages 143–149.

Nivre, J., Hall, J., and Nilsson, J. (2006). MaltParser: A Data-Driven Parser-Generator for Dependency Parsing. In Proceedings of LREC, pages 2216–2219.

Och, F. (2003). Minimum Error Rate Training for Statistical Machine Translation. In Proceedings of ACL, pages 160–167.

Och, F. and Ney, H. (2004). The alignment template approach to statistical machine translation. Computational Linguistics, 30(4):417–449.

Och, F. J. and Ney, H. (2003). A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–51.

Och, F. J., Tillmann, C., Ney, H., and Informatik, L. F. (1999). Improved Alignment Models for Statistical Machine Translation. In Proceedings of the Joint EMNLP and Very Large Corpora, pages 20–28.

Owczarzak, K. (2008). A novel dependency-based evaluation metric for Machine Translation. PhD thesis, Dublin City University.

Padó, S. and Lapata, M. (2009). Cross-lingual Annotation Projection of Semantic Roles. Journal of Artificial Intelligence Research, 36(1):307–340.

Palmer, M., Gildea, D., and Kingsbury, P. (2005). The Proposition Bank: An Annotated Corpus of Semantic Roles. Computational Linguistics, 31(1):71–106.

Papineni, K., Roukos, S., Ward, T., and Zhu, W. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL, pages 311– 318.

Petrov, S. (2009). Coarse-to-fine Natural Language Processing. PhD thesis, University of California, Berkeley.

Petrov, S., Barrett, L., Thibaux, R., and Klein, D. (2006). Learning Accurate, Compact and Interpretable Tree Annotation. In Proceedings of COLING-ACL.

Petrov, S., Das, D., and McDonald, R. (2012). A Universal Part-of-Speech Tagset. In Proceedings of LREC.

Petrov, S. and McDonald, R. (2012). Overview of the 2012 shared task on parsing the web. Notes of the First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL), 59.

Pierce, J. R. and Carroll, J. B. (1966). Language and Machines: Computers in Translation and Linguistics. National Academy of Sciences/National Research Council, Washington, DC, USA.

Pighin, D. and Màrquez, L. (2011). Automatic Projection of Semantic Structures: An Application to Pairwise Translation Ranking. In Proceedings of SSST, pages 1–9.

Plank, B. (2011). Domain Adaptation for Parsing. PhD thesis, University of Groningen.

Post, M., Ganitkevitch, J., Orland, L., Weese, J., Cao, Y., and Callison-Burch, C. (2013). Joshua 5.0: Sparser, Better, Faster, Server. In Proceedings of WMT, pages 206–212.

Punyakanok, V., Roth, D., and Yih, W.-t. (2008). The importance of syntactic parsing and inference in semantic role labeling. Computational Linguistics, 34(2):257–287.

Quirk, C. (2004). Training a Sentence-Level Machine Translation Confidence Measure. In Proceedings of LREC, pages 825–828.

Quirk, C. and Corston-Oliver, S. (2006). The impact of parse quality on syntactically-informed statistical machine translation. In Proceedings of EMNLP, pages 62–69.

Ravi, S., Knight, K., and Soricut, R. (2008). Automatic Prediction of Parser Accuracy. In Proceedings of EMNLP, pages 887–896.

Roturier, J. and Bensadoun, A. (2011). Evaluation of MT Systems to Translate User Generated Content. pages 244–251.

Rubino, R., Foster, J., Kaljahi, R. S. Z., Roturier, J., and Hollowood, F. (2013). Estimating the quality of translated user-generated content. In Proceedings of IJCNLP, pages 1167–1173.

Rubino, R., Foster, J., Wagner, J., Roturier, J., Kaljahi, R., and Hollowood, F. (2012). DCU-Symantec Submission for the WMT 2012 Quality Estimation Task. In Proceedings of WMT, pages 138–144.

Schluter, N. and van Genabith, J. (2007). Preparing, Restructuring, and Augmenting a French Treebank: Lexicalised Parsers or Coherent Treebanks? In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics.

Schuler, K. K. (2006). VerbNet: A Broad-Coverage, Comprehensive Verb Lexicon. PhD thesis, University of Pennsylvania.

Seddah, D., Sagot, B., Candito, M., Mouilleron, V., and Combet, V. (2012). The French Social Media Bank: a Treebank of Noisy User Generated Content. In Proceedings of COLING, pages 2441–2458.

Shen, L., Xu, J., and Weischedel, R. (2010). String-to-dependency Statistical Machine Translation. Computational Linguistics, 36(4):649–671.

Snover, M., Dorr, B., Schwartz, R., Micciulla, L., and Makhoul, J. (2006). A study of translation edit rate with targeted human annotation. In Proceedings of AMTA.

Soricut, R., Bach, N., and Wang, Z. (2012). The SDL Language Weaver Systems in the WMT12 Quality Estimation Shared Task. In Proceedings of WMT, pages 145–151.

Soricut, R. and Echihabi, A. (2010). TrustRank: Inducing Trust in Automatic Translations via Ranking. In Proceedings of ACL, pages 612–621.

Specia, L. and Giménez, J. (2010). Combining confidence estimation and reference-based metrics for segment-level MT evaluation. In Proceedings of AMTA.

Specia, L., Shah, K., de Souza, J. G., and Cohn, T. (2013). QuEst - A translation quality estimation framework. In ACL: System Demonstrations, pages 79–84.

Specia, L., Turchi, M., Wang, Z., Shawe-Taylor, J., and Saunders, C. (2009). Improving the confidence of machine translation quality estimates. In Proceedings of MT Summit XII, pages 73–80.

Surdeanu, M., Johansson, R., Meyers, A., Màrquez, L., and Nivre, J. (2008). The CoNLL-2008 Shared Task on Joint Parsing of Syntactic and Semantic Dependencies. In Proceedings of CoNLL, pages 159–177.

Tateisi, Y., Yakushiji, A., and Ohta, T. (2005). Syntax annotation for the GENIA corpus. In Proceedings of IJCNLP, pages 220–225.

Taylor, A. (1996). Bracketing Switchboard: An Addendum to the Treebank II Guidelines. Technical report.

Titov, I. and Henderson, J. (2007). A Latent Variable Model for Generative Dependency Parsing. In Proceedings of IWPT, pages 144–155.

Titov, I., Henderson, J., Merlo, P., and Musillo, G. (2009). Online Projectivisation for Synchronous Parsing of Semantic and Syntactic Dependencies. In Proceedings of IJCAI, pages 1562–1567.

Tu, Z., He, Y., Foster, J., van Genabith, J., Liu, Q., and Lin, S. (2012). Identifying High-Impact Sub-Structures for Convolution Kernels in Document-level Sentiment Classification. In Proceedings of ACL, pages 338–343.

Turian, J., Shen, L., and Melamed, I. D. (2003). Evaluation of Machine Translation and its Evaluation. In Proceedings of MT Summit IX, pages 386–393.

Ueffing, N., Macherey, K., and Ney, H. (2003). Confidence measures for statistical machine translation. In Proceedings of MT Summit IX.

Ueffing, N., Och, F. J., and Ney, H. (2002). Generation of Word Graphs in Statistical Machine Translation. In Proceedings of EMNLP, pages 156–163.

van der Plas, L., Henderson, J., and Merlo, P. (2010a). D6.2: Semantic Role Annotation of a French-English Corpus.

van der Plas, L., Merlo, P., and Henderson, J. (2011). Scaling up Automatic Cross-Lingual Semantic Role Annotation. In Proceedings of ACL-HLT, pages 299–304.

van der Plas, L., Samardžić, T., and Merlo, P. (2010b). Cross-lingual Validity of PropBank in the Manual Annotation of French. In Proceedings of the Fourth Linguistic Annotation Workshop, LAW IV ’10, pages 113–117.

Wagner, J., Foster, J., and van Genabith, J. (2009). Judging grammaticality: Experiments in sentence classification. CALICO Journal, 26(3):474–490.

Warner, C., Bies, A., Brisson, C., and Mott, J. (2004). Addendum to the Penn Treebank II Style bracketing Guidelines: BioMedical Treebank Annotation. Technical report, University of Pennsylvania.

Wiegand, M. and Klakow, D. (2010). Convolution Kernels for Opinion Holder Extraction. In Proceedings of NAACL-HLT, pages 795–803.

Wu, D. and Fung, P. (2009). Can semantic role labeling improve SMT. In Proceedings of EAMT, pages 218–225.

Wu, D. and Wong, H. (1998). Machine translation with a stochastic grammatical channel. In Proceedings of ACL, pages 1408–1414.

Xiong, D., Liu, Q., and Lin, S. (2007). A dependency treelet string correspondence model for statistical machine translation. In Proceedings of WMT, pages 40–47.

Yamada, K. and Knight, K. (2001). A syntax-based statistical translation model. In Proceedings of ACL, pages 523–530.

Yarowsky, D. (1995). Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. In Proceedings of ACL, pages 189–196.

Yu, H., Wu, X., Xie, J., Jiang, W., Liu, Q., and Lin, S. (2014). RED: A Reference Dependency Based MT Evaluation Metric. In Proceedings of COLING, pages 2042–2051.

Zhang, H., Wang, H., Xiao, T., and Zhu, J. (2010). The impact of parsing accuracy on syntax-based SMT. In Proceedings of the International Conference on NLP- KE.

Zhang, J., Zhai, F., and Zong, C. (2011). Augmenting string-to-tree translation models with fuzzy use of source-side syntax. In Proceedings of EMNLP, pages 204–215.

Zollmann, A. and Venugopal, A. (2006). Syntax Augmented Machine Translation via Chart Parsing. In Proceedings of WMT, pages 138–141.

Zollmann, A., Venugopal, A., Och, F., and Ponte, J. (2008). A Systematic Comparison of Phrase-Based, Hierarchical and Syntax-Augmented Statistical MT. In Proceedings of COLING, pages 1145–1152.

Zwarts, S. and Dras, M. (2008). Choosing the Right Translation: A Syntactically Informed Classification Approach. In Proceedings of COLING, pages 1153–1160.

Appendix A

Quality Estimation Results
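The tables below report root mean squared error (RMSE) and Pearson's correlation coefficient (r) between predicted and gold quality scores. As a reminder, the following is a minimal sketch of how these two measures are computed; it is illustrative code, not the evaluation scripts used to produce these tables.

    import math
    from typing import Sequence

    def rmse(predicted: Sequence[float], gold: Sequence[float]) -> float:
        """Root mean squared error between predicted and gold quality scores."""
        return math.sqrt(sum((p - g) ** 2 for p, g in zip(predicted, gold))
                         / len(gold))

    def pearson_r(predicted: Sequence[float], gold: Sequence[float]) -> float:
        """Pearson correlation coefficient between the two score vectors."""
        n = len(gold)
        mean_p, mean_g = sum(predicted) / n, sum(gold) / n
        cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(predicted, gold))
        sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
        sd_g = math.sqrt(sum((g - mean_g) ** 2 for g in gold))
        return cov / (sd_p * sd_g) if sd_p and sd_g else 0.0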

Table A.1: Quality estimation results using the SymForum data set

System          1-HTER          HBLEU           Adequacy        Fluency
                RMSE    r       RMSE    r       RMSE    r       RMSE    r

Baseline QE systems
B-Mean          0.2442  -       0.2907  -       0.2501  -       0.2796  -
B-WMT17         0.2310  0.3661  0.2696  0.3806  0.2219  0.4710  0.2469  0.4769

Syntax-based QE systems
SyTK            0.2267  0.3693  0.2721  0.3559  0.2258  0.4306  0.2431  0.5013
B+SyTK          0.2243  0.3935  0.2655  0.4082  0.2215  0.4632  0.2403  0.5144
SyHC            0.2435  0.2572  0.2797  0.3080  0.2334  0.3961  0.2479  0.4696
B+SyHC          0.2265  0.4159  0.2689  0.4080  0.2221  0.4795  0.2387  0.5269
SyQE            0.2255  0.3824  0.2711  0.3650  0.2248  0.4393  0.2419  0.5087
B+SyQE          0.2236  0.4017  0.2686  0.3852  0.2219  0.4632  0.2391  0.5255
SyQEL           0.2273  0.3647  0.2731  0.3455  0.2249  0.4386  0.2415  0.5097
SyTK/C-Tm       0.2302  0.3331  0.2778  0.3114  0.2265  0.4236  0.2444  0.4912
SyTK/CD-STm     0.2257  0.3800  0.2715  0.3622  0.2253  0.4359  0.2425  0.5056
SyTK/C-ST       0.2292  0.3446  0.2749  0.3349  0.2266  0.4241  0.2442  0.4939

Semantic-based QE systems

SeTK/D-PASPOS   0.2489  0.1774  0.2856  0.1843  0.2423  0.2770  0.2652  0.3252
SeTK/D-PASword  0.2480  0.1789  0.2926  0.1669  0.2485  0.2660  0.2654  0.3221
SeTK/D-PSTw     0.2413  0.2082  0.2832  0.2237  0.2431  0.2567  0.2663  0.3080
SeTK/D-PSTwdp   0.2409  0.2136  0.2815  0.2450  0.2383  0.3169  0.2606  0.3670
SeTK/D-SASafx   0.2270  0.3699  0.2738  0.3391  0.2291  0.4022  0.2476  0.4731
SeTK/D-SASnode  0.2271  0.3667  0.2727  0.3483  0.2275  0.4169  0.2443  0.4930
SyTK/D-ST       0.2261  0.3778  0.2722  0.3546  0.2280  0.4118  0.2455  0.4860
SeTK/C-PST      0.2400  0.2319  0.2809  0.2541  0.2410  0.2966  0.2615  0.3616
SeTK/C-SAS      0.2289  0.3462  0.2744  0.3359  0.2261  0.4277  0.2441  0.4940
SeTK/CD-PST     0.2394  0.2311  0.2795  0.2714  0.2373  0.3303  0.2578  0.3923
SSTK            0.2269  0.3682  0.2722  0.3537  0.2253  0.4351  0.2425  0.5046
B+SSTK          0.2227  0.4104  0.2671  0.3948  0.2174  0.4957  0.2381  0.5273
SeHC            0.2482  0.1794  0.2868  0.1636  0.2416  0.2972  0.2612  0.3577
SSHC            0.2362  0.3107  0.2787  0.3027  0.2326  0.4009  0.2471  0.4726
B+SeHC          0.2310  0.3660  0.2697  0.3792  0.2275  0.4444  0.2439  0.4873
B+SSHC          0.2271  0.4066  0.2677  0.4030  0.2252  0.4658  0.2393  0.5234
WAPAM-UPrec     0.3325  0.1862  0.3851  0.1721  0.3319  0.2215  0.4334  0.2363
WAPAM-URec      0.3249  0.2483  0.3706  0.2338  0.3213  0.2796  0.4171  0.2927
WAPAM-UF1       0.3175  0.2328  0.3607  0.2179  0.3108  0.2698  0.4033  0.2865
WAPAM-LPrec     0.4260  0.1571  0.3978  0.1627  0.3898  0.1926  0.3737  0.2401
WAPAM-LRec      0.4230  0.1878  0.3903  0.1928  0.3827  0.2335  0.3614  0.2759
WAPAM-LF1       0.4247  0.1784  0.3903  0.1835  0.3839  0.2225  0.3586  0.2688
LTPAM-LPrec     0.3729  0.1674  0.3983  0.1534  0.3700  0.1523  0.4374  0.1509
LTPAM-LRec      0.3646  0.2245  0.3822  0.2110  0.3540  0.2281  0.4154  0.2208
LTPAM-LF1       0.3607  0.2089  0.3758  0.1942  0.3498  0.2045  0.4066  0.2012
LTPAMf-LPrec    0.3886  0.1837  0.3951  0.1699  0.3756  0.1739  0.4156  0.1841
LTPAMf-LRec     0.3830  0.2323  0.3814  0.2221  0.3633  0.2384  0.3963  0.2445
LTPAMf-LF1      0.3804  0.2190  0.3766  0.2061  0.3603  0.2183  0.3885  0.2281
WTPAM-LPrec     0.4166  0.1384  0.4013  0.1253  0.3886  0.1513  0.3928  0.1707
WTPAM-LRec      0.4122  0.1910  0.3896  0.1794  0.3782  0.2184  0.3747  0.2326
WTPAM-LF1       0.4131  0.1736  0.3889  0.1602  0.3792  0.1952  0.3711  0.2141
WTPAMf-LPrec    0.4543  0.1465  0.4217  0.1336  0.4186  0.1636  0.3924  0.1889
WTPAMf-LRec     0.4511  0.1886  0.4125  0.1776  0.4106  0.2185  0.3777  0.2383
WTPAMf-LF1      0.4523  0.1745  0.4120  0.1620  0.4116  0.1999  0.3742  0.2247
SeHCpam         0.2414  0.2292  0.2833  0.2195  0.2414  0.2787  0.2661  0.3210
SeHC+pam        0.2445  0.2387  0.2822  0.2368  0.2370  0.3571  0.2575  0.3908
B+SeHCpam       0.2274  0.3977  0.2666  0.4069  0.2198  0.4854  0.2419  0.5016
B+SeHC+pam      0.2337  0.3417  0.2701  0.3697  0.2224  0.4694  0.2439  0.4881
SeQE            0.2249  0.3884  0.2710  0.3648  0.2242  0.4447  0.2404  0.5182
B+SeQE          0.2219  0.4194  0.2670  0.3975  0.2188  0.4882  0.2362  0.5427
SSTKnew         0.2224  0.4152  0.2692  0.3788  0.2246  0.4409  0.2404  0.5158
SeHCnew         0.2449  0.2278  0.2873  0.2083  0.2459  0.3077  0.2651  0.3550
SeQEnew         0.2214  0.4223  0.2680  0.3892  0.2224  0.4584  0.2390  0.5233

Syntactico-semantic QE systems
SSQE            0.2246  0.3920  0.2696  0.3768  0.2230  0.4538  0.2402  0.5196
B+SSQE          0.2225  0.4144  0.2673  0.3953  0.2202  0.4771  0.2379  0.5331

Table A.2: Quality estimation results using the News data set

System          BLEU            1-TER           Meteor
                RMSE    r       RMSE    r       RMSE    r

Baseline QE systems
B-Mean          0.1626  -       0.1965  -       0.1657  -
B-WMT           0.1601  0.1766  0.1949  0.1565  0.1625  0.2047

Syntax-based QE systems
SyTK            0.1581  0.2437  0.1888  0.2774  0.1595  0.2715
B+SyTK          0.1570  0.2696  0.1879  0.2939  0.1576  0.3111
SyHC-all        0.1603  0.2108  0.1902  0.2510  0.1607  0.2493
SyHC            0.1603  0.1998  0.1913  0.2365  0.1610  0.2516
B+SyHC          0.1587  0.2418  0.1899  0.2611  0.1585  0.2964
SyQE            0.1577  0.2535  0.1887  0.2797  0.1594  0.2743
B+SyQE          0.1568  0.2802  0.1879  0.2937  0.1576  0.3127
SyQEL           0.1583  0.2341  0.1887  0.2796  0.1600  0.2606
SyTKL           0.1583  0.2350  0.1888  0.2792  0.1600  0.2620
SyHCL           0.1609  0.1750  0.1914  0.2262  0.1613  0.2336
SyTK/C-STH      0.1584  0.2307  0.1896  0.2641  0.1594  0.2748
SyTK/C-STL      0.1582  0.2348  0.1890  0.2733  0.1596  0.2698
SyTK/D-STH      0.1591  0.2103  0.1907  0.2412  0.1616  0.2213
SyTK/D-STL      0.1597  0.1902  0.1913  0.2279  0.1623  0.2025
SyTK/C-SH       0.1583  0.2312  0.1904  0.2521  0.1590  0.2824
SyTK/C-SL       0.1582  0.2335  0.1901  0.2554  0.1599  0.2638
SyTK/C-TH       0.1608  0.1479  0.1925  0.2018  0.1620  0.2124
SyTK/C-TL       0.1616  0.1204  0.1934  0.1800  0.1632  0.1773
SyTK/D-SH       0.1598  0.1869  0.1925  0.2004  0.1630  0.1832
SyTK/D-SL       0.1601  0.1780  0.1933  0.1816  0.1630  0.1835
SyTK/D-TH       0.1598  0.2102  0.1916  0.2204  0.1622  0.2051
SyTK/D-TL       0.1604  0.1679  0.1924  0.2037  0.1628  0.1867
SyHC/C-STH      0.1604  0.2046  0.1906  0.2443  0.1604  0.2538
SyHC/C-STL      0.1613  0.1739  0.1894  0.2663  0.1599  0.2643
SyHC/D-STH      0.1617  0.1593  0.1956  0.1609  0.1634  0.1813
SyHC/D-STL      0.1125  0.1633  0.1944  0.1491  0.1638  0.1535
SyHC/C-SH       0.1613  0.1748  0.1930  0.1874  0.1621  0.2115
SyHC/C-SL       0.1616  0.1652  0.1933  0.1803  0.1620  0.2113
SyHC/C-TH       0.1622  0.1585  0.1936  0.1801  0.1622  0.2116
SyHC/C-TL       0.1624  0.1409  0.1935  0.1747  0.1630  0.1861
SyHC/D-SH       0.1385  0.1624  0.1939  0.1616  0.1626  0.1932
SyHC/D-SL       0.1381  0.1625  0.1946  0.1466  0.1636  0.1597
SyHC/D-TH       0.1643  0.0583  0.1983  0.0247  0.1659  0.0979
SyHC/D-TL       0.1720  -0.0282 0.1978  -0.0032 0.1655  0.0567
SyTK/CD-S       0.1584  0.2294  0.1899  0.2573  0.1596  0.2690
SyTK/CD-T       0.1597  0.2101  0.1913  0.2270  0.1613  0.2299
SyTK/C-Tm       0.1591  0.2143  0.1940  0.1700  0.1602  0.2580
SyTK/CD-STm     0.1574  0.2609  0.1880  0.2918  0.1588  0.2862
