Resourcing Machine Translation with Parallel Treebanks John Tinsley
A dissertation submitted in fulfilment of the requirements for the award of Doctor of Philosophy (Ph.D.) to the Dublin City University School of Computing

Supervisor: Prof. Andy Way

December 2009

I hereby certify that this material, which I now submit for assessment on the programme of study leading to the award of Ph.D. is entirely my own work, that I have exercised reasonable care to ensure that the work is original, and does not to the best of my knowledge breach any law of copyright, and has not been taken from the work of others save and to the extent that such work has been cited and acknowledged within the text of my work.

Signed: (Candidate) ID No.: Date:

Contents

Abstract
Acknowledgements
List of Figures
List of Tables
1 Introduction
2 Background and the Current State-of-the-Art
  2.1 Parallel Treebanks
    2.1.1 Sub-sentential Alignment
    2.1.2 Automatic Approaches to Tree Alignment
  2.2 Phrase-Based Statistical Machine Translation
    2.2.1 Word Alignment
    2.2.2 Phrase Extraction and Translation Models
    2.2.3 Scoring and the Log-Linear Model
    2.2.4 Language Modelling
    2.2.5 Decoding
  2.3 Syntax-Based Machine Translation
    2.3.1 Statistical Transfer-Based MT
    2.3.2 Data-Oriented Translation
    2.3.3 Other Approaches
  2.4 MT Evaluation
    2.4.1 BLEU
    2.4.2 NIST
    2.4.3 METEOR
    2.4.4 Drawbacks of Automatic Evaluation
    2.4.5 Statistical Significance
  2.5 Summary
3 Sub-Tree Alignment: development and evaluation
  3.1 Prerequisites
    3.1.1 Alignment Principles
    3.1.2 Alignment Well-Formedness Criteria
  3.2 Algorithm
    3.2.1 Basic Configuration
    3.2.2 Resolving Competing Hypotheses (skip)
    3.2.3 Delaying Lexical Alignments (span)
    3.2.4 Calculating Hypothesis Scores
  3.3 Aligner Evaluation
    3.3.1 Data
    3.3.2 Intrinsic Evaluation
    3.3.3 Extrinsic Evaluation
    3.3.4 Manual Evaluation
    3.3.5 Discussion and Conclusions
  3.4 Summary
4 Exploiting Parallel Treebanks in Phrase-based SMT
  4.1 Supplementing PB-SMT with Syntax-Based Phrases: pilot experiments
    4.1.1 Data Resources
    4.1.2 Phrase Extraction
    4.1.3 MT System Setup
    4.1.4 Results
    4.1.5 Discussion
    4.1.6 Summary
  4.2 Supplementing PB-SMT with Syntax-Based Phrases: scaling up
    4.2.1 Experimental Setup
    4.2.2 Direct Phrase Combination
    4.2.3 Prioritised Phrase Combination
    4.2.4 Weighting Syntax-Based Phrases
    4.2.5 Filtering Treebank Data
    4.2.6 Training Set Size: Effect on Influence of Syntax-Based Phrase Pairs
  4.3 Exploring Further Uses of Parallel Treebanks in PB-SMT
    4.3.1 Treebank-Driven Phrase Extraction
    4.3.2 Treebank-Based Lexical Weighting
  4.4 New Language Pairs: IWSLT Participation
    4.4.1 Task Description
    4.4.2 Results
    4.4.3 Conclusions
  4.5 Comparing Constituency and Dependency Structures for Syntax-Based Phrase Extraction
    4.5.1 Syntactic Annotations
    4.5.2 Data and Experimental Setup
    4.5.3 Results
    4.5.4 Conclusions
  4.6 Summary
5 Exploiting Parallel Treebanks in Syntax-Based MT
  5.1 Data and Experimental Setup
  5.2 Stat-XFER: Exploiting Parallel Trees
    5.2.1 Phrase Extraction
    5.2.2 Grammar Extraction
  5.3 Stat-XFER Results and Discussion
    5.3.1 Automatically Derived Grammar: Results
  5.4 Phrase-Based Translation Experiments
  5.5 Summary
6 Conclusions
  6.1 Future Work
Appendices
  A English Parser Tag Set
  B French Parser Tag Set
  C 40-Rule Automatic Grammar
  D Full Parse Trees
Bibliography

Abstract

The benefits of syntax-based approaches to data-driven machine translation (MT) are clear: given the right model, a combination of hierarchical structure, constituent labels and morphological information can be exploited to produce more fluent, grammatical translation output. This has been demonstrated by the recent shift in research focus towards such linguistically motivated approaches. However, one issue facing developers of such models that is not encountered in the development of state-of-the-art string-based statistical MT (SMT) systems is the lack of available syntactically annotated training data for many languages. In this thesis, we propose a solution to the problem of limited resources for syntax-based MT by introducing a novel sub-sentential alignment algorithm for the induction of translational equivalence links between pairs of phrase structure trees. This algorithm, which operates on a language pair-independent basis, allows for the automatic generation of large-scale parallel treebanks which are useful not only for machine translation, but also across a variety of natural language processing tasks.
We demonstrate the viability of our automatically generated parallel treebanks by means of a thorough evaluation process during which they are compared to a manually annotated gold standard parallel treebank both intrinsically and in an MT task. Following this, we hypothesise that these parallel treebanks are not only useful in syntax-based MT, but also have the potential to be exploited in other paradigms of MT. To this end, we carry out a large number of experiments across a variety of data sets and language pairs, in which we exploit the information encoded within the parallel treebanks in various components of phrase-based statistical MT systems. We demonstrate that improvements in translation accuracy can be achieved by enhancing SMT phrase tables with linguistically motivated phrase pairs extracted from a parallel treebank, while showing that a number of other features in SMT can also be supplemented with varying degrees of effectiveness. Finally, we examine ways in which synchronous grammars extracted from parallel treebanks can improve the quality of translation output, focussing on real translation examples from a syntax-based MT system.

Acknowledgements

First and foremost, I’d like to extend a huge thanks to my supervisor Andy Way. He was the person who sparked my interest in MT initially and has been a constant source of pragmatic advice and encouragement, not only over the course of this thesis, but since I started in DCU back in 2002. He’s almost the ideal supervisor – if only he supported Liverpool.

Secondly, I’d like to thank Mary Hearne, without whom the initial hurdles encountered by me as a fresh-faced PhD student would have been a lot more difficult to overcome. She was a fountain of information not only on technical aspects of my work, but also in terms of practical advice. Thanks also to Ventsi for his collaboration and company as we began our theses together. I’d like to think we made things a little easier for one another.
I wish to acknowledge the support of the various bodies who funded my research, notably Science Foundation Ireland and Microsoft Ireland, and the Irish Centre for High End Computing for the use of their resources.

Big thanks go to the members of the NCLT/CNGL, both past and present, including Ankit, Declan, Grzegorz (for his help with the Spanish parser especially), Harold, Joachim, Josef, Karolina, Nicolas, Patrik, Sara, Sergio, Sylwia and Yvette to name a few/lot, for their questions, discussions and general interest regarding my work. And thanks to Augusto for weaning me off Windows before it was too late.

At the beginning of 2009, I spent a number of weeks at Carnegie Mellon University in Pittsburgh. This was a particularly enjoyable and fruitful period and for that I’d like to thank Alon Lavie. Thanks also to Greg, Vamshi and Jon for their discussions and help while I was there (and when I came back home), particularly in terms of getting the Stat-XFER system up and running so that I could write Chapter 5! Thanks, too, to Stephan for hosting me in Pittsburgh.

A special thanks goes to my friends, particularly Rose, for providing me with sufficient distraction from my work over the last 3+ years so that it never consumed me (too much).

Finally, a wholehearted thank you to my family for their support, in all senses of the word, over the course of the last 8 years I have spent in university. Don’t worry, I’ll get a real job soon.

List of Figures

2.1 An example English–Spanish parallel treebank entry
2.2 Example of a tree pair exhibiting lexical divergence
2.3 Example of varying granularity of information encapsulated in a tree alignment
2.4 An example of an English-to-Spanish word alignment
2.5 Example of the benefits of phrase-based translation over word-based models