Exploiting Linguistic and Statistical Knowledge in A
Total Page:16
File Type:pdf, Size:1020Kb
Exploiting Linguistic and Statistical Knowledge in a Text Alignment System Dissertation zur Erlangung des Grades eines Doktors der Kognitionswissenschaft eingereicht am Fachbereich Humanwissenschaften der Universitat¨ Osnabruck¨ vorgelegt von Bettina Schrader Osnabruck,¨ Marz¨ 2007 Exploiting linguistic and statistical knowledge 0.0 Bettina Schrader 2 Dissertation Uberarbeitete¨ Fassung beruhend auf der begutachteten Fassung mit dem Titel Exploiting Linguistic and Statistical Knowledge in a Text Alignment System Dissertation zur Erlangung des Grades eines Doktors der Kognitionswissenschaft, eingereicht am Fachbereich Humanwissenschaften der Universitat¨ Osnabruck¨ Acknowledgements It is a laborious way from a first, sketchy idea for a thesis to its final version. In my case, the idea was born when I was writing the final section of my Master’s thesis at the Institut fur¨ Maschinelle Sprachverarbeitung in Stuttgart. Now, four years later, I’ve completed my dissertation at the Institut fur¨ Kognitionswissenschaft in Osnabruck,¨ a vital, lively place where everyone is open to new ideas, and concepts, sometimes radically so. And so I am deeply indebted to my supervisors Prof. Dr. Peter Bosch, Dr. habil Helmar Gust and Prof. Dr. Stefan Evert for encouraging, and supporting me, to elaborate my sketchy idea to a fully grown dissertation – a text alignment system that differs radically from any I’ve ever heard of before. Additionally, many thanks to Martin Volk for co-refereeing this thesis. Special thanks also go to Dr. Carla Umbach, the coordinator of the Cognitive Science Graduate Program, and the members of the Doctorate Colloquium, for helpful hints and discussions. I deeply appreciate the discussion with and help of Philip Reuter, Klaus Dalinghaus, Graham Katz, Angela Grimm, Peter Geibel, Joscha Bach, Gemma Boleda (for help with Spanish), Mar- tin Volk (who also provided morphological annotations), Judith Degen (especially for her word alignment annotations), and the participants of the Stockholm Symposium on Parallel Treebanks which took place in September 2006. I also thank my proof-readers Arne Fitschen, Julia Ritz, and Mark Hopkins. Many thanks also to the members of my dance club, most importantly to Britta Deuwerth, Thomas Vinke, Constanze Weth and Jurgen¨ Volbers for taking my mind off the thesis every now and then. Last, but definitely not least, I thank my parents and my brother: I would not have embarked on the mission to write a PhD thesis, much less completed, without their unwavering support. 3 Exploiting linguistic and statistical knowledge 0.0 Bettina Schrader 4 PhD Thesis Abstract Within machine translation, the alignment of corpora has evolved into a mature research area, aimed at providing training data for statistical or example-based machine translation systems. Moreover, the alignment information can be used for a variety of other purposes, including lex- icography and the induction of tools for natural language processing. The alignment techniques used for these purposes fall roughly in two separate classes: sentence alignment approaches that often combine statistical and linguistic information, and word alignment models that are dominated by the statistical machine translation paradigm. Alignment approaches that use linguistic knowl- edge provided by corpus annotation are rare, as are as non-statistical word alignment strategies. Furthermore, parallel corpora are typically not aligned at all text levels simultaneously. Rather, a corpus is first sentence aligned, and in a subsequent step, the alignment information is refined to go below the sentence level. In this thesis, the distinction between the two alignment classes is withdrawn. Rather, a system is introduced that can simultaneously align at the paragraph, sentence, word, and phrase level. Fur- thermore, linguistic as well as statistical information can be combined. This combination of align- ment cues from different knowledge sources, as well as the combination of the sentence and word alignment tasks, is made possible by the development of a modular alignment platform. Its main features are that it supports different kinds of linguistic corpus annotation, and furthermore aligns a corpus hierarchically, such that sentence and word alignments are cohesive. Alignment cues are not used within a global alignment model. Rather, different sub-models can be implemented and allowed to interact. Most of the alignment modules of the system have been implemented using empirical corpus studies, aimed at showing how the most common types of corpus annotation can be exploited for the alignment task. 5 Exploiting linguistic and statistical knowledge 0.0 Bettina Schrader 6 PhD Thesis Contents 1 The Global Picture 11 1.1 A Closer Look . 13 1.1.1 Definition of the Task . 13 1.1.2 Existing Parallel Corpora and Treebanks . 15 1.1.3 Uses of Aligned Corpora . 18 1.2 Requirements . 20 1.3 Overview . 21 2 Previous Approaches to Bilingual Text Alignment 23 2.1 Sentence Alignment . 24 2.1.1 Methods using Sentence Length . 24 2.1.2 Methods using Anchor Points . 26 2.1.3 Hybrid Approaches . 35 2.1.4 Discussion of the Sentence Alignment Approaches . 37 2.2 Word Alignment . 39 2.2.1 The Basic Idea: A Statistical Translation Model . 39 2.2.2 Competing Alignment Models . 41 2.2.3 Improvements of Word Alignment Approaches . 45 2.2.4 Discussion . 52 2.3 Phrase Alignment . 53 2.3.1 Alignment of Phrases based on their Similarities . 53 2.3.2 Alignment of Phrases based on Word Links . 54 2.3.3 Summary . 56 2.4 Discussion . 56 3 The Text Alignment System ATLAS 59 3.1 Design Principles . 59 3.2 Supported Language Pairs . 61 3.3 The Development Corpora . 62 3.3.1 The EUROPARL Corpus . 62 3.3.2 The LITERATURE Corpus . 63 3.3.3 The EUNEWS Corpus . 63 3.4 The Currently Supported Types of Corpus Annotation . 63 3.5 The System’s Architecture . 65 3.5.1 The ATLAS Task Manager . 66 3.5.2 The Alignment Strategies . 70 3.5.3 The Alignment Disambiguation . 72 3.6 The Output: An Annotated, Aligned Parallel Corpus . 73 3.7 An Example Alignment Process . 74 7 Exploiting linguistic and statistical knowledge 0.0 3.8 Computational Issues . 76 4 Alignment Phenomena 79 4.1 Length-based Sentence Alignment . 79 4.2 Cognate-based Alignment . 82 4.2.1 Resource-independent Approach . 82 4.2.2 Resource-dependent Approach . 84 4.3 Dictionary-based alignment . 85 4.4 Linear Ordering of Paragraphs and Sentences . 86 4.5 Aligning Nominals: Expression Length and Category Membership . 88 4.5.1 Alignment of Compounds based on Word Length . 93 4.5.2 Alignment of Nominals based on Morphological Information . 97 4.6 Word Category Membership . 98 4.7 Lexicon Induction using Word Rank Correlation . 105 4.8 Phrase Alignment . 112 4.9 Inheritance . 115 4.10 Merge of Overlapping Hypotheses . 116 4.11 Process of Elimination . 117 4.12 Tuning of the Alignment Strategies . 117 4.13 Summary . 120 5 Evaluation 121 5.1 Formal Evaluation Requirements . 122 5.1.1 The Gold Standard . 122 5.1.2 The Annotation Guideline . 122 5.1.3 The Evaluation Metrics . 123 5.2 Former Approaches to Alignment Evaluation . 125 5.2.1 ARCADE I: Evaluating Sentence Alignment . 125 5.2.2 ARCADE I and II: Translation Spotting . 126 5.2.3 BLINKER: Creation of a Gold Standard . 128 5.2.4 The PLUG Approach: Taking Partial Links into Account . 132 5.2.5 The Standard Exercise: Precision, Recall and Alignment Error Rate . 136 5.2.6 The PWA approach: Refining Precision and Recall . 141 5.3 Definition of a New Alignment Evaluation Approach . 143 5.3.1 Gold Standard . 143 5.3.2 Annotation Guideline . 145 5.3.3 Guiding Principles . 147 5.3.4 Evaluation Metrics . 157 5.4 Evaluating ATLAS .................................. 162 5.4.1 Sentence and Paragraph Alignment Quality on the EuNews Corpus . 163 5.4.2 Sentence and Paragraph Alignment Quality on Europarl . 164 5.4.3 Word Alignment Quality on Europarl . 165 5.5 Summary . 168 6 Conclusion 169 6.1 Summary . 169 6.2 Future Improvement Directions . 171 Bettina Schrader 8 PhD Thesis Exploiting linguistic and statistical knowledge 0.0 A Technical Details 173 A.1 Adding New Languages . 173 A.2 Input and Format Specifications . 173 A.3 Extending the Set of Supported Corpus Annotations . 174 A.4 New Alignment Strategy . 175 A.5 Output and Output Formats . 176 B Example XML Files 179 B.1 Options File . 179 B.2 Lexicon File . 181 B.3 Corpus Files . 182 B.3.1 Monolingual Corpus File . 182 B.3.2 Alignment File . 183 C DTDs 185 D Tagsets 187 D.1 English: Penn Treebank Tagset . 187 D.2 French: Tree-Tagger Tagset . 188 D.3 German: Stuttgart-Tubingen-Tagset¨ . 189 D.4 Italian: Tree-Tagger Tagset . 190 D.5 Spanish: Tree-Tagger Tagset . 191 D.6 Swedish: SUC Tagset . 192 E Parameter Files 193 E.1 English Parameter File . 193 E.2 French Parameter File . ..