Lexical Selection for Machine Translation

Lexical Selection for Machine Translation A thesis submitted to the University of Manchester for the degree of Doctor of Philosophy in the Faculty of Engineering and Physical Sciences 2011 Yasser Muhammad Naguib Sabtan School of Computer Science Contents CONTENTS ................................................................................................................ 2 LIST OF FIGURES ................................................................................................... 6 LIST OF TABLES ..................................................................................................... 8 ABSTRACT .............................................................................................................. 10 DECLARATION ...................................................................................................... 11 COPYRIGHT STATEMENT ................................................................................. 12 PUBLICATION BASED ON THE THESIS ......................................................... 13 ACKNOWLEDGMENTS ....................................................................................... 14 TRANSLITERATION TABLE .............................................................................. 15 LIST OF ABBREVIATIONS AND ACRONYMS ............................................... 16 CHAPTER 1 INTRODUCTION ........................................................................ 18 1.1 Motivation .................................................................................................. 18 1.2 About Arabic .............................................................................................. 21 1.3 Research Aim ............................................................................................. 25 1.4 Thesis Structure .......................................................................................... 33 CHAPTER 2 DESCRIPTION OF THE CORPUS ........................................... 37 2.1 Types of Corpora........................................................................................ 37 2.2 The Rationale behind our Selection ........................................................... 38 2.3 Robustness of the Approach ...................................................................... 40 2.3.1 Some Linguistic Features of the Qur‟anic Corpus ................................. 41 2.3.1.1 Lack of Punctuation ...................................................................... 42 2.3.1.2 Foregrounding and Backgrounding .............................................. 42 2.3.1.3 Lexical Compression .................................................................... 44 2.3.1.4 Culture-Bound Items..................................................................... 45 2.3.1.5 Metaphorical Expressions ............................................................. 46 2.3.1.6 Verbal Idioms ............................................................................... 47 2.3.1.7 Grammatical Shift ......................................................................... 49 2.4 Summary .................................................................................................... 50 CHAPTER 3 AN OVERVIEW OF MACHINE TRANSLATION (MT) ....... 52 3.1 Introduction ................................................................................................ 52 3.2 Basic MT Strategies ................................................................................... 53 3.3 Paradigmatic Approaches to MT ............................................................... 55 3.3.1 Rule-Based Machine Translation (RBMT) ............................................ 55 3.3.2 Corpus-Based Machine Translation (CBMT) ........................................ 56 3.3.2.1 Example-Based Machine Translation (EBMT) ............................ 57 3.3.2.2 Statistical Machine Translation (SMT)......................................... 60 3.3.3 Hybrid Approaches ................................................................................ 62 3.4 State of the Art in Lexical Selection .......................................................... 62 3.5 Summary .................................................................................................... 73 CHAPTER 4 PART-OF-SPEECH (POS) TAGGING ..................................... 75 4.1 Introduction ................................................................................................ 75 2 4.2 Arabic Morphological Analysis ................................................................. 75 4.2.1 Arabic Grammatical Parts-of-Speech .................................................... 76 4.2.2 Arabic Roots and Patterns ...................................................................... 77 4.2.3 Linguistic Analysis of Non-concatenative Morphology ........................ 79 4.2.4 Arabic Word Structure ........................................................................... 82 4.3 POS Tagging .............................................................................................. 83 4.3.1 Tagsets.................................................................................................... 84 4.3.2 POS Tagging Approaches ...................................................................... 85 4.3.3 Challenges for Arabic POS Tagging ...................................................... 87 4.3.4 Review of Arabic POS Taggers ............................................................. 88 4.4 Arabic Lexicon-Free POS Tagger .............................................................. 90 4.4.4 Arabic Tagset ......................................................................................... 91 4.4.1.1 Open-Class Words ........................................................................ 92 4.4.1.2 Closed-Class Words ...................................................................... 93 4.4.2 Our Approach to Arabic POS Tagging .................................................. 98 4.4.2.1 Rule-Based Tagging ................................................................... 104 4.4.2.1.1 Tagging Algorithm ................................................................. 104 4.4.2.1.2 Handling Closed-Class Words ............................................... 107 4.4.2.1.3 Handling Open-Class Words .................................................. 109 4.4.2.1.3.1 Nouns ............................................................................... 111 4.4.2.1.3.2 Verbs ............................................................................... 116 4.4.2.1.4 Problems ................................................................................. 119 4.4.2.2 Transformation-Based Learning (TBL) ...................................... 121 4.4.2.2.1 Training and Test Sets ............................................................ 126 4.4.2.3 Bayesian Model .......................................................................... 128 4.4.2.4 Maximum Likelihood Estimation (MLE) ................................... 133 4.4.2.5 TBL Revisited ............................................................................. 135 4.5 English Lexicon-Free POS Tagger .......................................................... 137 4.6 Summary .................................................................................................. 141 CHAPTER 5 DEPENDENCY RELATIONS .................................................. 143 5.1 Introduction .............................................................................................. 143 5.2 Arabic Syntax: A Descriptive Analysis ................................................... 145 5.2.1 Basic Arabic Sentence Structure .......................................................... 147 5.2.1.1 Nominal Sentences ..................................................................... 148 5.2.1.2 Verbal Sentences......................................................................... 152 5.2.2 Construct Phrases ................................................................................. 154 5.2.3 Agreement & Word Order ................................................................... 155 5.2.4 Sources of Syntactic Ambiguity in Arabic........................................... 159 5.2.4.1 Lack of Diacritics........................................................................ 160 5.2.4.2 Arabic Pro-drop .......................................................................... 161 5.2.4.3 Word Order Variation ................................................................. 163 5.3 Main Approaches to Syntactic Analysis .................................................. 164 5.3.1 Phrase Structure Grammar (PSG) ........................................................ 165 5.3.2 Dependency Grammar (DG) ................................................................ 168 5.3.2.1 The Notion of Dependency ......................................................... 169 5.3.2.2 DG Notational Variants .............................................................. 172 5.3.2.3 Conversion from DG Analysis to PSG Analysis ........................ 176 5.3.2.4 The Notion of Valency ............................................................... 177 5.3.2.5 Dependency Parsing & Free Word Order Languages ................ 180 3 5.4 Arabic Lexicon-Free Dependency Parser ................................................ 184 5.4.1 Introduction .......................................................................................... 184 5.4.2 Arabic Dependency Relations

Load more