Automatic Syntactic Processing of Modern Hebrew

Automatic Syntactic Processing of Modern Hebrew Dissertation submitted in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY by Yoav Goldberg Submitted to the Senate of Ben-Gurion University of the Negev November 2011 Beer-Sheva Automatic Syntactic Processing of Modern Hebrew Dissertation submitted in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY by Yoav Goldberg Submitted to the Senate of Ben-Gurion University of the Negev Author signature: Approved by the advisor: Approved by the Dean of the Kreitman School of Advanced Graduate Studies: November 2011 Beer-Sheva This work was carried out under the supervision of Prof. Michael Elhadad In the Department of Computer Science Faculty: Natural Sciences To those who find it interesting and those who don’t Contents Abstract vi Acknowledgements x List of Figures xiii List of Tables xiii 1 Introduction 1 1.1 Main themes . .3 1.1.1 Separation of lexical and syntactic knowledge . .3 1.1.2 Using morphological information to improve parsing accuracy .4 1.1.3 Encoding input uncertainty using a lattice-based representation .4 1.2 Contributions . .4 1.2.1 Parsing systems for Modern Hebrew . .4 1.2.2 Modern Hebrew dependency Treebank . .5 1.2.3 Algorithmic and methodological contributions . .5 1.3 Structure of this thesis . .6 I Background 8 2 Syntactic Representations of Language 9 2.1 Syntactic structures . .9 2.1.1 Chunks . .9 2.1.2 Dependency trees . 10 2.1.3 Constituency trees . 13 2.1.4 Constituency to dependency conversion . 14 2.2 Syntactic disambiguation . 15 2.3 Grammar-based vs. data-driven parsers . 17 ii 2.4 Applications of syntactic structure . 18 3 Modern Hebrew 21 3.1 Lexical level . 21 3.1.1 Unvocalized orthography . 21 3.1.2 Affixation . 21 3.1.3 Rich templatic morphology . 22 3.1.4 The participle form . 23 3.2 Syntactic level . 23 3.2.1 Relatively free constituent order . 23 3.2.2 Verbless constructions . 24 3.2.3 NP structure and construct-state . 24 3.2.4 Definiteness . 24 3.2.5 Case marking . 25 3.2.6 Agreement . 25 4 Automatic Syntactic Disambiguation in Modern Hebrew 26 4.1 Aspects of Modern Hebrew parsing . 26 4.1.1 Small amount of annotated data . 26 4.1.2 Ambiguous word segmentation . 27 4.1.3 Morphological variation and high out-of-vocabulary rate . 27 4.1.4 Morphological agreement . 29 4.2 Existing resources for Hebrew text processing . 29 4.2.1 The Hebrew constituency-Treebank . 29 4.2.2 The MILA broad-coverage lexicon . 31 4.2.3 Hebrew morphological disambiguator . 33 4.2.4 A resource-incompatibility issue . 33 4.3 Summary . 34 II Dependency Parsing 36 5 Hebrew Dependency Treebank 37 5.1 Dependency annotation guidelines . 37 5.1.1 Tagging and segmentation . 37 5.1.2 Inter-word relations . 38 5.1.3 Dependency labels . 44 5.2 Constituency to dependency conversion . 45 5.2.1 Fixes and modifications to the Hebrew constituency-Treebank . 46 5.2.2 Head-assignment rules . 47 5.2.3 Treebank statistics . 49 5.3 Summary . 49 6 Background on Dependency Parsing 52 6.1 Evaluation measures . 52 6.2 Dependency-parsing algorithms . 54 6.2.1 Transition-based (Shift Reduce) dependency parsing . 54 6.2.2 Graph-based dependency parsing . 57 6.2.3 Hybrid-systems and ensemble-methods . 59 6.2.4 Labeled dependency parsing . 59 6.3 Dep. parsing of morphologically rich languages . 60 7 EASYFIRST Dependency Parsing 62 7.1 Non-directional easy-first parsing . 62 7.2 The transition system . 63 7.3 Parsing algorithm . 65 7.4 Learning algorithm . 67 7.5 Feature representation . 70 7.6 Computational complexity . 72 7.7 Evaluation on English . 77 7.7.1 Parse diversity . 79 7.7.2 Error analysis / limitations . 80 7.8 Related work . 81 7.9 Chapter summary . 82 8 Hebrew Dependency Parsing 83 8.1 Architecture . 83 8.1.1 Justification for a pipeline architecture . 83 8.1.2 Lexicon/syntax separation . 85 8.2 Data and baseline experiments . 85 8.2.1 Baseline results . 86 8.3 Agreement in the EASYFIRST parser . 89 8.3.1 Cases where agreement can disambiguate attachments . 89 8.3.2 Easy-first morphological agreement features . 92 8.3.3 Evaluation and results . 95 8.4 An ensemble-based parser . 96 8.4.1 Evaluation and results . 97 8.5 Adding an edge labeler . 97 8.5.1 Evaluation and results . 99 8.6 Evaluating the final system . 100 8.7 Conclusions . 101 III Constituency Parsing 103 9 Background on Constituency Parsing 104 9.1 Evaluation measures . 104 9.2 Probabilistic context-free grammars . 105 9.3 Grammar binarization . 107 9.4 The CKY algorithm . 108 9.5 Treebank grammars and grammar refinement . 109 9.6 Automatic state-split grammars (PCFG-LA) . 110 9.7 Parsing morph. rich languages . 112 9.7.1 Arabic . 113 9.7.2 Hebrew and relational-realizational parsing . 113 9.8 Chapter summary . 115 10 A Hebrew Constituency Parser 116 10.1 Baseline experiments . 117 10.1.1 Analyzing the Learned PCFG-LA Grammar . 118 10.1.2 Limitation of PCFG-LA parsing of Modern Hebrew . 120 10.2 Manual state-splits . 121 10.3 Better lexical coverage with an external lexicon . 122 10.3.1 A unified lexical probability model . 123 10.4 Joint segmentation and parsing . 127 10.4.1 Lattice representation . 127 10.4.2 Lattice parsing . 128 10.5 Incorporating morphology . 130 10.5.1 Forcing morphologically-motivated splits . 130 10.5.2 Agreement as filter . 131 10.5.3 The Hebrew agreement filter . 133 10.6 Evaluation and results . 137 10.7 The final model . 139 10.8 Summary . 140 11 Conclusions and Future Work 142 11.1 Open Issues . 146 Bibliography 148 Abstract Language is composed of words, which are combined to form sentences. While single words can convey myriad meanings, it is the combination of words into sentences that allows for efficient communication and the realization of complex ideas. When words are combined to form a sentence, their combination is governed by a set of rules, called the syntax, or the grammar, of the language. Sentences must obey structural constraints posed by the syntax, and the structure of the sentence determines its meaning. The main focus of this thesis is syntactic parsing, the task of assigning syntactic analysis to sentences. The approach taken is data-driven: starting with a list of sentences and their known syntactic structures (a Treebank), a learning algorithm is trained to produce a parser which can assign syntactic structures to novel sentences unseen to the learning algorithm. In this thesis, I focus on automatic parsing of Modern Hebrew, a semitic language with a rich and productive morphology, relatively free word order and a small Treebank. The primary goal of the thesis is the creation of automatic systems capable of parsing Modern Hebrew text with good accuracy. A large part of modern parsing literature is devoted to automatic parsing of English, a language with a relatively simple morphology, relatively fixed word order, and a large Treebank. Data-driven English parsing is now at the state where naturally occurring text in the news domain can be automatically parsed with accuracies of around 90% (according to standard parsing evaluation measures). However, when moving from English to languages with richer morphologies and less-rigid word orders, the parsing algorithms developed for English exhibit a large drop in accuracy. To overcome this drop, I concentrate on aspects in which Modern Hebrew differs from English. In English, word-order is relatively fixed, while in Hebrew as in many other languages word order is much more flexible (for example, the subject may appear either before or after a verb). In languages with flexible word order, the meaning of the sentence is realized using other structural elements, like word inflections or markers, which are referred to as morphology. The lexical units (words) in English are always vii separated by white space. In Hebrew (as in Arabic) most words are separated by white space, but many of the function words do not stand on their own and are instead attached to the following words. The resulting combinations (function word + following word) become a single token, which, crucially, can be read ambiguously, as either a single word or a word combination. Several natural questions arise: can the small size of the Treebank be compensated using other available resources or sources of information? Can morphological information be used effectively in order to improve parsing accuracy? How should the word segmentation issue (that function words do not appear in isolation but attach to the next word, forming ambiguous letter patterns) be handled? Based on these questions, the thesis revolves around the following themes: Separation of lexical and syntactic knowledge: I argue that the amount of syntactic knowledge needed for a parsing system is relatively limited, and that sufficiently large parts of it can be captured also based on a relatively small Treebank. Lexical knowledge, on.

Load more