Hebrew Morphological Disambiguation: an Unsupervised Stochastic Word-Based Approach
Dissertation submitted in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY
by Menahem (Meni) Adler
Submitted to the Senate of Ben-Gurion University of the Negev
September 2007
Beer-Sheva

This work was carried out under the supervision of Dr. Michael Elhadad
In the Department of Computer Science
Faculty: Natural Sciences

To Ora
In loving memory of my father - Baruch

Contents

Abstract
List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 Starting Points
  1.3 Implementation
  1.4 Contributions
  1.5 Guide to the Rest of the Dissertation
2 Background
  2.1 Morphological Models
  2.2 Morpheme
  2.3 Lexeme
    2.3.1 Definition
    2.3.2 Derivation
  2.4 Hebrew Word Definition
    2.4.1 General Overview
    2.4.2 Definite Article
    2.4.3 Formative Letters
    2.4.4 Pronoun Suffix
    2.4.5 Notation
  2.5 Hebrew Lexical Categories
  2.6 Hebrew Inflectional Properties
  2.7 Morphological Analyzer
  2.8 Morphological Disambiguator
    2.8.1 Motivation
    2.8.2 Disambiguation Methods
    2.8.3 Disambiguation in Various Languages
3 Objectives
4 Resources
  4.1 Corpora
  4.2 Morphological Analyzer
5 Tagset Design
  5.1 Methodology
  5.2 Modality
    5.2.1 Modality in Hebrew
    5.2.2 Proposed Modal Guidelines
    5.2.3 The Importance of Tagging Modals
  5.3 Beinoni
    5.3.1 Beinoni in Hebrew
    5.3.2 A Lexical Category for Beinoni
    5.3.3 Conclusion
  5.4 Adverbs
    5.4.1 Adverbs in Modern Hebrew
    5.4.2 Distinguishing Criteria
    5.4.3 Summary
  5.5 Prepositions
    5.5.1 Prepositions in Modern Hebrew
    5.5.2 Distinguishing Criteria
    5.5.3 Summary
  5.6 Conclusion
6 Computational Model
  6.1 Token-Based HMM
  6.2 Word-Based HMM
  6.3 Learning and Searching Algorithms for Uncertain Output Observation
    6.3.1 Output Representation
    6.3.2 Parameter Estimation
    6.3.3 Searching for the Best-state Sequence
    6.3.4 Similar Work
  6.4 Conclusions
7 Unknown Words and Analyses
  7.1 Motivation
  7.2 Strategy
  7.3 Previous Work on Unknown Words Tagging
  7.4 Neologisms Detection
    7.4.1 Method
    7.4.2 Evaluation
  7.5 Proper Nouns Identification
  7.6 Results
8 Evaluation
  8.1 Influence of Initial Conditions
  8.2 Structure of the Stochastic Model: Dependency Scheme
  8.3 Model Order
  8.4 Influence of the Training Set Size
  8.5 Token-oriented Model vs. Word-oriented Model
  8.6
9 Applications
  9.1 Noun-Phrase Chunking
    9.1.1 Previous Work
    9.1.2 Hebrew Simple NPs
    9.1.3 Evaluation
  9.2 Named Entities Recognition
    9.2.1 Models
    9.2.2 Evaluation
10 Contributions and Future Work
  10.1 Contributions
  10.2 Future Work
Appendices
A Hebrew Morphology
  A.1 Verb Inflections
  A.2 Noun Inflections
  A.3 Short Formative Words
  A.4 Pronominal Pronoun Suffixes
  A.5 Inflection and Affixation according to Lexical Category
B Selected Tagging Guidelines
  B.1 Beinoni
    B.1.1 Nouns vs. Verbs
    B.1.2 Adjectives vs. Verbs
    B.1.3 Adjectives vs. Nouns
Bibliography

Abstract

L’objet de ce livre n’est pas exactement le vide, ce serait plutôt ce qu’il y a autour, ou dedans.
(Georges Perec, Espèces d’espaces)

Morphology is the field of linguistic theory that deals with the internal structure of words. The task of a morphological analyzer is to produce all possible analyses for a given word: which lexeme, prefix, and suffix it includes and, for each of these, its part of speech and the list of its inflections. The task of a morphological disambiguator is to pick the most likely analysis among those produced by an analyzer. Selecting the most likely analysis requires taking the context of each word into account. This work deals with morphological disambiguation of words in Modern Hebrew text.

Morphological disambiguation is an essential component in many natural language processing (NLP) applications. An information retrieval system, for instance, should find the correct part of speech of the Hebrew token ḥoze, in order to index in terms of contract (noun) or watch (verb).
A text-to-speech system should determine the gender of the Hebrew token ’išah: feminine (a woman) or masculine (her husband). A machine translation system must determine the tense of the Hebrew verb sprw: imperative (count!) or past (they cut hair). In addition, morphological analysis can be used as a knowledge base input for other applications. A syntactic parser makes use of the lexical category of the tokens in the text. A noun phrase chunker may be interested in the construct state property of the words to be chunked. Word prediction can be more accurate if the morphological attributes of the previous words are taken into account.

In this work, we investigate unsupervised methods for morphological disambiguation, and present a disambiguation system for Hebrew. The main contributions of this work are:

• Analysis system for Hebrew: We have implemented a complete analysis system for Hebrew that combines all the algorithms and models described in this work. Given a Hebrew text, the system assigns a full set of morphological features to each word, extracts noun phrases, and recognizes entity names (persons, locations, organizations, temporal and number expressions). A fully operating version of the system is available online at: http://www.cs.bgu.ac.il/~nlpproj/demo. The system is implemented in Java, and analyzes about 1,000 words per second on a standard 2007 PC (1GB of RAM).

• Unsupervised learning model for an affixational language: In contrast to English tag sets, whose sizes range from 48 to 195, the number of tags for our Hebrew corpus, based on all combinations of the morphological attributes, is about 3,600 (about 70 times larger). The large size of such a tag set is problematic in terms of data sparseness. Each morphological combination appears rarely, so more samples are required in order to learn the probabilistic model.
In order to avoid this problem, we introduce a word-based model with only about 300 states, reducing the size of the probabilistic model by close to 90%. Applying this model instead of the traditional token-based model improves accuracy, with an error reduction of over 13%.

• Initial conditions: Good initial conditions are essential for high-quality unsupervised learning of probabilistic models. We investigate two methods for setting initial conditions: morpho-lexical approximations and syntagmatic conditions, showing that good initial conditions improve model accuracy, with an error reduction of over 15%.

• Unknown words analysis: The term unknowns denotes tokens that cannot be analyzed by the morphological analyzer. Unknowns account for 7.5% of the tokens in our corpus. We investigate the characteristics of unknowns in Hebrew and methods for resolving them, yielding an error reduction of 23% relative to the baseline.

• Evaluation: The system was evaluated according to two criteria: (1) the accuracy of the disambiguation process, both for full morphological analysis and for word segmentation and POS tagging; (2) the contribution of the disambiguator to other applications which use the tagged text. The disambiguator was tested on a wide-coverage test corpus of 90K tokens. We report an accuracy of 90% for full morphological disambiguation, and 93% for word segmentation and POS tagging. In addition, we implement two applications to estimate the impact of the morphological data given by the disambiguator: a Noun-phrase Chunker and a Named-entity Identifier. Both applications have shown improvement due to the improved morphological information provided by our disambiguator.

• Construction of a high-quality large-scale annotated corpus: We developed a tagged corpus of about 200K tokens. We developed a detailed