Hebrew Morphological Disambiguation: An Unsupervised Stochastic Word-based Approach
Dissertation submitted in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY
by
Menahem (Meni) Adler
Submitted to the Senate of Ben-Gurion University of the Negev
September 2007 Beer-Sheva
This work was carried out under the supervision of Dr. Michael Elhadad
In the Department of Computer Science
Faculty: Natural Sciences
To Ora
In loving memory of my father - Baruch
Contents
Abstract
List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 Starting Points
  1.3 Implementation
  1.4 Contributions
  1.5 Guide to the Rest of the Dissertation
2 Background
  2.1 Morphological Models
  2.2 Morpheme
  2.3 Lexeme
    2.3.1 Definition
    2.3.2 Derivation
  2.4 Hebrew Word Definition
    2.4.1 General Overview
    2.4.2 Definite Article
    2.4.3 Formative Letters
    2.4.4 Pronoun Suffix
    2.4.5 Notation
  2.5 Hebrew Lexical Categories
  2.6 Hebrew Inflectional Properties
  2.7 Morphological Analyzer
  2.8 Morphological Disambiguator
    2.8.1 Motivation
    2.8.2 Disambiguation Methods
    2.8.3 Disambiguation in Various Languages
3 Objectives
4 Resources
  4.1 Corpora
  4.2 Morphological Analyzer
5 Tagset Design
  5.1 Methodology
  5.2 Modality
    5.2.1 Modality in Hebrew
    5.2.2 Proposed Modal Guidelines
    5.2.3 The Importance of Tagging Modals
  5.3 Beinoni
    5.3.1 Beinoni in Hebrew
    5.3.2 A Lexical Category for Beinoni
    5.3.3 Conclusion
  5.4 Adverbs
    5.4.1 Adverbs in Modern Hebrew
    5.4.2 Distinguishing Criteria
    5.4.3 Summary
  5.5 Prepositions
    5.5.1 Prepositions in Modern Hebrew
    5.5.2 Distinguishing Criteria
    5.5.3 Summary
  5.6 Conclusion
6 Computational Model
  6.1 Token-Based HMM
  6.2 Word-Based HMM
  6.3 Learning and Searching Algorithms for Uncertain Output Observation
    6.3.1 Output Representation
    6.3.2 Parameter Estimation
    6.3.3 Searching for the Best-state Sequence
    6.3.4 Similar Work
  6.4 Conclusions
7 Unknown Words and Analyses
  7.1 Motivation
  7.2 Strategy
  7.3 Previous Work on Unknown Words Tagging
  7.4 Neologisms Detection
    7.4.1 Method
    7.4.2 Evaluation
  7.5 Proper Nouns Identification
  7.6 Results
8 Evaluation
  8.1 Influence of Initial Conditions
  8.2 Structure of the Stochastic Model: Dependency Scheme
  8.3 Model Order
  8.4 Influence of the Training Set Size
  8.5 Token-oriented Model vs. Word-oriented Model
  8.6
9 Applications
  9.1 Noun-Phrase Chunking
    9.1.1 Previous Work
    9.1.2 Hebrew Simple NPs
    9.1.3 Evaluation
  9.2 Named Entities Recognition
    9.2.1 Models
    9.2.2 Evaluation
10 Contributions and Future Work
  10.1 Contributions
  10.2 Future Work
Appendices
A Hebrew Morphology
  A.1 Verb Inflections
  A.2 Noun Inflections
  A.3 Short Formative Words
  A.4 Pronominal Pronoun Suffixes
  A.5 Inflection and Affixation according to Lexical Category
B Selected Tagging Guidelines
  B.1 Beinoni
    B.1.1 Nouns vs. Verbs
    B.1.2 Adjectives vs. Verbs
    B.1.3 Adjectives vs. Nouns
Bibliography
Abstract
The subject of this book is not exactly the void, but rather what there is around it, or inside it.
(Georges Perec, Espèces d’espaces)
Morphology is the field of linguistic theory that deals with the internal structure of words. The task of a morphological analyzer is to produce all possible analyses for a given word: the lexeme, prefixes, and suffixes it includes and, for each of these, its part of speech and the list of its inflections. The task of a morphological disambiguator is to pick the most likely analysis among those produced by an analyzer. To select the ‘most likely analysis’, the context of each word must be taken into account. This work deals with morphological disambiguation of words in Modern Hebrew text.
Morphological disambiguation is an essential component in many natural language processing (NLP) applications. An information retrieval system, for instance, should find the correct part of speech of the Hebrew token ḥoze, in order to index it in terms of contract (noun) or watch (verb). A text-to-speech system should determine the gender of the Hebrew token ’išah: feminine (a woman) or masculine (her husband). A machine translation system must determine the tense of the Hebrew verb sprw: imperative (count!) or past (they cut hair). In addition, morphological analysis can be used as a knowledge base input for other applications. A syntactic parser makes use of the lexical category of the tokens in the text. A noun phrase chunker may be interested in the construct state property of the words to be chunked. Word prediction can be more accurate if the morphological attributes of the previous words are taken into account.
In this work, we investigate unsupervised methods for morphological disambiguation, and present a disambiguation system for Hebrew. The main contributions of this work are:
• Analysis system for Hebrew: We have implemented a complete analysis system for Hebrew that combines all the algorithms and models described in this work. Given a Hebrew text, the system assigns a full set of morphological features to each word, extracts noun phrases, and recognizes entity names (persons, locations, organizations, and temporal and number expressions). A fully operational version of the system is available online at: http://www.cs.bgu.ac.il/~nlpproj/demo. The system is implemented in Java and analyzes about 1,000 words per second on a standard 2007 PC (1GB of RAM).
• Unsupervised learning model for an affixational language: In contrast to English tag sets, whose sizes range from 48 to 195, the number of tags for our Hebrew corpus, based on all combinations of the morphological attributes, is about 3,600 (about 70 times larger). Such a large tag set is problematic in terms of data sparseness: each morphological combination appears rarely, so more samples are required to learn the probabilistic model. To avoid this problem, we introduce a word-based model with only about 300 states, reducing the size of the probabilistic model by close to 90%. Compared with the traditional token-based model, this model improves accuracy, with an error reduction of over 13%.
• Initial conditions: Initial conditions are essential for high-quality unsupervised learning of probabilistic models. We investigate two methods for setting initial conditions, morpho-lexical approximations and syntagmatic conditions, and show that good initial conditions improve model accuracy, with an error reduction of over 15%.
• Unknown words analysis: The term unknowns denotes tokens that cannot be analyzed by the morphological analyzer. Unknowns account for 7.5% of the tokens in our corpus. We investigate the characteristics of unknowns in Hebrew and methods for resolving them, achieving an error reduction of 23% relative to the baseline.
• Evaluation: The system was evaluated according to two criteria: (1) the accuracy of the disambiguation process, both for full morphological analysis and for word segmentation and POS tagging; and (2) the contribution of the disambiguator to other applications that use the tagged text. The disambiguator was tested on a wide-coverage test corpus of 90K tokens. We report an accuracy of 90% for full morphological disambiguation and 93% for word segmentation and POS tagging. In addition, we implemented two applications to estimate the impact of the morphological data given by the disambiguator: a noun-phrase chunker and a named-entity identifier. Both applications improved thanks to the morphological information provided by our disambiguator.
• Construction of a high-quality large-scale annotated corpus: We developed a tagged corpus of about 200K tokens. We developed a detailed set of tagging guidelines over a period of 3 years to make sure human taggers reach full agreement. Each article in our corpus was manually tagged by four taggers and disagreements were systematically reviewed and resolved.
• Tagset for Hebrew: The main morphological property of a word is its lexical category, that is, its part of speech. While annotating Hebrew text, we realized, to our surprise, that a complete list of parts of speech is not well established, and that dictionaries and automatic tools do not agree on the part-of-speech set for Hebrew. Our main conclusion is that the tagset and the tagging criteria used for a given language cannot be imported from another language, nor can they rely on existing dictionaries. Instead, they should be defined specifically over large-scale corpora of the given language, so that all words can be tagged with high agreement. In this work, we detail the method we applied to design a comprehensive tagset for Hebrew and report the remaining intrinsically difficult confusion cases.
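The error-reduction figures quoted in the contributions above (13%, 15%, and 23%) are most naturally read as relative reductions in error rate. The abstract does not spell out the formula, so the following is an assumption about the intended measure; the symbols $E_{\text{base}}$ and $E_{\text{new}}$ are illustrative and do not appear in the original text.

```latex
% Assumed definition of relative error reduction (not stated in the source).
% E_base: error rate of the reference model; E_new: error rate of the improved model.
\[
  \text{error reduction} \;=\; \frac{E_{\text{base}} - E_{\text{new}}}{E_{\text{base}}},
  \qquad E \;=\; 1 - \text{accuracy}.
\]
```

Under this reading, a purely hypothetical baseline error of 20% that drops to 17.4% would correspond to an error reduction of 2.6/20 = 13%, even though accuracy rises by only 2.6 percentage points.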
Keywords: Computational linguistics, Natural language processing, Morphology, Hebrew, Parts-of-speech tagging, Morphological analysis, Morphological disambiguation, Stochastic model, Unsupervised learning, Word-based representation, Tagset design.
List of Figures
1.1 Architecture of the system.
2.1 Disambiguation process schema.
2.2 Disambiguator types.
6.1 Markov process.
6.2 Markov process for output sequence: start drinking.
6.3 The search algorithm for a first-order token-based model.
6.4 The learning algorithm for a first-order token-based model.
6.5 Representation of the sentence: bclm hn‘im.
6.6 Vector representation of the first three time slots.
6.7 Representation of the sentence: hw’ ‘wrk dyn gdwl.
6.8 Representation of the sentence: nwsp lskr hrgl.
6.9 The learning algorithm for first-order word-based model.
6.10 The searching algorithm for first-order word-based model.
8.1 First order model – Dependency scheme 1.
8.2 Partial second order model – Dependency scheme 1.
8.3 Second order model – Dependency scheme 1.
8.4 First order model – Dependency scheme 2.
8.5 Partial second order model – Dependency scheme 2.
8.6 Second order model – Dependency scheme 2.
List of Tables
1.1 Possible analyses for the words bclm, hn‘ym
2.1 Word categorization according to Rosen’s four categorial dimensions.
2.2 Parts of speech sets of various computational analyzers for Hebrew.
2.3 Distribution of inflections/derivations for Turkish.
4.1 Statistics of the raw-text corpora used for morphological analysis.
4.2 Statistics of the annotated corpora used for morphological analysis.
4.3 POS distribution of the lexicon entries.
5.1 Parts of speech of selected modals in various dictionaries.
5.2 Suggested POS lists for selected participle forms in various dictionaries.
5.3 Morphological classification of participle forms.
5.4 Classification of Hebrew prepositions.
6.1 Model sizes.
6.2 State list.
7.1 Unknown token categories and distribution.
7.2 Unknowns POS distribution.
7.3 Evaluation of unknown token full morphological analysis.
7.4 Evaluation of unknown token POS tagging.
7.5 Common neologism formations.
8.1 Initial conditions – scheme 1, model 2-.
8.2 Dependency schemes – model 2-.
8.3 Model order – initial conditions
8.4 Training set size – model 2-, scheme 1.
8.5 Word model vs. Token model – scheme 1, model 2-, initial conditions
8.6 Confusion matrix.
9.1 Hebrew NP-chunking results.
9.2 Named Entity Recognition – The combined model results.
A.1 Verb inflections.
A.2 Possessive pronoun suffixes.
A.3 Short formative words.
A.4 Parts-of-speech inflections
Let us imagine a man whose wealth is equalled only by his indifference to what wealth generally allows, and whose desire would be, far more proudly, to grasp, to describe, to exhaust, not the whole of the world (a project that merely stating it is enough to ruin) but a constituted fragment of it: faced with the inextricable incoherence of the world, the point would be to carry through to the end a program, restricted no doubt, but whole, intact, irreducible. Bartlebooth, in other words, decided one day that his entire life would be organized around a single project whose arbitrary necessity would have no end other than itself. The idea came to him when he was twenty. At first it was a vague idea, a question taking shape (what to do?), an answer beginning to emerge: nothing. Money, power, art, women did not interest Bartlebooth. Nor did science, nor even gambling. At most neckties and horses or, if you prefer, imprecise but throbbing beneath these trivial illustrations (although thousands of people effectively order their lives around their neckties, and a far greater number still around their Sunday horses), a certain idea of perfection. It developed over the months and years that followed, articulated around three guiding principles:
The first was moral in nature: it would not be a feat or a record, neither a peak to climb nor a depth to reach. What Bartlebooth would do would be neither spectacular nor heroic; it would simply, discreetly, be a project, difficult certainly, but not unachievable, mastered from one end to the other, and which in return would govern, in every detail, the life of the one who devoted himself to it.
The second was logical in nature: excluding any recourse to chance, the enterprise would make time and space function as abstract coordinates on which identical events would be plotted with ineluctable recurrence, occurring inexorably in their place, on their date.
The third, finally, was aesthetic in nature: useless, its gratuitousness being the sole guarantee of its rigor, the project would destroy itself as it was carried out; its perfection would be circular: a succession of events which, linking one to the next, would cancel each other out: starting from nothing, Bartlebooth would return to nothing, through precise transformations of finite objects.
Thus a program took concrete shape, which can be stated succinctly as follows: For ten years, from 1925 to 1935, Bartlebooth would learn the art of watercolor. For twenty years, from 1935 to 1955, he would travel the world, painting, at the rate of one watercolor every two weeks, five hundred seascapes of identical format (65 x 50, or raisin) depicting seaports. Each time one of these seascapes was finished, it would be sent to a specialist craftsman (Gaspard Winckler), who would glue it to a thin wooden board and cut it into a jigsaw puzzle of seven hundred and fifty pieces. For twenty years, from 1955 to 1975, Bartlebooth, back in France, would reassemble, in order, the puzzles thus prepared, again at the rate of one puzzle every two weeks. As the puzzles were reassembled, the seascapes would be "retexturized" so that they could be detached from their backing, carried to the very place where, twenty years earlier, they had been painted, and dipped in a detergent solution from which nothing would emerge but a sheet of Whatman paper, intact and blank.
No trace would thus remain of this operation which would, for fifty years, have entirely occupied its author.
(Georges Perec, La Vie mode d’emploi)
Chapter 1
Introduction
At first, the art of the jigsaw puzzle seems a brief art, a thin art, contained entirely in a meager lesson of Gestalttheorie.