Hebrew Morphological Disambiguation: An Unsupervised Stochastic Word-based Approach

Dissertation submitted in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

by

Menahem (Meni) Adler

Submitted to the Senate of Ben-Gurion University of the Negev

September 2007
Beer-Sheva

This work was carried out under the supervision of Dr. Michael Elhadad

In the Department of Computer Science
Faculty: Natural Sciences

To Ora
In loving memory of my father - Baruch

Contents

Abstract

List of Figures

List of Tables

1 Introduction
  1.1 Motivation
  1.2 Starting Points
  1.3 Implementation
  1.4 Contributions
  1.5 Guide to the Rest of the Dissertation

2 Background
  2.1 Morphological Models
  2.2 Morpheme
  2.3 Lexeme
    2.3.1 Definition
    2.3.2 Derivation
  2.4 Hebrew Word Definition
    2.4.1 General Overview
    2.4.2 Definite Article
    2.4.3 Formative Letters
    2.4.4 Pronoun Suffix
    2.4.5 Notation
  2.5 Hebrew Lexical Categories
  2.6 Hebrew Inflectional Properties
  2.7 Morphological Analyzer
  2.8 Morphological Disambiguator
    2.8.1 Motivation
    2.8.2 Disambiguation Methods
    2.8.3 Disambiguation in Various Languages

3 Objectives

4 Resources
  4.1 Corpora
  4.2 Morphological Analyzer

5 Tagset Design
  5.1 Methodology
  5.2 Modality
    5.2.1 Modality in Hebrew
    5.2.2 Proposed Modal Guidelines
    5.2.3 The Importance of Tagging Modals
  5.3 Beinoni
    5.3.1 Beinoni in Hebrew
    5.3.2 A Lexical Category for Beinoni
    5.3.3 Conclusion
  5.4 Adverbs
    5.4.1 Adverbs in Modern Hebrew
    5.4.2 Distinguishing Criteria
    5.4.3 Summary
  5.5 Prepositions
    5.5.1 Prepositions in Modern Hebrew
    5.5.2 Distinguishing Criteria
    5.5.3 Summary
  5.6 Conclusion

6 Computational Model
  6.1 Token-Based HMM
  6.2 Word-Based HMM
  6.3 Learning and Searching Algorithms for Uncertain Output Observation
    6.3.1 Output Representation
    6.3.2 Parameter Estimation
    6.3.3 Searching for the Best-state Sequence
    6.3.4 Similar Work
  6.4 Conclusions

7 Unknown Words and Analyses
  7.1 Motivation
  7.2 Strategy
  7.3 Previous Work on Unknown Words Tagging
  7.4 Neologisms Detection
    7.4.1 Method
    7.4.2 Evaluation
  7.5 Proper Nouns Identification
  7.6 Results

8 Evaluation
  8.1 Influence of Initial Conditions
  8.2 Structure of the Stochastic Model: Dependency Scheme
  8.3 Model Order
  8.4 Influence of the Training Set Size
  8.5 Token-oriented Model vs. Word-oriented Model
  8.6 ......

9 Applications
  9.1 Noun-Phrase Chunking
    9.1.1 Previous Work
    9.1.2 Hebrew Simple NPs
    9.1.3 Evaluation
  9.2 Named Entities Recognition
    9.2.1 Models
    9.2.2 Evaluation

10 Contributions and Future Work
  10.1 Contributions
  10.2 Future Work

Appendices

A Hebrew Morphology
  A.1 Verb Inflections
  A.2 Noun Inflections
  A.3 Short Formative Words
  A.4 Pronominal Pronoun Suffixes
  A.5 Inflection and Affixation according to Lexical Category

B Selected Tagging Guidelines
  B.1 Beinoni
    B.1.1 Nouns vs. Verbs
    B.1.2 Adjectives vs. Verbs
    B.1.3 Adjectives vs. Nouns

Bibliography

Abstract

The subject of this book is not exactly the void; it would rather be what there is around it, or inside it.

(Georges Perec, Espèces d'espaces)

Morphology is the field of linguistic theory that deals with the internal structure of words. The task of a morphological analyzer is to produce all possible analyses for a given word – what lexeme, prefix, and suffix it includes – and, for each of these, to provide its part of speech and the list of its inflections. The task of a morphological disambiguator is to pick the most likely analysis among those produced by an analyzer. In order to select the 'most likely analysis', the context of each word should be taken into account. This work deals with morphological disambiguation of words in Modern Hebrew text.

Morphological disambiguation is an essential component in many natural language processing (NLP) applications. An information retrieval system, for instance, should find the correct part of speech of the Hebrew token ḥoze, in order to index it in terms of contract (noun) or watch (verb). A text-to-speech system should determine the gender of the Hebrew token 'išah: feminine (a woman) or masculine (her husband). A machine translation system must determine the tense of the Hebrew verb sprw: imperative (count!) or past (they cut hair). In addition, morphological analysis can be used as a knowledge base input for other applications. A syntactic parser makes use of the lexical category of the tokens in the text. A noun phrase chunker may be interested in the construct state property of the words to be chunked. Word prediction can be more accurate if the morphological attributes of the previous words are taken into account.

In this work, we investigate unsupervised methods for morphological disambiguation, and present a disambiguation system for Hebrew. The main contributions of this work are:

• Analysis system for Hebrew: We have implemented a complete analysis system for Hebrew that combines all the algorithms and models described in this work. Given a Hebrew text, the system assigns a full set of morphological features for each word, extracts noun phrases, and recognizes entity names (persons, locations, organizations, temporal and number expressions). A fully operating version of the system is available online at: http://www.cs.bgu.ac.il/∼nlpproj/demo. The system is implemented in Java, and operates at a rate of about 1,000 words analyzed per second on a standard 2007 PC (1GB of RAM).

• Unsupervised learning model for an affixational language: In contrast to English tagsets, whose sizes range from 48 to 195, the number of tags for our Hebrew corpus, based on all combinations of the morphological attributes, is about 3,600 (about 70 times larger). The large size of such a tagset is problematic in terms of data sparseness. Each morphological combination appears rarely, and more samples are required in order to learn the probabilistic model. In order to avoid this problem, we introduce a word-based model, including only about 300 states, reducing the size of the probabilistic model by close to 90%. The application of this model, as opposed to the traditional token-based model, improves model accuracy, with over 13% error reduction.

• Initial conditions: Initial conditions are essential for high-quality unsupervised learning of probabilistic models. We investigate two methods for initial conditions: morpho-lexical approximations and syntagmatic conditions, showing that good initial conditions improve model accuracy, with over 15% error reduction.

• Unknown words analysis: The term unknowns denotes tokens that cannot be analyzed by the morphological analyzer. Unknowns account for 7.5% of the tokens in our corpus. We investigate the characteristics of unknowns in Hebrew and methods for their resolution, contributing an error reduction of 23% as opposed to the baseline.

• Evaluation: The system was evaluated according to two criteria: (1) the accuracy of the disambiguation process, for full morphological analysis and for word segmentation and POS tagging; (2) the contribution of the disambiguator to other applications which use the tagged text. The disambiguator was tested on a wide-coverage test corpus of 90K tokens. We report an accuracy of 90% for full morphological disambiguation, and 93% for word segmentation and POS tagging. In addition, we implement two applications to estimate the impact of the morphological data given by the disambiguator: a Noun-phrase Chunker and a Named-entity Identifier. Both applications have shown improvement due to the improved morphological information provided by our disambiguator.

• Construction of a high-quality large-scale annotated corpus: We developed a tagged corpus of about 200K tokens. We developed a detailed set of tagging guidelines over a period of 3 years to make sure human taggers reach full agreement. Each article in our corpus was manually tagged by four taggers and disagreements were systematically reviewed and resolved.

• Tagset for Hebrew: The main morphological property of words is their lexical category – their part of speech. While working on the annotation of Hebrew text, we were surprised to realize that a complete list of parts of speech is not well established, and that there is no agreement, among dictionaries and automatic tools, on the part-of-speech set for Hebrew. Our main conclusion is that the tagset and the tagging criteria used for a given language cannot be imported from another language, nor rely on existing dictionaries. Instead, they should be specifically defined over large-scale corpora of the given language, in order to tag all words with high agreement. In this work, we detail the method we applied to design a comprehensive tagset for Hebrew and report the remaining intrinsically difficult confusion cases.

Keywords Computational linguistics, Natural language processing, Morphol- ogy, Hebrew, Parts-of-speech tagging, Morphological analysis, Morphological dis- ambiguation, Stochastic model, Unsupervised learning, Word-based representa- tion, Tagset design.

List of Figures

1.1 Architecture of the system

2.1 Disambiguation process schema
2.2 Disambiguator types

6.1 Markov process
6.2 Markov process for output sequence: start drinking
6.3 The search algorithm for a first-order token-based model
6.4 The learning algorithm for a first-order token-based model
6.5 Representation of the sentence: bclm hn‘im
6.6 Vector representation of the first three time slots
6.7 Representation of the sentence: hw’ ‘wrk dyn gdwl
6.8 Representation of the sentence: nwsp lskr hrgl
6.9 The learning algorithm for a first-order word-based model
6.10 The searching algorithm for a first-order word-based model

8.1 First order model – Dependency scheme 1
8.2 Partial second order model – Dependency scheme 1
8.3 Second order model – Dependency scheme 1
8.4 First order model – Dependency scheme 2
8.5 Partial second order model – Dependency scheme 2
8.6 Second order model – Dependency scheme 2

List of Tables

1.1 Possible analyses for the words bclm, hn‘ym

2.1 Word categorization according to Rosen’s four categorial dimensions
2.2 Parts of speech sets of various computational analyzers for Hebrew
2.3 Distribution of inflections/derivations for Turkish

4.1 Statistics of the raw-text corpora used for morphological analysis
4.2 Statistics of the annotated corpora used for morphological analysis
4.3 POS distribution of the lexicon entries

5.1 Parts of speech of selected modals in various dictionaries
5.2 Suggested POS lists for selected participle forms in various dictionaries
5.3 Morphological classification of participle forms
5.4 Classification of Hebrew prepositions

6.1 Model sizes
6.2 State list

7.1 Unknown token categories and distribution
7.2 Unknowns POS distribution
7.3 Evaluation of unknown token full morphological analysis
7.4 Evaluation of unknown token POS tagging
7.5 Common neologism formations

8.1 Initial conditions – scheme 1, model 2-
8.2 Dependency schemes – model 2-
8.3 Model order – initial conditions
8.4 Training set size – model 2-, scheme 1
8.5 Word model vs. Token model – scheme 1, model 2-, initial conditions
8.6 Confusion matrix

9.1 Hebrew NP-chunking results
9.2 Named Entity Recognition – The combined model results

A.1 Verb inflections
A.2 Possessive pronoun suffixes
A.3 Short formative words
A.4 Parts-of-speech inflections

Let us imagine a man whose wealth was equalled only by his indifference to what wealth generally makes possible, and whose desire, far more proudly, was to seize, to describe, to exhaust, not the totality of the world – a project whose mere statement is enough to ruin it – but one constituted fragment of it: faced with the inextricable incoherence of the world, the aim would then be to carry out to its end a program, restricted no doubt, but whole, intact, irreducible. Bartlebooth, in other words, decided one day that his entire life would be organized around a single project whose arbitrary necessity would have no end other than itself. The idea came to him when he was twenty. At first it was a vague idea, a question that posed itself – what to do? – and an answer that sketched itself: nothing. Money, power, art, women did not interest Bartlebooth. Nor science, nor even gambling. At most neckties and horses or, if you prefer, imprecise but palpitating beneath these futile illustrations (even though thousands of people do efficiently organize their lives around their neckties, and a far greater number still around their Sunday horses), a certain idea of perfection. It developed in the months, in the years that followed, articulated around three guiding principles:

The first was of a moral order: it would not be an exploit or a record, neither a peak to climb nor a depth to reach. What Bartlebooth would do would be neither spectacular nor heroic; it would simply, discreetly, be a project, a difficult one certainly, but not an unrealizable one, mastered from one end to the other and which, in return, would govern, in all its details, the life of the man devoting himself to it.

The second was of a logical order: excluding any recourse to chance, the enterprise would make time and space function as abstract coordinates in which identical events, occurring inexorably in their place, on their date, would be inscribed with an ineluctable recurrence.

The third, finally, was of an aesthetic order: useless, its gratuitousness being the sole guarantee of its rigor, the project would destroy itself as it was accomplished; its perfection would be circular: a succession of events which, as they linked together, would cancel each other out: starting from nothing, Bartlebooth would return to nothing, through precise transformations of finite objects.

Thus a program took concrete shape, which can be stated succinctly as follows: for ten years, from 1925 to 1935, Bartlebooth would learn the art of the watercolor. For twenty years, from 1935 to 1955, he would travel the world, painting, at the rate of one watercolor every fortnight, five hundred seascapes of identical format (65 x 50, or raisin) depicting seaports. Each time one of these seascapes was finished, it would be sent to a specialized craftsman (Gaspard Winckler), who would glue it onto a thin wooden board and cut it into a jigsaw puzzle of seven hundred and fifty pieces. For twenty years, from 1955 to 1975, Bartlebooth, back in France, would reassemble, in order, the puzzles thus prepared, at the rate, again, of one puzzle every fortnight. As the puzzles were reassembled, the seascapes would be "retexturized" so that they could be detached from their backing, carried to the very place where – twenty years before – they had been painted, and dipped in a detergent solution from which nothing would emerge but a sheet of Whatman paper, intact and blank.

No trace, thus, would remain of this operation, which would, for fifty years, have entirely occupied its author.

(Georges Perec, La Vie mode d'emploi)

Chapter 1

Introduction

At the start, the art of the jigsaw puzzle seems a brief art, a slight art, entirely contained within a meager teaching of the Gestalttheorie.

Words in a language commonly have more than one ‘reading’. Consider, for

instance, the Hebrew word clm. From a lexical point of view, this word can be derived from four different lexemes: (1) calam, (2) celem, (3) cilem, (4) culam. These lexemes are classified into two lexical categories or parts of speech:

• Noun: calam, celem.

• Verb: cilem, culam.

Semantically, one can identify several senses for each lexeme. The lexeme celem, for instance, can be interpreted either as an image or as an idol. From the morphological point of view, the form of the word can be interpreted in various ways, through its inflection pattern:

• For nouns, inflection according to gender, number, and status:

– masculine.singular.absolute: calam, celem.

¹ Transcription according to [93].

– masculine.singular.construct: calam, celem.

• For verbs, inflection according to gender, number, person, and tense:

– masculine.singular.third.past: cilem, culam.

In Hebrew, nouns, adjectives, verbs, prepositions, and adverbs can be agglutinated with a pronominal pronoun suffix. This process introduces another dimension of ambiguity. The same word clm can be derived from two possible lexemes: the noun cel and the verb calah. The lexeme cel itself has several senses: shadow, shade, iota, and trace. Morphologically, two additional analyses that use a pronoun suffix are possible:

• As a noun with a possessive pronoun suffix, inflected by gender, number, and person:

– masculine.singular.construct noun + masculine.plural.third pronoun: cilam.

• As a verb with an accusative possessive pronoun suffix, inflected by gender, number, and person:

– masculine.singular.third verb + masculine.plural.third pronoun: calam.

These additional analyses cause ambiguity in the word segmentation:

• clm: calam, celem, cilem, culam.

• cl-m: cilam - hacel šelahem, calam - cilem ’otam.

The word clm can be attached to a sequence of formative letters, m, š, w, k, l, b, and to the definite article, h, as a prefix. This prefixing mechanism introduces even more ambiguity in the analysis of the word. As a result, the word bclm, for instance, can be interpreted in seven possible morphological analyses, as shown in Table 1.1. The task of a morphological analyzer is to produce all possible analyses for a given word, regardless of their sense. The task of a morphological disambiguator

is to pick the most likely analysis among those produced by an analyzer. In order to select the 'most likely analysis', the context of each word should be taken into account. For example, given the phrase bclm hn‘ym, we can conclude that the most likely analysis for the word bclm is bcilam (under their shades).

Word    Segmentation    Analysis                      Translation
bclm    bclm            pn                            becelem (name of an association)
        bclm            vb.inf                        bcalem (while taking a picture)
        bcl-m           nn.masc.sing.cons+pro         bcalam (their onion)
        b-cl-m          prep+nn.masc.sing.cons+suf    bcilam (under their shades)
        b-clm           prep+nn.masc.sing.abs         bcalam (in a photographer)
        b-clm           prep+nn.masc.sing.cons        bcalam (in a photographer)
        b-clm           prep+def+nn.masc.sing.abs     bacalam (in the photographer)
hn‘ym   h-n‘im          sub+vb.pl.present             hana‘im (that are moving)
        h-n‘im          def+adj.masc.sing.abs         hana‘im (the lovely)
        hn‘im           vb.masc.sing.third.past       him‘im (made pleasant)

Table 1.1: Possible analyses for the words bclm, hn‘ym.

This work deals with morphological disambiguation of words in Modern Hebrew text. We are interested in developing a system that takes as input free text in Hebrew, and for each word selects the most likely morphological analysis – what lexeme, prefix, and suffix it includes and, for each of these, provides its part of speech and the list of its inflections.
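To make the ambiguity concrete, the analyses of Table 1.1 can be wrapped in a toy analyzer. This is an illustrative sketch only: the candidate analyses are hard-coded from the table, whereas a real analyzer such as the KC analyzer derives them from a lexicon and word-formation rules.

```python
# Toy morphological "analyzer": maps a surface token to the candidate
# (segmentation, analysis) pairs of Table 1.1. Hard-coded for illustration.
ANALYSES = {
    "bclm": [
        ("bclm",   "pn"),                          # becelem (name of an association)
        ("bclm",   "vb.inf"),                      # bcalem (while taking a picture)
        ("bcl-m",  "nn.masc.sing.cons+pro"),       # bcalam (their onion)
        ("b-cl-m", "prep+nn.masc.sing.cons+suf"),  # bcilam (under their shades)
        ("b-clm",  "prep+nn.masc.sing.abs"),       # bcalam (in a photographer)
        ("b-clm",  "prep+nn.masc.sing.cons"),      # bcalam (in a photographer)
        ("b-clm",  "prep+def+nn.masc.sing.abs"),   # bacalam (in the photographer)
    ],
    "hn‘ym": [
        ("h-n‘im", "sub+vb.pl.present"),           # hana‘im (that are moving)
        ("h-n‘im", "def+adj.masc.sing.abs"),       # hana‘im (the lovely)
        ("hn‘im",  "vb.masc.sing.third.past"),     # him‘im (made pleasant)
    ],
}

def analyze(token):
    """Return all candidate analyses for a token, regardless of sense."""
    return ANALYSES.get(token, [])

print(len(analyze("bclm")))   # 7 analyses, as in Table 1.1
```

A disambiguator then has to select one of these candidates per token, using the context provided by the neighboring words.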

1.1 Motivation

Morphological disambiguation is an essential component in many natural language processing (NLP) applications. An information retrieval system, for instance, should find the correct part of speech of the Hebrew token ḥoze, in order to index it in terms of contract (noun) or watch (verb). A text-to-speech system should determine the gender of the Hebrew token 'išah: feminine (a woman) or masculine (her husband). A machine translation system must find out whether the tense of the Hebrew verb sprw is imperative (count!) or past (they cut hair). In addition, morphological analysis can be used as a knowledge base input for other applications. A syntactic parser makes use of the lexical category of the tokens in the text. A noun phrase chunker may be interested in the construct state property of the words to be chunked. Word prediction can be more accurate if the morphological attributes of previous words are taken into account.

1.2 Starting Points

In the case of English, because morphology is quite simple, morphological disambiguation is generally covered under the task of part-of-speech tagging. Various lists of parts of speech have been used in various tagging projects for English, where the sizes of these tagsets range from 48 to 195. The reduced set leaves out information that can be recovered from the identity of the lexical item. Several state-of-the-art supervised taggers have been developed in the past two decades – stochastic taggers (hidden Markov models [71], decision trees [105], neural networks [16], memory-based learning [32], and maximum entropy [99]) – and the transformation-based tagger of Brill [23]. The accuracy of these taggers is around 96%–97%.

In Hebrew, several computational analyzers were developed in the past decade: HMA [26], Segal [108], HAMSA [124], Rav Milim [29], and MILA.² In addition, several disambiguators were built over these analyzers, such as Levinger [73], HMD [26], Segal [108], and Bar-Haim [11]. The best reported accuracy for POS tagging and word segmentation is 90.8% (Bar-Haim et al. [11]). Recently, Shacham and Wintner [111] reported an accuracy of 91.44% for full morphological disambiguation.

A large-scale Hebrew annotated corpus, which can be used for testing the system, is not yet available. Moreover, a complete list of parts of speech is not well established, and there is no agreement, among dictionaries and automatic tools, on the parts-of-speech set for Hebrew. The results discussed above should, therefore, be taken as rough approximations of the real performance of the systems until they can be re-evaluated on such a large-scale corpus with a standard tagset.

² http://yeda.cs.technion.ac.il:8088/XMLMorphologicalAnalyzer/XMLOutputAnalyzer.html

The disambiguators mentioned above rely on annotated corpora. We are interested in unsupervised learning methods, which can handle the dynamic nature of Modern Hebrew, as it evolves over time.

1.3 Implementation

The system we implemented gets text in Hebrew as input and provides one morphological analysis for each word in the text. The system is based on the official analyzer of MILA – the Knowledge Center for Processing Hebrew (hereinafter the KC analyzer) – and the output follows its Hebrew corpus schema.³

The architecture of the system is described in Figure 1.1. The given text is processed through a pipeline of tokenization, morphological analysis, and disambiguation. As can be seen, unknown words are processed in two stages: (1) unknown token analysis according to a pattern+letters model and (2) post-processing for proper name identification. In addition, the system includes modules for noun-phrase chunking and named-entity recognition, based on Goldberg et al. [50] and Ben-Mordechai [15], as applications used to measure the impact of morphological disambiguation on higher-level tasks. Most of the modules are implemented in Java; some of the applicative modules are implemented in Python.
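As a rough sketch of the data flow in this pipeline, the stages below consume each other's output in the order of Figure 1.1. The stage bodies and the toy lexicon are hypothetical stand-ins, not the system's actual code: the real system exchanges XML documents between stages and uses the KC analyzer's lexicon.

```python
# Hypothetical pipeline sketch: tokenize -> analyze -> handle unknowns ->
# disambiguate. Only the data flow mirrors the system; the logic is a stand-in.
def tokenize(text):
    return text.split()                     # plain text -> tokens

def analyze(tokens, lexicon):
    # known tokens get their candidate analyses; unknown tokens get None
    return [(t, lexicon.get(t)) for t in tokens]

def analyze_unknowns(analyzed):
    # stand-in for the pattern+letters model: give unknowns a guessed tag list
    return [(t, a if a is not None else ["nn", "pn"]) for t, a in analyzed]

def disambiguate(analyzed):
    # stand-in for the stochastic disambiguator: pick the first candidate
    return [(t, a[0]) for t, a in analyzed]

lexicon = {"bclm": ["prep+nn", "pn"], "hn‘ym": ["def+adj", "vb"]}
tagged = disambiguate(analyze_unknowns(analyze(tokenize("bclm hn‘ym xyz"), lexicon)))
print(tagged)  # 'xyz' is unknown and received the fallback guess
```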

1.4 Contributions

The main contributions of this work are:

Analysis system for Hebrew We have implemented a complete analysis system for Hebrew that combines all the algorithms and models described in this work. Given a Hebrew text, the system assigns a full set of morphological features for each word, extracts noun phrases, and recognizes entity names (persons, locations, organizations, temporal and number expressions). A fully operating version of the system is available online at: http://www.cs.bgu.ac.il/∼nlpproj/demo. The system is implemented in Java, and operates at a rate of about 1,000 words analyzed per second on a standard 2007 PC (1GB of RAM).

³ http://mila.cs.technion.ac.il/hebrew/resources/standards/hebrew corpus

[Figure 1.1 depicts the pipeline: plain text → Tokenizer → tokenized text (XML) → Analyzer (with Lexicon) → analyzed text (XML) → Unknown Tokens Analyzer (letters model) → analyzed text with no unknowns (XML) → Disambiguator (probabilistic model) → disambiguated text (XML) → PN Identifier (SVM model) → Named-entity Recognizer (ME model) and Noun-phrase Chunker (SVM model).]

Figure 1.1: Architecture of the system.

Unsupervised learning model for an affixational language In contrast to English tagsets, whose sizes range from 48 to 195, the number of tags for our Hebrew corpus, based on all combinations of morphological attributes, is about 3,600 (about 70 times larger). The large size of such a tagset is problematic in terms of data sparseness. Each morphological combination appears rarely, and more samples are required in order to learn the probabilistic model.

In order to avoid this problem, we introduce a word-based model, including only about 300 states, and where the size of the HMM matrices is reduced by close to 90%. We have defined a text encoding method for languages with affixational morphology, in which knowledge of word formation rules (which are quite restricted in Hebrew) helps in the disambiguation. We adapt the HMM algorithms for learning and searching this text representation in such a way that segmentation and tagging can be learned in parallel in one step.

The application of this model, as opposed to the traditional token-based model, improves model accuracy, with over 13% error reduction.
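The size argument can be illustrated with a toy attribute set (the attribute values below are simplified stand-ins, not the actual Hebrew tagset): a token-based tag is one full combination of attributes, so the tagset grows multiplicatively, while the word-based encoding emits prefixes and suffixes as words of their own, so its state space grows roughly additively.

```python
# Toy illustration of tagset explosion vs. a word-based state space.
from itertools import product

prefixes = ["-", "b", "h", "b+h"]   # simplified formative-prefix options
pos      = ["nn", "vb", "adj"]
gender   = ["masc", "fem"]
number   = ["sing", "pl"]
suffixes = ["-", "pro.3pl"]         # simplified pronoun-suffix options

# token-based: one tag per full combination of all attributes
token_tags = len(list(product(prefixes, pos, gender, number, suffixes)))

# word-based: prefixes and suffixes become separate words in the sequence,
# so their states are added rather than multiplied in
word_states = len(prefixes) + len(suffixes) + len(list(product(pos, gender, number)))

print(token_tags, word_states)  # 96 vs. 18 on this toy attribute set
```

On the real tagset, the same effect is what takes the model from about 3,600 token-level tags down to about 300 word-level states.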

Initial conditions Initial conditions are essential for high-quality unsupervised learning of HMM models. We investigated two methods for initial conditions: morpho-lexical approximations and syntagmatic conditions. Our main work was to adapt these methods to the comprehensive tagset we have designed for this work. We have shown that good initial conditions improve model accuracy, with over 15% error reduction.

Unknown words analysis The term unknowns denotes tokens that cannot be analyzed by the morphological analyzer. These tokens can be categorized

into two classes of missing information: unknown tokens, which are not recognized at all by the analyzer, and unknown analyses, where the set of analyses proposed by the analyzer does not contain the correct analysis for a given token.

We investigated the characteristics of unknowns in Hebrew and methods for resolving them. For the case of unknown tokens, we examined pattern-based and letter-based models. The combined letter+pattern model gives the best results in terms of coverage and accuracy. This model provides a distribution of possible tags for unknown tokens. Unknowns are then processed by the disambiguator according to the distribution the letter+pattern model provides, as if they had been present in the lexicon. In addition, we developed a post-processing model to recognize proper names within the output of the disambiguator.

Unknowns account for 7.5% of the tokens in our corpus. The pattern+letters model with the proper name classifier properly classifies 79% of these instances, thus contributing an error reduction of 23% as opposed to the baseline consisting of tagging unknowns with the most likely tag (Proper Name – which would only tag correctly one-third of the unknowns).
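The intuition behind the letter-based component can be sketched as follows. The toy training list and the first/last-letter features are illustrative assumptions, not the dissertation's actual letters model; what matters is that the output is a tag distribution the disambiguator can consume.

```python
# Minimal sketch: estimate a tag distribution for an unknown token from the
# tag counts of its first and last letters over known (word, tag) pairs.
from collections import Counter, defaultdict

known = [("hlk", "vb"), ("hr", "nn"), ("mhlk", "nn"), ("ktb", "vb"), ("mktb", "nn")]

first, last = defaultdict(Counter), defaultdict(Counter)
for word, tag in known:
    first[word[0]][tag] += 1
    last[word[-1]][tag] += 1

def tag_distribution(token):
    """Combine first- and last-letter evidence into a normalized distribution."""
    votes = first[token[0]] + last[token[-1]]
    total = sum(votes.values())
    return {t: c / total for t, c in votes.items()}

print(tag_distribution("mhr"))  # -> {'nn': 1.0}: all m- and -r evidence is nominal
```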

Evaluation The system was evaluated according to two criteria: (1) the accuracy of the disambiguation process, for full morphological analysis and for word segmentation and POS tagging; (2) the contribution of the disambiguator to other applications which use the tagged text.

The disambiguator was tested on a wide-coverage test corpus of 90K tokens. We report an accuracy of 90% for full morphological disambiguation, and 93% for word segmentation and POS tagging. As part of the evaluation, we compared several graphical dependency schemes, model orders, and different sizes of training data.

In addition, we implemented two applications in order to estimate the impact of the morphological data given by the disambiguator: a Noun-phrase Chunker and a Named-entity Identifier. Both applications have shown improvement due to the improved morphological information provided by our disambiguator.

Construction of a high-quality large-scale annotated corpus We developed a tagged corpus of about 200K tokens. The corpus is composed of articles of two daily newspapers - Ha’aretz and Arutz 7. We developed a detailed set of tagging guidelines over a period of 3 years to make sure human taggers reach full agreement. Each article in our corpus was manually tagged by four taggers and disagreements were systematically reviewed and resolved.

Tagset for Hebrew The main morphological property of words is their lexical category - their part of speech. The central parts of speech (verb, noun, and adjective) are part of the basic linguistic intuitions of all speakers. However, while working on the annotation of Hebrew text, we were surprised to realize that a complete list of parts of speech is not well established, and that there is no agreement, among dictionaries and automatic tools, on the part-of-speech set for Hebrew. Beyond verb, noun, and adjective, many other lexical units appear in text, and each raises potential questions as to what is meant by part of speech, what is the best way to label every unit in a document, and how to distinguish among the various labels.

The tags we suggest for Hebrew are: adjective, adverb, conjunction, copula, existential, interjection, interrogative, modal, negation, noun, numeral, pronoun, preposition, proper name, quantifier, title, verb, prefix.

Our main conclusion is that the tagset and the tagging criteria used for a given language cannot be imported from another language, nor rely on existing dictionaries. Instead, they should be specifically defined over large-scale corpora of the given language, in order to tag all words with high agreement. In this work, we have detailed the method we applied to design a comprehensive tagset for Hebrew and report the remaining intrinsically difficult confusion cases.

1.5 Guide to the Rest of the Dissertation

The rest of the dissertation is organized as follows. In chapter 2, we present an overview of Hebrew morphology with respect to the disambiguation process. We focus on the definition of words in Hebrew, the lexical categories, word formation rules, inflectional properties, the nature of the Hebrew lexicon, the role of morphological analyzers, and disambiguation methods.

The objectives of the work are listed in chapter 3, and the resources we developed and use are described in chapter 4.

Chapter 5 deals with the tagset design for Hebrew. The tagset design methodology we present is corpus-based and application-oriented. For each tag, the criteria and tests that determine whether a token should be assigned that tag come from corpus observations. Our main objective is to ensure consistency, coverage, and tagging agreement. The tag criteria are basically morpho-syntactic, with semantic considerations and practical deliberations. Each tag is measured by its impact on the disambiguation process, and on applications which make use of the morphological properties. We demonstrate the use of our methodology for the definition of four categories in Modern Hebrew which we found difficult to characterize: modals, participles, prepositions, and adverbs.

In chapter 6, we present a text encoding method for languages with affixational morphology in which knowledge of word formation rules helps in the disambiguation. We adapt the HMM algorithms for learning and searching this text representation in such a way that segmentation and tagging can be learned in parallel in one step.

In chapter 7, we address the problem of unknowns in Hebrew. The term unknowns denotes tokens that cannot be analyzed by the morphological analyzer. These tokens can be categorized into two classes of missing information: unknown tokens, which are not recognized at all by the analyzer, and unknown analyses, where the set of analyses proposed by the analyzer does not contain the correct analysis for a given token.
We investigate the characteristics of unknowns in Hebrew and present methods to handle such unavoidable lack of information.

In chapter 8, we evaluate the system we developed in terms of disambiguation accuracy for full morphological analysis, and for word segmentation and POS tagging. The system was tested on our text corpus, composed of 90K tokens. Overall, our best result for full morphological analysis is 90% accuracy, and for word segmentation and POS tagging 93% accuracy. As part of the evaluation, we investigated the contribution of initial conditions, the impact of the structure of the stochastic model (dependency relations), the order of the model (bigrams or trigrams), and the size of the training set (unsupervised). In addition, we compare the token-based model with the word-based model, and present a detailed error analysis.

In chapter 9, we measure the contribution of the disambiguation system to other applications which use the tagged text. We describe two applications which were implemented for this purpose: noun-phrase chunking and named-entity recognition. We show that the morphological features given by the disambiguator improve the performance of these systems.

We conclude in chapter 10 with a review of the contributions of this work, as well as suggested future work.

Chapter 2

Background

Every move the puzzle-solver makes, the puzzle-maker has made before him; every piece he picks up and picks up again, examines, caresses, every combination he tries and tries again, every groping, every intuition, every hope, every discouragement, have been decided, calculated, studied by the other.

– Georges Perec, La Vie mode d'emploi

This work deals with morphological disambiguation of words in Modern Hebrew text. We are interested in developing a system that takes as input free text in Hebrew and, for each word, selects the most likely morphological analysis – which lexeme, prefix, and suffix it includes – and, for each of these, provides its part of speech and the list of its inflections. In order to build such a system, we must clarify what we mean by ‘word’ in Hebrew and the possible values for the parts of speech, as well as the morphological features each category can exhibit.
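The per-word output described above can be pictured as a small record. The following sketch is illustrative only: the field names and the toy scoring function are our assumptions, not the system's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Analysis:
    """One candidate morphological analysis of a Hebrew token.
    All field names here are illustrative, not the system's schema."""
    lexeme: str                                   # base form
    pos: str                                      # part of speech of the lexeme
    prefixes: list = field(default_factory=list)  # formative-letter prefixes
    suffix: Optional[str] = None                  # pronominal suffix, if any
    features: dict = field(default_factory=dict)  # gender, number, person, ...

def disambiguate(candidates, score):
    """The disambiguator's role: pick the highest-scoring analysis."""
    return max(candidates, key=score)

# Toy example: two of the competing analyses of the token bclm.
a1 = Analysis("clm", "noun", prefixes=["b"], features={"gender": "m"})
a2 = Analysis("cl", "noun", prefixes=["b"], suffix="m")
best = disambiguate([a1, a2], score=lambda a: 0.7 if a.suffix is None else 0.3)
print(best.lexeme)  # clm
```

In the actual system the score comes from a stochastic model learned from a corpus, not from a hand-written lambda; the record structure is what matters here.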

In this chapter, we review basic morphology concepts and theory, with a fo- cus on Hebrew morphology. We highlight the characteristics, complexity, and challenges of morphology in Hebrew. We then discuss the morphological disam- biguation problem from a computational perspective as applied to Hebrew.

2.1 Morphological Models

Morphology is the field of linguistic theory that deals with the internal structure of words. It relates to phonetics, which is concerned with the typology of speech sounds, and to syntax, which covers the construction of sentences and phrases. Various linguistic frameworks assign different importance to the field. Structural linguistics considers phonology and morphology the fundamental components of grammar. Generative models, on the other hand, focus on syntax as the main element of the generative process. Following Hockett [62], one can identify three basic morphological models, based on three different units:

• The item-and-arrangement model investigates the ways morphemes are com- bined to form a word. The basic unit in the lexicon, according to this ap- proach, is the morpheme.

• The item-and-process model describes the process of producing one word type from another by applying derivation, inflection, and compounding rules. The basic units of this model are the lexemes.

• In the word-and-paradigm model, the basic unit is the word, which is ana- lyzed according to a set of inflection rules over a set of parameters (such as gender, number, and person).

As mentioned by Schwarzwald [106, volume 1, p. 20], according to Hockett, each of these three models covers some aspect of morphology: the item-and-arrangement model categorizes word components, the item-and-process model describes the dynamic nature of word formation over time, and the word-and-paradigm model supports input conditions and the application of rules.

In this work, we are mainly concerned with the item-and-arrangement approach: our task is to analyze ambiguous combinations of word units and derive an interpretation in terms of base form and morphological features. In the following sections, we review the definition of morphemes (the basic units that compose words), lexemes, and the composition rules that apply in Hebrew.

2.2 Morpheme

According to Bloomfield [20, p. 161], the term morpheme denotes a linguistic form which bears no partial phonetic-semantic resemblance to any other form. Gleason [47, p. 52] defines a morpheme as the smallest unit which is grammatically pertinent (he is aware of the circularity of this definition – grammar is the study of morphemes and their combinations). According to Hockett [63, p. 123], a morpheme is the smallest individual meaningful element in utterances of the language. We follow Aronoff [6, p. 79] and Schwarzwald [106, section 2.2.1] in defining a morpheme as a minimal symbol of some form which may have meaning, and which can be identified as a component within a word. The morpheme definition has, therefore, a grammatical rather than lexical meaning [106, volume 1, p. 76]. One can classify morphemes according to three parameters:

• Function: Root morphemes carry lexical meaning and typically belong to a lexical category. Derivational morphemes carry grammatical lexical information and serve to form new words. Derivational morphemes may change the lexical category of the word. Inflectional morphemes carry grammatical information in order to represent the grammatical categories of the inflected lexical categories.

• Degree of autonomy: Free morphemes can stand alone in a given clause, whereas bound morphemes must be attached to other elements.

• Reproducibility: Reproducible morphemes (most morphemes) are saved in the ‘mental’ lexicon and can be recalled into new combinations. Unique morphemes, generally bound and mostly dialectal, obsolete, or misanalyzed forms, are non-productive and exist in few combinations.

In addition to these parameters, Hebrew morphology also distinguishes between root and pattern morphemes. Goshen-Gottstein [52] defines root morphemes as consonant-only sequences, e.g., p.q.d, and pattern morphemes, mostly derivational, as combinations of vowels and dagesh (accent) only. In some patterns, one consonant can also be introduced.

The combination of a root and a pattern morpheme forms a base, e.g., paqad (ordered), piqed (commanded), hitpaqed (was counted), nipqad (was absent). A stem is a morphological segment consisting of a base and a set of derivational suffixes, e.g., nipqadut (absenteeism) [102, pp. 124–125]. In our work, we are directly concerned with the identification of stem and inflectional morphemes in Hebrew. The analyses our system produces do not include derivational morphemes: we do not aim to reconstruct the derivation process that leads to the existence of a word, but only the composition of a word from a base lexeme and bound inflectional morphemes. For example, we do not analyze destruction as the nominalization of destroy. However, in our work on the analysis of out-of-lexicon words, we do exploit the derivational and pattern structure of words to "guess" the possible analyses of unknown words.

2.3 Lexeme

In the process of analyzing a sentence, we first tokenize the sentence into a sequence of ‘words’. In this section, we attempt to clarify the definition of a word as a base unit that can be listed in a dictionary. The basic questions we address are: (1) what are the primitive units that should belong to the dictionary, and (2) how do such entities make their way into the lexicon over time – that is, what are the processes that can introduce new entries into a dictionary? Practically, since our system includes a lexicon, this section lists criteria that determine which elements should belong to a lexicon.

2.3.1 Definition

According to Aronoff and Fudeman [7, section 2.3.3], a lexeme is a word with a specific sound and a specific meaning. Its shape may vary according to the syntactic context. A lexeme distinguishes among phonologically/orthographically similar forms with different meanings, such as the three meanings of the word dog:

(1) a canine, (2) a hook device used for gripping heavy objects, (3) a verb meaning to follow persistently (see also [116, section 1.1]).

Zwicky [130] deals with the properties of word types and their mapping to lexemes. If we denote a syntactic atom1 by W, one can rank the sub-expressions of a language into three categories: clause rank, phrase rank, and W rank. WMIN denotes a W which does not contain any other W, whereas WMAX denotes a W which is composed of the maximal number of Ws (that is, WMAX is not contained in any larger W). The sub-expression saving bank location information, for instance, contains seven Ws. The whole expression {saving bank location information} is considered to be a WMAX, where each of the words {saving, bank, location, information} is a WMIN.

Zwicky argues against a one-to-one mapping between W and lexeme. In such a mapping, each W instantiates a single lexeme, and each lexeme is instantiated by a single W. He points out that parts of some lexemes may act as a W syntactically. Furthermore, some lexemes are morphologically indivisible, but correspond to a sequence of Ws. And finally, there are Ws which, as a whole, instantiate no lexeme. Instead, Zwicky suggests that all WMINs of an expression instantiate some lexeme, and that no WMIN simultaneously instantiates two distinct lexemes, with morphology overriding syntax in case of conflicts [126].2

For the purpose of building a concrete lexicon for a morphological analysis system, Zwicky’s approach seems hard to implement. Instead, we adopt the definition of Allon [4, pp. 17–18, 33], which identifies a lexeme with a stem. The concerns of Zwicky’s approach remain relevant, though, as far as handling multi-word expressions is concerned. We have kept the handling of such complex lexical entities for future work.

2.3.2 Derivation

Lexeme derivation is the study of the structure of complex stems, concerned with the rules for combining smaller building blocks into bigger units (for an overview of lexeme formation in English, see [21, p. 118]). As mentioned above, the study of derivation in our context is important to help us guess the possible analyses of words that are not found in the current version of the dictionary.

Allon [4, pp. 35–36] lists five different types of stems (or lexemes): (1) a primitive base with minimal meaning, (2) a base composed of a root and a pattern, (3) a complex base which combines two or more bases, (4) a base with affixation, and (5) a null base, such as words composed of a preposition and a pronoun (’eclo, lka, ‘imo). Schwarzwald [106, section 4.2] discusses the various methods of lexeme formation in Hebrew. This approach relies on Goshen-Gottstein [52] and Mayerthaler [80], as follows:

1See Di Sciullo and Williams [107, chapter 3].
2See also [116, chapter 3].

Words as being External words, from different sources, get into the lexicon according to the Hebrew phonological template system. The noun history, for instance, was introduced into the lexicon as hist.oryh (the suffix yh is used to denote nouns). The adjective hysterical, on the other hand, got into the Hebrew language as hist.ry (the suffix y distinguishes adjectives from nouns). Berman [17, p. 426], Ravid [100, pp. 322–323], and Ornan [94] include in this type of derivation conversion, or null derivation, i.e., categorical movement, which assigns a new syntactic category to an existing word (without being phonetically realized). The adverb ‘e´ser (great!), for instance, was derived from the numeral ‘e´ser (ten). Many nouns in Hebrew were derived from beinoni verbs, e.g., madrih. (a guide).

Composition Combination of a root and a pattern morpheme to generate a base. The combination of the root k.t.b and the pattern, for instance, gives the verb hitkateb (corresponded).

Linear derivation Different sub-types of linear derivation can be distinguished:

• Affixation

  – Suffixation of a bound morpheme to a stem:
    ∗ ’ay suffix: h. aˇsmal – h. aˇsmlay (electricity – electrician).
    ∗ an and wut suffixes: prds – pardsan – pardsanut (orchard – citrus grower – citrus growing).

  – Prefixation of free morphemes to a stem:
    ∗ Prefix word attached to nouns or adjectives: ’b – ’abt.ipus (prototype), ’ein – ’einswp (infinity), mad – madh. om (thermometer).
    ∗ Prepositional letters attached to nouns and adjectives:
      · Adverbs: b+mhirut – bimhirut (in a fast manner), h+yom – hayom (today), k+h+muban – kmuban (certainly).
      · Prepositions: l+yad – lyad (beside), b+tok – btok (inside).
      · Demonstrative pronouns: h+hu’ – hahu’ (that one).

• Compounds

  – Multi-word expression – composition of two or more independent words:
    ∗ Noun + noun: ma´sa’ wumatan (negotiation).
    ∗ Construct state: derec ’erec (courtesy).
    ∗ Noun + adjective: ‘ain yapah (with generosity).
    ∗ Noun + preposition + noun: pa‘am bpa‘am (from time to time).
    ∗ Verb + complement: ´sam leb (paid attention).

  – Compound word – concatenation of full stems or parts of stems into one word:
    ∗ ram+qol – ramqol (megaphone)
    ∗ kol+bow – kolbow (supermarket)
    ∗ midrakah+rh. ob – midrkob (pedestrian mall)
    ∗ rekeb+kebel – rakebel (cable railway)

  – Acronym – encoding of a phrase into one word:
    ∗ Pronounced by its own diction: sakinim kapot wumizlagot – sakwu”m (cutlery), din wh. eˇsbon – dwo”h. (report).
    ∗ Pronounced by its letters: brit hamo‘acot – brh”m (the Soviet Union).
    ∗ Pronounced by the words it encodes: ˇsin gimel – ˇs”g (entrance guard).

2.4 Hebrew Word Definition

Consider the Hebrew token bclm, which has, as shown earlier, seven possible morphological analyses. What are the words which compose the token according to each analysis? Should the preposition prefix b of the bcalam (in a photographer) reading be considered as a part of the word calam (a photographer) it is attached to? What about the definite article of the bacalam (in the photographer) reading? Does the pronominal pronoun suffix of the analysis becilam (under their shades) inflect the base word or attach to it as a clitic? In the following sections we discuss some of these issues, which are specific to the rules of word formation in Hebrew.
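The segmentation ambiguity illustrated by bclm can be made concrete with a small sketch. The lexicon below is an invented toy (two transliterated entries), only prefix splits are enumerated, and suffixes, vocalization, and the prefix grammar are ignored; a real analyzer handles all of these.

```python
# Transliterated formative letters; 's' stands in for šin here.
FORMATIVE = set("mshwklb")

def prefix_splits(token, lexicon):
    """Enumerate (prefix, base) splits of a token where the prefix consists
    only of formative letters and the base is a known lexicon entry.
    Toy sketch: real analyzers also handle suffixes and prefix ordering."""
    splits = []
    for i in range(len(token) + 1):
        prefix, base = token[:i], token[i:]
        if all(c in FORMATIVE for c in prefix) and base in lexicon:
            splits.append((prefix, base))
    return splits

lexicon = {"bclm", "clm"}  # toy entries only
print(prefix_splits("bclm", lexicon))  # [('', 'bclm'), ('b', 'clm')]
```

Even this crude enumeration already yields two competing segmentations of bclm; adding suffix stripping and a fuller lexicon produces the seven analyses mentioned above.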

2.4.1 General Overview

Following Spencer [115, p. 41], the question "what is a word and how is one to be recognized?" is one of the difficult and important problems in morphological theory. According to the orthographic view, words correspond to tokens – sequences of characters delimited by spaces. This definition might fit English, but it does not match many other languages. Polysynthetic languages,3 such as Chukchee, permit processes such as noun incorporation, so that a single token can encode a meaning which would require a fairly elaborate sentence in many other languages. The dictionary of heavily isolating languages, like Chinese, lists morphemes and their collocations rather than words. Hebrew, which has both fusional and agglutinative characteristics, can attach, for instance, preposition and possessive morphemes to a noun. For example, the three English words to his home can be expressed in Hebrew by one token l-beit-w, where the prefix l is the preposition to and the suffix w is the possessive his.

Bloomfield [20] defines a word as a minimal free form. In this way, words are the smallest meaningful units of speech which can stand by themselves. In some languages, word boundaries are marked by phonological phenomena, such as the span of vowel harmony, the position of stress, or phonetic constraints which make reference to word boundaries [115, p. 42]. In Chukchee and Finnish, for instance, words are delimited by the span of vowel harmony. Spencer [115, p. 42] points to the circularity of this criterion. Compound words in Finnish have a harmony span for each of their components but, on the other hand, there is only one stress for a compound word (it always falls on the first syllable of the compound), which indicates the existence of one word. The stress of Czech words always falls on the first syllable, but there are cases, such as monosyllabic prepositions before an unmodified noun, where the stress is attached to the preceding word. According to this analysis, the first of the following sentences will contain two words, the second will have three words, but only one word will be identified for the third sentence:

3Discussions on morphological typology can be found in [115, pp. 37–39] and [96].

(1) a. ten stůl ‘that table’
    b. na ten stůl ‘onto the table’
    c. na stůl ‘onto the/a table’

Finally, Yoon and Benmamoun [126] indicate that the phonological characterization does not coincide with other dimensions, so it is not considered definitive.

Following the rules of syntax, which take words as their smallest unit and compose them into phrases, one can define a word as the minimal free form – the smallest unit which can exist on its own, i.e., a word is primitive with respect to the syntactic rule system. Spencer [115, p. 42] points to two problematic constructions. The first problem relates to compounds: when two words are compounded, each is a minimal free form by definition. On the other hand, the question whether the resulting compound is a word changes from the morphological perspective (yes) to the syntactic perspective (no). The second problem is posed by clitics. The term clitic denotes an item which resembles a word but must be attached to another word [97, p. 498], such as the contracted negative particle n’t. One can claim that cliticization is part of word formation and that clitics are really affixes, but can we consider all the Hebrew words that are cliticized by the conjunction w (and) as inflected words? What kind of prefix is w when there are no morphological constraints on its attachment (only syntactic ones)?4

According to the semantic view, the term word refers to a semanteme – an irreducible unit of meaning. Spencer [115, p. 42] mentions the lexical integrity principle – no syntactic process is allowed to refer, exclusively, to parts of words. In this way, words are referentially opaque – we cannot ‘look’ inside them in order to refer to their parts. For example:

(2) a. a pound of tea
    b. a teapot

The first expression is not considered to be a single word, since tea refers to a particular kind of stuff, whereas the tea part of teapot does not refer to tea stuff. A reference to a packet of coffee cannot be expressed by pound of tea, since tea does not mean coffee. But making coffee with a teapot is not considered to be a semantic error. In such a way, anaphoric devices can be used to refer to the tea part of a pound of tea but not the tea part of teapot:

(3) a. He took a [pound of tea]_i and put two spoonfuls of it_i into the teapot
    b. * He took the [tea]_i-pot and poured it_i into the cup

There are three known exceptions to this criterion, which can be alternatively interpreted without questioning the whole word definition [126]. One of the exceptions is the opacity of phrasal idioms, which syntactically contain several words (4b) but semantically are regarded as one word (4a).

(4) transformational grammarian
    a. [[transformational grammar]-ian]
    b. [[transformational] [grammarian]]

Lexicographers consider a word as an entry in the lexicon. Di Sciullo and Williams [107] presented the notion of the listedness criterion:5 a word is something that is registered in long-term memory. As shown by Di Sciullo and Williams, this definition is problematic, since we can find listed syntactic objects (such as phrasal idioms), as well as unlisted or unlistable morphological objects (such as productive affixations).

In the following sections, we define the specific rules we have adopted to define the notion of ‘word’ in Hebrew. Our approach is pragmatic: for each case of aggregation in Hebrew morphology, we determine whether the phenomenon is an inflection, an affixation, or a clitic.

4According to Nir’s view, they are considered to be grammatical morphemes – see 2.4.3 below.

2.4.2 Definite Article

Hebrew marks definiteness in a different way than English. The definite article of Hebrew, ha, does not inflect. It is attached prenominally to nominals: common nouns – ha-bait (the house), adjectives – ha-’almoni (the anonymous), demonstratives – hahu’ (that one), and numbers – ha-rbi‘i, ha-ˇsaloˇs (the fourth, the three). The definite article in Hebrew cannot attach to phrases; in order to make a phrase definite, each of its elements must be explicitly definite: yeled h. amud nolad ⇒ hayeled hah. amud nolad, *hayeled h. amud nolad, in contrast to English: a sweet child was born ⇒ the sweet child was born. The definiteness of proper nouns, construct nominals, and possessives is implicit, and takes no definite article, e.g., dan hah. amud nolad (sweet Dan was born), qise hamelek tuqan (the king’s chair was fixed), tinoqam hah. amud tuqan (their sweet baby was born).

Should the definite article in Hebrew be treated as a stand-alone word, or as a clitic which inflects absolute nominals with a definiteness mark (as is the case for proper nouns, construct state, and possessives)? Wintner [121] investigated definiteness in the Hebrew noun phrase, claiming that the Hebrew definite article is much closer to an inflectional affix than to a clitic or a stand-alone word. Wintner suggests several tests to justify the identification of the definite article as an affix rather than a clitic, as follows:

5In addition to the notions of morphological object and syntactic atom.

• Semantic idiosyncrasy is more characteristic of affixed words than of clitic groups. Wintner points to contexts in which definite and indefinite noun phrases have identical meanings. In addition, he demonstrates cases of determination that are not carried out through the use of the definite article.

• Affixes are more selective than clitics. The Hebrew definite article attaches only to nominals, and is used to denote an entity.

• According to Miller [83], if an item must be repeated on each conjunct in a coordinate structure, then it must be an affix and cannot be a clitic. In the case of the Hebrew definite article, when elements are definite, ha must be repeated for each of the conjuncts: ‘einayim gdolot wyrukot (big green eyes), ha‘einayim hagdolot whayrukot, *ha‘einayim hagdolot wyrukot.

• The complement of a construct state adjective can be a noun, but not a noun phrase, e.g., h. atul yroq ‘einayim (a green-eyed cat), *h. atul yroq ‘ayin ’ah. at (a cat with one green eye). However, the modifier noun can be preceded by the definite article, e.g., hah. atul yroq ha‘ayin ha’ah. at (the cat with one green eye). This observation supports the view that a definite noun is not a phrase but a single word.

We adopt Wintner’s approach, despite evidence for a clitic interpretation: definite proper names – hapridmanim (the Fridmans) – and spoken Hebrew constructions that assign a definite article to construct state nouns, e.g., ha‘orkei din (the lawyers).

2.4.3 Formative Letters

In Hebrew, the formative letters – m, ˇs, h, w, k, l, b – can be attached as prefixes to a word, indicating functions such as coordination – w-bait (and a home), preposition – b-bait (at home), subordination – ˇsbbait (that at home), and more [4, p. 33], [106, section 3.2.5]. Har’el and Kenigsberg [60] listed 431 possible combinations of these formative letters, 76 of which were realised in a corpus of about 10M tokens [123]. Yona discusses the syntax of the prefix combinations. He defines a set of thirteen short formative words, based on the seven formative letters. These short formative words, when combined together, can compose all of the 76 observed prefix sequences, as listed in Appendix A.3 with examples. For a list of possible prefixes for each lexical category, see Appendix A.5.

Nir [89, p. 22] considers the formative letters as grammatical morphemes,6 i.e., the affixation of formative letters inflects the word with preposition, coordination, and subordination marks, etc. Ornan [95, p. 132], on the other hand, considers these letters as words which are orthographically agglutinated to the base word.

We adopt Ornan’s approach. The basic argument is the fact that the attachment of these letters does not restrict the inflection of the base word. These letters have a clear lexical classification, such as prepositions and conjunctions. The tests suggested by Wintner and discussed above do not give an unequivocal answer:

ˇs, kˇs, ml, mb  Low selectiveness – they can attach to verbs, nouns, adjectives, prepositions, and adverbs. No repetition is required in coordinate structures: h. aˇsabti ˇsh. am weyabeˇs (I thought it was hot and dry)7 – the same holds for kˇs, ml, mb.

w  The conjunction prefix has very low selectivity, and has no semantic idiosyncrasy.

b, k, l, m  The grammatical requirement for repetition of the preposition prefix on each conjunct in a coordinate structure,8 e.g., bhalel wubhodot (with praise and with thanks), is usually ignored in the spoken language, e.g., bˇsebah. whodayah. Even though preposition letters are more selective (they cannot attach to prepositions, adverbs, or verbs), morpho-syntactic considerations distinguish them from the word to which they are attached. They do not inflect and can stand alone with a nominative pronoun suffix, e.g., bo, lah, mehem (in him, to her, from them). The restriction against independent pronouns, e.g., *b hu’, *b hem, is common to preposition words, e.g., *’ecel hu’ (at him). From a syntactic point of view, they fill the same role as preposition words, and semantically, they express a ‘relation between words’ as do preposition words (see 5.5.1).

6See also Rosen’s definition of the case property [102, p. 99].

2.4.4 Pronoun Suffix

The pronoun suffix fills the role of possessive for nouns, of accusative for verbs, and of nominative for prepositions, adverbs, and some first-person verbs:

• Possessive: sipro – haseper ˇselo (his book).

• Accusative: saper ’oto – sapro (cut his hair).

• Nominative: ’ecel hu’ – ’eclo (at his), ‘od ’ani – ‘odeni (still I), lbad hen – lbadan (by themselves), h. oˇseˇs ’ani – h. oˇsˇsani (I am afraid lest).

7The grammatical construction h. aˇsabti ˇsh. am weyabeˇs is a coordination of the activity: I thought it was hot (I thought) it was dry.
8Ezra 3, 11.

As noted by Glinert [48, p. 52], appending an accusative pronoun to verbs corresponds to a formal register and is rather uncommon. In all usages, the direct object marker ’et generally intervenes and itself takes the pronoun suffix. The pronominal pronoun suffixes are listed in Appendix A.4. For a list of possible suffix functions for each lexical category, see Appendix A.5.

Ornan [95, p. 128] considers these pronoun suffixes to be inflections of the base word. Ornan’s approach is supported by the need for repetition in coordinate structures, e.g., kaspam wuzhabam (their silver and gold). We chose to treat the pronoun suffixes as clitics, a decision which was also made by Netzer [87] for Hebrew text generation. These suffixes have low selectivity – they can be attached to nouns, verbs, prepositions, and adverbs – and do not seem to have semantic idiosyncrasies. The only inflectional restriction on the base word is the construct state of nouns suffixed by a possessive pronoun. There are two main advantages to this decision:

1. According to our view, prepositions and adverbs are not considered to be strangely inflected by gender/number/person, but agglutinated to a clitic pronoun. In addition, there is no need for anomalous secondary inflection in order to explain the pronominal pronoun inflection of nouns and verbs.

2. From a computational point of view, this choice significantly decreases the size of the probabilistic model - see section 6.2 below.
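The size argument in point 2 can be illustrated with back-of-the-envelope arithmetic. The counts below are invented for illustration only; they are not the thesis's actual tagset sizes.

```python
# Hypothetical counts, for illustration only.
base_tags = 30      # tags without pronominal-suffix information
suffix_forms = 15   # person/gender/number combinations of the suffix

# If suffixes are inflections, every base tag splits into suffixed variants:
inflection_tagset = base_tags * (1 + suffix_forms)
# If suffixes are clitics with their own tag, the tagset only grows additively:
clitic_tagset = base_tags + suffix_forms

print(inflection_tagset, clitic_tagset)  # 480 45
```

Since the number of parameters of an HMM grows with the square (bigram) or cube (trigram) of the tagset size, this multiplicative-versus-additive difference matters for an unsupervised learner.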

2.4.5 Notation

The term token refers to the traditional ‘sequence of characters bounded with spaces’ definition, and the term word refers to any first-order inflected lexeme (excluding the pronoun suffixes and the formative letters).

The term token/word type denotes a form of token/word, where token/word instance refers to any instance of such token/word in a corpus.
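The type/instance distinction can be demonstrated directly. In this sketch (over an invented five-token corpus), one token type occurs twice, so there are five instances but only four types:

```python
from collections import Counter

corpus = ["hyld", "hlk", "hbyth", "hyld", "xzr"]  # toy transliterated tokens
type_counts = Counter(corpus)

print(len(corpus))         # 5 token instances
print(len(type_counts))    # 4 token types
print(type_counts["hyld"]) # the type 'hyld' has 2 instances
```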

2.5 Hebrew Lexical Categories

The notion of part of speech has a long tradition.9 The central parts of speech (verb, noun, and adjective) are part of the basic linguistic intuitions of all speakers. Surprisingly, however, a complete list of parts of speech for a given language is not well established. Beyond verbs, nouns, and adjectives, many other lexical units appear in text – and each raises potential questions as to what is meant by part of speech, what is the best way to label every unit in a document, and how to distinguish among the various labels.

According to Ornan [92], the part of speech is an integral attribute of the morphology. Parts of speech were designed in order to categorize the types of words. Ornan argues against the involvement of syntactic considerations in the definition of the standard part-of-speech sets of various languages.10 Schwarzwald [106, volume 1, pp. 100–109]11 presents the traditional list of nine parts of speech for Hebrew,12 which is equivalent to the classic list of Gesenius [46]:

• Nouns: common nouns, adjectives, pronouns, numerals.

• Verbs

• Particles: prepositions, conjunctions, interjections, adverbs.

As noted by Schwarzwald, the particles cover ‘closed sets’ of words (i.e., those which undergo few changes over time), whereas nouns and verbs are open sets. Rosen [102, chapter 5] defines four categorial dimensions – person-sex-number,13 gender-quantity, case,14 and tense – in order to formalize thirteen word classes (i.e., parts of speech), as shown in Table 2.1. The finite verb category, for instance, is composed of all forms inflected by person/sex/number/tense, in contrast to participles, which have no person dimension (but do have a tense mark, e.g., hayordim huh. zru (the emigrants were returned)). Gerunds differ from infinitives due to their syntactic tense mark (blekto qamah (at his coming she woke up)) and their attachment to case prefixes (capiti bredet hat.al (I watched the dew sprinkling)). Nouns can attach case morphemes, and have no tense dimension. Appellatives have a gender/quantity dimension,15 and can be inflected by the person/sex/number of the possessive pronoun suffix.16 Personal pronouns and anthroponymics are not inflected by gender/quantity, whereas toponymics are not suffixed by a possessive pronoun with person/sex/number. Adjectives are inflected by gender/quantity, with no case attachment. Impersonals have no inflection but tense,17 whereas adverbs can be subcategorized into local-temporals, which attach case morphemes, and all other adverbs, which are not inflected at all. Finally, verboids have all dimensions but case, e.g., hayah li, haya lka, hayah lahem, hayu lanu, yihyu lakem, yeˇslah, yeˇslo (I had, you had, they had, we had, you will have, she has, he has), etc.

Part of speech              person-sex-number   gender-quantity   case   tense
Verb      Finite                    X                                      X
          Infinitive                                                X
          Gerund                    X                               X      X
          Participle                                  X                    X
Noun      Appellative               X                 X             X
          Personal pronoun          X                               X
          Anthroponymic             X                               X
          Toponymic                                                 X
Adverb    Local-temporal                                            X
          Other
Verboid                             X                 X                    X
Impersonal                                                                 X
Adjective                                             X

Table 2.1: Word categorization according to Rosen’s four categorial dimensions.

9For a historical review see [38, pp. 203–209]; for a discussion on the English parts of speech set, see [114, chapter 2].
10For a discussion on POS categorization criteria, see [76, chapter 2].
11See also [19, chapter 9].
12For a description of this categorization in terms of Goshen’s theory, which was mentioned above (2.3.2), see [106, volume 1, pp. 151–153].
13See below.
14I.e., prefixation of be (at), l (to), ’et, ‘al (on), etc. – see [102, section 7.1.4].

The main contribution of Rosen’s elegant work is the positioning of the morpho-syntactic properties as the base criteria for lexical classification. Some of these criteria can be used to formalize tests for resolving the part of speech of a word

15 e.g., yeled, yalda, yladim, yladot (boy, girl, boys, girls).

16 e.g., yaldi, yladeika, yaldan (my boy, your boys, their boy, etc.)

17 e.g., carik liˇson – hayah carik liˇson (should sleep – should have slept).

in a given context. In the scope of building a morphological disambiguator, the following points should be taken into account:

1. Even though the dimensions suggested by Rosen are basically morphological, the implementation for lexical classification is mostly syntactic. The tense mark of impersonals, e.g., carik (should), for instance, is given by the syntactic structure of the whole phrase.

2. Some of Rosen’s definitions are not clear. The distinction between toponymics and anthroponymics seems unnatural. Why are anthroponymics considered to have a person dimension and toponymics not? How do temporal-local adverbs attach a case morpheme?

3. An attempt to apply his method to the task of tagging a corpus exposed some deficiencies. We found many adjectives that do attach case, e.g., la‘ayepim (to the tireds). Rosen might argue for a hidden noun, i.e., l(’anaˇsim) ha‘ayepim (to the tired (people)), but such a policy would make the analysis complicated. The lexical category of numerals – some are inflected by person/sex/number18 and some are not19 – is not clear. There is no part of speech for punctuation, interjections, dates and times, foreign words, titulars, or URL addresses.

4. Rosen does not classify tokens but words, which are sometimes composed of morphemes spanning several tokens, e.g., hayah lo (he had), yihyeh carik (should be). This could be problematic for traditional morphological analyzers, which are based on tokens.20

Several computational analyzers have been developed in the past decade, such as [26], [108],21 [124],22 [29],23 and [113]24 (see section 2.8.3). The parts-of-speech sets

18 ˇsnayim – ˇsneihem (two – two of them)
19 me’ah (one hundred)
20 See 6.3.1 for a text representation which can flexibly handle such inter-token words.
21 http://www.cs.technion.ac.il/∼erelsgl/bxi/hmntx/teud.html
22 http://cl.haifa.ac.il/projects/hebmorph
23 http://www.ravmilim.co.il
24 http://mila.cs.technion.ac.il/website/english/resources/corpora/treebank

for each of these analyzers are summarized in Table 2.2. We return to this issue in chapter 5, where we discuss the design of our parts-of-speech set.

2.6 Hebrew Inflectional Properties

The term inflection denotes the modification of a stem form to indicate the grammatical function a word fulfills. Inflections are usually based on affixation, internal change, reduplication, and suppletion. For a detailed analysis of the different properties of derivation and inflection, with reference to Hebrew morphology, see [106, volume 2, pp. 85–155]. The Hebrew inflectional properties are: gender (masculine, feminine), number (singular, dual, plural), person (first, second, third), tense (past, present, future, imperative, infinitive, bare infinitive), construct state (absolute, construct), and pronoun suffixes (possessive, accusative, nominative, and their person/gender/number inflections). These properties are traditionally described as affecting nouns/adjectives and verbs (even though some of them take part in preposition and adverb inflection as well). Verbs are inflected by number, person, gender,25 and tense, as described in Appendix A.1. For details on the inflectional morphemes of these properties see [102, chapter 7], [4], [106, section 5.2, chapters 11,12], [48]. Nouns are inflected by gender, number, and construct state, as described in Appendix A.2. For a list of the inflection domains for each lexical category, see Appendix A.5.

2.7 Morphological Analyzer

In Hebrew, morphological analysis requires complex processing according to the rules of Hebrew token formation. As mentioned above, the task of a morphological analyzer is to produce all possible analyses for a given word.

25 Rosen [102, pp. 99–100] distinguishes sex, which is merged with person in the case of verbs, from gender, which relates only to quantity, as in the case of nouns.

Part of Speech | HMA | Segal | Yona | Rav Milim | Treebank

Noun           X X X X X
Pronoun        X X X Xa Xb
Proper Name    X X X X X
Adjective      X X X X X
Verb           X X X X X
Adverb         X X X X Xc
Preposition    X X X X Xd
Conjunction    X X X X X
Numeral        X Xe
Quantifier     X X X
Determiner     X
Aux Verb       X X
Interrogative  X X X Xf
Interjection   X X X
Particle       X X
Prefix         X X
Suffix         X
Negation       X X
Abbreviation   X
Punctuation    X X X
Foreign        X X
Existential    X
Modifier(g)    X
Explanation    X

a independent, indefinitive, demonstrative, interrogative
b With two syntactic notions: PRP, AGR – see: http://mila.cs.technion.ac.il/treebank/Decisions-Corpus1-5001.v1.0.doc
c With two syntactic notions: RB, RBR – Ibid.
d With three syntactic notions: IN, POS, AT – Ibid.
e cardinal, ordinal, distributive
f With three syntactic notions: QA, HAM, WDT – Ibid.
g Used for general modifiers of nouns, adjectives, adverbs, prepositional phrases – Ibid.

Table 2.2: Parts-of-speech sets of various computational analyzers for Hebrew.

There are two known types of morphological analyzers in NLP systems: corpus-based (or heuristic), and knowledge-based (or dictionary-based). In a corpus-based analyzer, the set of possible analyses for each token is given by the tags attached to the token’s instances in an annotated corpus. The main disadvantage of this method is the analysis of unknown tokens – the analyzer cannot propose analyses for a token which is absent from the training corpus. Moreover, even if the token does appear in the corpus, the analyzer can miss a required analysis which was not attached to any of the token’s instances in the annotated corpus. Knowledge-based analyzers are composed of a lexicon and a set of rules. Any instance of a form which has an entry in the lexicon can be analyzed according to the morphological rules. There are two main advantages to this method: (1) there is no need for an annotated corpus, which is an expensive resource; (2) the number of unknown words/analyses depends on the coverage of the lexicon – in most cases there are many fewer unknowns with this method. In our work, we rely on the morphological analyzer developed at the Knowledge Center (KC), which is knowledge-based.
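The dictionary-plus-rules scheme can be sketched as follows. This is a minimal illustration only: the toy lexicon entries, prefix table, and category labels are invented for the example and do not reflect the KC analyzer's actual data.

```python
# Sketch of a knowledge-based analyzer: a lexicon of stems plus a table of
# legal formative-prefix segmentations. All entries here are toy examples.

LEXICON = {
    "spr": [("noun", "book"), ("verb", "count")],
}
PREFIXES = {"": (), "h": ("DEF",), "b": ("PREP-b",), "bh": ("PREP-b", "DEF")}

def analyze(token):
    """Return every (prefix morphemes, POS, gloss) analysis licensed by the rules."""
    analyses = []
    for prefix, morphemes in PREFIXES.items():
        if token.startswith(prefix):
            for pos, gloss in LEXICON.get(token[len(prefix):], []):
                analyses.append((morphemes, pos, gloss))
    return analyses

print(analyze("hspr"))   # definite-article prefix + both readings of 'spr'
print(analyze("xyz"))    # [] -- an unknown token
```

Note how the second call returns the empty list: the quantity of such unknowns depends entirely on the lexicon's coverage, as discussed above.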

The Lexicon

According to Spencer [115, pp. 47–49], the term lexicon simply means dictionary, and a dictionary is a list of words with their meanings and other useful bits of linguistic information. There are several questions related to the nature of the linguistic lexicon. American structuralists, led by Bloomfield [20], consider the lexicon as containing only completely idiosyncratic information. Any property of a word that can be predicted from phonology or syntax is excluded from the lexicon. The lexicon contains a list of morphemes which can produce all the words of the language using word formation rules. The most obvious problem is that the meaning of a word is not always predictable from the meanings of its morphemes and, in some cases, the final pronunciation of a word cannot be predicted from the phonological

form of its component morphemes. In another approach, the lexicon contains a list of complete words. But even a lexeme list leaves the lexicon potentially large. For example, recursive formation – the self-feeding property of the compounding rule – can produce many compound nouns in English. These two approaches can be combined to define the lexicon as containing a list of morphemes and a list of words. The word list contains words which are formed by non-productive morphological processes, excluding those words that can be produced by productive processes and whose meanings can be determined solely from the meanings of their components. Regularly inflected word forms would, therefore, not be listed, nor would regular nominalizations. According to Jackendoff [64], all words of a language are listed in the lexicon, whether or not they conform completely to the laws of form and meaning of words. The rules of morphology are conceived of as redundancy rules, by means of which the ‘cost’ of a lexical item is computed. Those that are totally predictable have no cost. As noted by Ritchie et al. [101], practical implementations often rely on fairly ad hoc mechanisms, sometimes blurring important theoretical distinctions. Their computational analyzer for English focuses on designing such a lexicon. For the case of Hebrew, Yona and Wintner [124,125] designed a lexicon as part of their analysis system. A description of this lexicon is given in [124, chapter 3].

Morphological Rules

Aronoff and Fudeman [7, p. 12] identify two approaches to morphological analysis: analytic and synthetic. Analytic approaches break words into minimal units, while synthetic approaches build words from minimal units. Yona and Wintner [124, 125] defined analytic morphological rules for Hebrew as regular expressions, over finite-state technology. Such a design supports both analysis and generation of words by applying reversible rules to a lexicon. A full list of the rules is given in [124, chapter 4].

Har’el and Kenigsberg [61] designed a synthetic morphological analyzer based on inflection and prefixation rules for verbs, nouns, adjectives, and some less systematic classes (adverbs, prepositions, particles). The rules are applied to the lexemes of the lexicon in order to generate all the possible tokens of the language,26 which are stored in a map together with the inflection and prefixation rules that generated them. In this way, analysis and generation of a token is simply a lookup operation over that map.
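The generate-then-look-up design can be sketched as follows. The lexeme entry and the naive plural rule are toy stand-ins for the actual rule tables (real Hebrew inflection also alters the stem, e.g., yeled/yladim):

```python
# Sketch of a synthetic (generation-based) analyzer: all inflected forms are
# generated offline and stored in a map, so analysis is a single lookup.
# The lexeme list and the crude suffix-only plural rule are toy examples.

LEXEMES = {"yeled": "noun"}  # 'boy'

def inflect(lexeme, pos):
    """Yield (surface form, features) pairs for a lexeme; a toy paradigm."""
    if pos == "noun":
        yield lexeme, "noun.sg.abs"
        yield lexeme + "im", "noun.pl.abs"  # real rules also change the stem

ANALYSES = {}
for lexeme, pos in LEXEMES.items():
    for form, feats in inflect(lexeme, pos):
        ANALYSES.setdefault(form, []).append((lexeme, feats))

def analyze(token):
    """Analysis is a map lookup over the pre-generated forms."""
    return ANALYSES.get(token, [])

print(analyze("yeledim"))  # [('yeled', 'noun.pl.abs')]
```

The design choice is the usual space/time trade-off: the map over all generated forms is large (about half a million token types, per the footnote), but both analysis and generation become constant-time lookups.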

2.8 Morphological Disambiguator

As mentioned above, a token type without a context may be ambiguous, i.e., there might be more than one analysis that fits the token. The Hebrew token bclm, for instance, has 7 possible analyses, while the token hn‘ym has three, as shown in Table 1.1. The task of a morphological disambiguator is to pick the most likely analysis produced by an analyzer in the context of a sentence. The architecture of such a system (w denotes tokens, t denotes analyses) is illustrated in Figure 2.1. The input of the disambiguator is an analyzed sentence, where each token is assigned a set of possible (context-free) analyses. As output, the disambiguator selects one analysis for each token in the sentence. Note that the disambiguator is free to select an analysis which is not suggested by the analyzer (see chapter 7).

2.8.1 Motivation

Morphological disambiguation is an essential component in many natural language processing (NLP) applications. An information retrieval system, for instance, should find the correct part of speech of the Hebrew token h.oze, in order to index it as contract (noun) or watch (verb). A text-to-speech system should determine the gender of the Hebrew token ’iˇsah: feminine (a woman) or masculine (her husband). A machine translation system must find out whether

26They report about half a million generated token types.

Analyzer:       w1 → {t1,1, ..., t1,k1}, ..., wn → {tn,1, ..., tn,kn}
Disambiguator:  w1 → t1,i1, ..., wn → tn,in

Figure 2.1: Disambiguation process schema.

the tense of the Hebrew verb sprw is imperative (count!) or past (they cut hair). In addition, morphological analysis can be used as a knowledge-base input for other applications. A syntactic parser makes use of the lexical category of the tokens in the text. A noun phrase chunker may be interested in the construct-state property of the words to be chunked. Word prediction can be more accurate if the morphological attributes of the previous words are taken into account.

2.8.2 Disambiguation Methods

How can one decide the correct analysis for a token in a given context? There are essentially two sources of information: the syntagmatic structural information – looking at the analyses of the other tokens in the context of the given token – and the lexical information, which is given by the token and its possible analyses [77, 10.1]. The syntagmatic structural information seems to be the most obvious source of information for analysis, but it is not very successful. The early deterministic rule-based analyzer of Greene and Rubin [53], which used such information about syntagmatic patterns, correctly analyzed only 77% of words, which means that the lexical information is very important. The utility of this information was conclusively demonstrated by Charniak et al. [27], who showed that a ‘dumb’ analyzer for English, which simply assigns the most common analysis to each word, performs at the surprisingly high level of 90% correctness. The conclusion of the above results is that information of both syntagmatic and lexical natures is needed for improving the analysis accuracy. For the rest of this work, the term tag will denote the morphological analysis of a token, and the term tagger will refer to a morphological disambiguator. As shown in Figure 2.2, most tagging algorithms fall into one of two classes: rule-based taggers and stochastic taggers. Rule-based taggers generally involve a large database of hand-written disambiguation rules. Such rules, for example, can specify that an ambiguous word is a noun rather than a verb if it follows a determiner. Stochastic taggers, in contrast, generally resolve tagging ambiguities by using a training corpus to compute the probability of a word having a given tag in a given context [65, section 8.3]. Hidden Markov models [77, chapter 9, section 10.2], decision trees [105], neural networks [16], memory-based learning [32], and maximum entropy [99] are techniques used for taggers of this class. Brill’s transformation-based tagger [23] combines both rule-based and stochastic methods in order to learn the rules from a training corpus, based on the context of the word. A supervised tagger is an implementation of a tagging model that uses an analyzed corpus as a learning basis. Designing an unsupervised tagger is more challenging, and is necessary when such a corpus does not exist. Moreover, even if such a corpus is available, the ability to learn from new untagged text is necessary for handling the dynamic nature of a spoken language, as realized by the content of the corpus through time.
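The ‘dumb’ baseline reported by Charniak et al. can be sketched as follows. The tiny training corpus is a toy assumption; the real experiments use a large annotated corpus:

```python
# Baseline tagger: assign each word its most frequent tag from training data.
from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    """Learn the most frequent tag per word from (word, tag) pairs."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

# Toy training data: 'book' is seen twice as a noun, once as a verb.
corpus = [("the", "DET"), ("book", "NOUN"), ("book", "NOUN"), ("book", "VERB")]
best = train_baseline(corpus)
print(best["book"])  # NOUN -- the majority tag wins, ignoring all context
```

Despite using no context at all, this baseline reaches around 90% for English, which is what motivates combining lexical with syntagmatic information.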

2.8.3 Disambiguation in Various Languages

English

In the case of English, because the morphology is simpler, morphological disambiguation is generally covered under the task of part-of-speech tagging. The main morphological variations are embedded in the tag name (for example, Ns and Np for noun singular or plural).

Disambiguator:
  Rule Based
  Stochastic:
    Supervised
    Unsupervised

Figure 2.2: Disambiguator types.

Various lists of parts of speech have been used in various tagging projects for English, where the sizes of these tagsets range from 48 to 195. The Penn Treebank [79] reduces the 87 tags of the Brown corpus [44] to 36 POS tags and 12 other tags. The reduced set leaves out information that can be recovered from the identity of the lexical item. Most tagging situations, however, do not involve parsed corpora and require a larger tagset. Corpora that aim to code more grammatical behavior use a much larger tagset, such as the Lancaster-Oslo/Bergen corpus (135 tags), the Lancaster UCREL group (about 165 tags), and the London-Lund Corpus of Spoken English (197 tags). The relation between these and some other tagsets is discussed in [45, Appendix B].

The tagging accuracy of supervised stochastic taggers is around 96%–97% [77, section 10.6.1]. Merialdo [81] reports an accuracy of 86.6% for an unsupervised token-based HMM, trained on a corpus of 42,186 sentences (about 1M words), over a tag set of 159 different tags. Elworthy [41], in contrast, reports accuracy of 75.49%, 80.87%, and 79.12% for unsupervised word-based HMM trained on parts of the LOB corpora, with a tagset of 134 tags. With good initial conditions, such as good approximation of the tag distribution for each word, Elworthy reports an improvement to 94.6%, 92.27%, and 94.51% on the same data sets. Merialdo, on the other hand, reports an improvement to 92.6% and 94.4% for the case where 100 and 2000 sentences of the training corpus are manually tagged.

38 Hebrew

Modern Hebrew is characterized by rich morphology with a high level of ambiguity. On average, in our corpus, the number of possible analyses per known word reached 2.7, compared with the ambiguity levels of the extended POS tagsets reported for English (1.41), Dutch (1.29), German (1.87), French (1.7), Greek (1.5), Italian (1.72), and Spanish (1.25) in [34]. In Hebrew, several words combine into a single token in both agglutinative and fusional ways. This results in a potentially high number of tags for each token.

In contrast to English tag sets, the number of tags for Hebrew, based on all combinations of the morphological attributes, can theoretically grow to about 300,000 tags. In practice, we found ‘only’ 3,561 tags in a corpus of news stories we gathered, which contains about 42M tokens.
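The ambiguity level cited above is simply the average number of analyses per token. As a sketch, with a stand-in for the analyzer function (the counts for the two tokens match the example discussed earlier in this section):

```python
# Ambiguity level = total number of analyses / total number of tokens.
# `analyze` stands in for any morphological analyzer.

def ambiguity_level(tokens, analyze):
    return sum(len(analyze(t)) for t in tokens) / len(tokens)

# Toy check with the two tokens discussed in the text (7 and 3 analyses):
toy = {"bclm": 7 * ["a"], "hn'ym": 3 * ["a"]}
print(ambiguity_level(["bclm", "hn'ym"], lambda t: toy[t]))  # 5.0
```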

Several works in the past decade have dealt with Hebrew tagging.

Levinger et al. [74] developed a context-free method to acquire morpho-lexical probabilities from an untagged corpus. Their method handles the data sparseness problem by using a set of similar tokens for each token, built according to a set of rules. The rules produce variations of the morphological properties of the word analyses. Their tests indicate an accuracy of about 88% for context-free analysis selection based on the approximated analysis distribution. In tests we reproduced on a larger data set (30K tagged tokens), the accuracy is only 78.2%. In order to improve the results, the authors recommend merging their method with other morphological disambiguation methods – which is the approach we pursue in this work.
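The similar-words idea can be sketched roughly as follows: score each analysis of a token by the corpus frequency of surface variants that share that analysis's morphological properties. The variant-generation function and the toy forms here are hypothetical stand-ins for Levinger et al.'s actual rule set:

```python
# Rough sketch of context-free analysis scoring via similar-word counts.

def analysis_distribution(token, analyses, variants_of, corpus_counts):
    """Approximate P(analysis | token) from frequencies of similar words.

    analyses      -- the analyzer's analyses for `token`
    variants_of   -- maps (token, analysis) to surface forms sharing its features
    corpus_counts -- token frequencies in a large untagged corpus
    """
    scores = {a: sum(corpus_counts.get(v, 0) for v in variants_of(token, a))
              for a in analyses}
    total = sum(scores.values()) or 1  # avoid division by zero
    return {a: s / total for a, s in scores.items()}

# Toy example: the noun reading's variant is seen 5 times, the verb's once.
counts = {"sprh": 5, "sprw": 1}
variants = lambda t, a: {"noun": ["sprh"], "verb": ["sprw"]}[a]
print(analysis_distribution("spr", ["noun", "verb"], variants, counts))
```

The key point is that variants are usually less ambiguous than the original token, so their counts give a usable, if noisy, estimate of the analysis distribution.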

Levinger’s morphological disambiguation system [73] combines the above approximated probabilities with an expert system based on a manual set of 16 syntactic constraints. In the first phase, the expert system is applied, disambiguating 35% of the ambiguous tokens with an accuracy of 99.6%. In order to increase the applicability of the disambiguation, approximated probabilities are used for tokens that were not disambiguated in the first stage. Finally, the expert system is applied again over the new probabilities that were set in the previous stage. Levinger

reports an accuracy of about 94% for disambiguation of 85% of the tokens in the text (overall 80% disambiguation). The system was also applied to prune out the least likely analyses in a corpus, but without necessarily selecting a single analysis for each word. For this task an accuracy of 94% was reported, while eliminating 92% of the ambiguous analyses.

Carmel and Maarek [26] use the fact that, on average, 45% of Hebrew tokens are unambiguous to rank analyses based on the number of disambiguated occurrences in the text, normalized by the total number of occurrences of each word. Their application – indexing for an information retrieval system – does not require all of the morphological attributes, but only the lemma and the POS of each word. As a result, for this case, 75% of the tokens remain with one analysis (with 95% accuracy), 20% with two analyses, and 5% with three analyses.

Segal [108] built a transformation-based tagger in the spirit of Brill [23]. In the first phase, the analyses of each word are ranked according to the frequencies of the possible lemmas and tags in a training corpus of about 5,000 tokens. Selecting the highest-ranked analysis for each word gives an accuracy of 83% on the test text, which consists of about 1,000 tokens. In the second stage, a transformation learning algorithm is applied (in contrast to Brill, the observed transformations are not applied, but used for re-estimation of the word-couple probabilities). After this stage, the accuracy is about 93%. The last stage uses a bottom-up parser over a hand-crafted grammar with 150 rules, in order to select the analysis which makes the parse more accurate. Segal reports an accuracy of 95%. Testing his system over a larger test corpus gives poorer results: Lembersky [72] reports an accuracy of about 85%.

Bar-Haim et al. [103] developed a word segmenter and POS tagger for Hebrew. The method proceeds in two sequential steps: tokens are first segmented into words, and these words are then tagged with POS. The segmentation is based on an HMM, trained over a set of about 80K annotated tokens, and reaches an accuracy of 97.21%. POS tagging, based on unsupervised estimation which combines a small annotated corpus with an untagged corpus of 340K tokens by using a smoothing technique, gives an accuracy of 90.81%. Recently, Shacham and Wintner [111] developed a supervised morphological disambiguator for Hebrew, based on a combination of nine morphological classifiers.27 Each of the classifiers is trained on the output of the morphological analyzer, in a window of ±3 tokens, over an annotated corpus of about 90K tokens. This method was implemented previously for Arabic by Habash and Rambow [56]. The main contribution of Shacham and Wintner is the sophisticated combination technique, based on a small set of six hand-crafted constraints. The system was trained and tested with 10-fold cross-validation over the annotated corpus, and achieved 91.44% accuracy for full morphological disambiguation.

Arabic

Arabic is a language with morphology quite similar to Hebrew. Theoretically, there might be 330,000 possible morphological tags, but in practice, Habash and Rambow [56] extracted 2,200 different tags from their corpus, with an average of 2 possible tags per word. As reported by Habash and Rambow, the first work on Arabic tagging which used a corpus for training and evaluation was that of Diab et al. [35]. Buckwalter [25] built a morphological analyzer for Arabic based on a lexicon and a set of rules. Under the assumption that Arabic tokens are composed of a prefix, a stem, and a suffix, the lexicon was designed to contain tables which define the prefixes, stems, and suffixes, with their lexical categories and their legal combinations. For a given token, the analyzer looks for any legal combination of prefix-stem-suffix, assigning their morphological category to the whole token. Habash [55] developed the ALMORGEANA analyzer, which uses Buckwalter’s lexicon, but produces output in a lexeme-and-feature format rather than the stem-and-affix format of the Buckwalter analyzer. Habash and Rambow were the first to use a morphological analyzer as part

27POS, Gender, Number, Person, Tense, Definite Article, Status, Segmentation, and a binary classifier which denotes categories whose words have attributes (such as nouns and verbs).

of their tagger. They developed a supervised morphological disambiguator, based on training corpora of two sets of 120K tokens, which combines several classifiers of individual morphological features. The accuracy of their analyzer is 94.8%–96.2% (depending on the test corpus). An unsupervised HMM model for dialectal Arabic (which is harder to tag than written Arabic), with an accuracy of 69.83%, was presented by Duh and Kirchhoff [39]. Their supervised model, trained on a manually annotated corpus, reached an accuracy of 92.53%. Arabic morphology seems to be similar to Hebrew morphology in terms of complexity and data sparseness, but a comparison of the performance of the baseline tagger used by Habash and Rambow – which selects the most frequent tag for a given word in the training corpus – for Hebrew and Arabic shows some intriguing differences: 92.53% for Arabic and 71.85% for Hebrew. Furthermore, as mentioned above, even the use of a sophisticated context-free tagger, based on [74], gives a low accuracy of 78.2%. This might imply that, despite the similarities, morphological disambiguation in Hebrew is harder than in Arabic. It could also mean that the tag set used for the Arabic corpora has not been adapted to the specific nature of Arabic morphology (a comment also made in [56]).
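The table-driven prefix-stem-suffix scheme can be sketched as follows. The tables and compatibility categories are toy inventions in the spirit of Buckwalter's design, not his actual lexicon entries:

```python
# Sketch of a Buckwalter-style analyzer: enumerate every segmentation of a
# token into prefix+stem+suffix, keeping only table-licensed combinations.
# All table entries and category names below are toy examples.

PREFIXES = {"": "Pref-0", "w": "Pref-Conj"}
STEMS = {"ktb": "Verb", "kitab": "Noun"}
SUFFIXES = {"": "Suff-0", "h": "Suff-Poss"}
COMPATIBLE = {
    ("Pref-0", "Verb", "Suff-0"), ("Pref-Conj", "Verb", "Suff-0"),
    ("Pref-0", "Noun", "Suff-0"), ("Pref-0", "Noun", "Suff-Poss"),
    ("Pref-Conj", "Noun", "Suff-0"), ("Pref-Conj", "Noun", "Suff-Poss"),
}

def analyze(token):
    """Enumerate every legal prefix+stem+suffix segmentation of `token`."""
    out = []
    for i in range(len(token) + 1):
        for j in range(i, len(token) + 1):
            pre, stem, suf = token[:i], token[i:j], token[j:]
            if pre in PREFIXES and stem in STEMS and suf in SUFFIXES:
                cats = (PREFIXES[pre], STEMS[stem], SUFFIXES[suf])
                if cats in COMPATIBLE:  # the tables' legal-combination check
                    out.append((pre, stem, suf, cats))
    return out

print(analyze("wkitabh"))  # conjunction prefix + noun stem + possessive suffix
```

The compatibility set is what keeps the brute-force segmentation loop from over-generating: a segmentation is an analysis only if all three parts' categories are licensed together.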

Turkish

Turkish has an agglutinative morphology with a productive inflectional and derivational suffixation mechanism, i.e., suffixes can be added to a word in order to form a new one, with no restriction on this process.28 For example, the token sa˘glamlastirmak is actually a combination of the subtokens sa˘glam+las+tir+mak, which means: to cause (something) to become strong, or to strengthen (something). Theoretically, the number of word forms one can derive from a given Turkish root is infinite. In practice, the number of inflections/derivations (#ID) per token (%T) is much lower, as shown in Table 2.3 (based on a corpus of one million tokens – tokens with seven or eight inflections/derivations do exist, but they are rare). Theoretically, there might be an infinite number of tags to characterize all

28 In contrast to the limited pronoun suffixes of Hebrew.

#ID   %T
 1    72
 2    18
 3     7
 4     2
 5     1

Table 2.3: Distribution of inflections/derivations for Turkish.

possible tokens. In practice, Hakkani-Tur [58] extracted about 10K tags from a corpus of 1M tokens, with an ambiguity level of 1.75 (55% of the tokens have only one possible tag). In order to handle the high number of tags, Hakkani-Tur assumes that: (1) A root-word depends only on the roots of the previous words, and is independent of their inflectional and derivational productions; (2) When a word is considered as a sequence of inflections, syntactic relations only hold between the last inflection group of a (dependent) word and some (including the last) inflection group of the (head) word on the right. Based on these assumptions, Hakkani-Tur defined three supervised trigram language models:

Model 1 The presence of an inflection group in a word only depends on the final inflection group of the last two words.

Model 2 The presence of an inflection group in a word depends on the final inflection group of the previous two words and the previous inflection group in the same word.

Model 3 Same assumption as in model 2, except that the previous inflection group in the same word is assumed to be independent of the final inflection groups of the previous words.

The three models were trained on a dataset consisting of the unambiguous sequences of 650K tokens of a daily newspaper,29 and two sets of manually disambiguated corpora of 12,000 and 20,000 tokens. The dataset used for evaluation consists of 2,763 tokens; ≈34% of the tokens were ambiguous, with an overall ambiguity of 1.53 analyses per token after preprocessing.30 For the baseline supertag model the tagging accuracy was 91.34%. The best accuracy – 93.95% – was achieved with model 1. The tagset for Hebrew is composed of ‘only’ 3,561 tags, but the ambiguity level is much higher – 55% of the tokens have more than one possible analysis, with an average ambiguity of 2.7 (cf. 34% and 1.53 for Turkish). The assumptions made by Hakkani-Tur in order to reduce the state-transition space seem not to be applicable to Hebrew, which restricts the affixation process (see Section 2.4).31 In contrast to Hakkani-Tur’s approach, in this work we develop a general word-based representation of the text, which reduces the size of the tagset (instead of relying on the orthographic tokens with a specific state-transition reduction method), and supports unsupervised learning of the probabilistic model. For a discussion of tagging models for Asian languages, see Section 6.3.4.

29 For example, given a sequence consisting of six words, w1 w2 w3 w4 w5 w6, where w1 and w6 are ambiguous, only w2, w3, w4 and w5 (with one tag each – t2, t3, t4 and t5, respectively) were considered for the estimation of P(t4|t2,t3) and P(t5|t3,t4).

30 Pre-processing consists of: (1) eliminating very rare root words that are ambiguous with a very frequent root word; (2) for known collocations, eliminating any word analyses that contradict the collocation; and (3) eliminating analyses of words that contradict the known postposition of the word that precedes them.
31 It seems that a Dynamic Bayes Net (DBN) – see [84] – would offer an excellent framework to model Hakkani-Tur’s method. In this directed graphical model, the tag of each token can be represented by a Bayes Net of inflection groups, enabling temporal state transitions (of adjacent tokens) between inflection groups instead of the sparse supertags.

Chapter 3

Objectives

The role of the puzzle maker is difficult to define...

Disambiguation System The purpose of our work is the development of a morphological disambiguation system for Hebrew. The task of such a system is to pick the most likely analysis produced by an analyzer in the context of a full sentence. Recent Hebrew analyzers provide good performance and documentation of this process [43, 108, 125]. As mentioned above (section 2.8.3), several disambiguators were developed for Hebrew in the last decade: Levinger [73], Carmel and Maarek [26], Segal [108], Bar-Haim et al. [11], and Shacham and Wintner [111]. The results discussed above should be taken as rough approximations of the real performance of these systems, until they can be re-evaluated on a large-scale corpus with a standard tag set. In any case, the reported accuracy of these systems must be improved, to a level of 95% for POS tagging and 90% for full morphological disambiguation.

Annotated Corpus As noted earlier, there is no large-scale Hebrew annotated corpus. Our aim is to build a standard wide-coverage test corpus of about 200K tokens, which can also be used for training and research.

Tagset Design We realized that there is no agreement, among dictionaries and automatic tools, on the part-of-speech set for Hebrew. Preliminary attempts to annotate some Hebrew texts have exposed confusions and disagreements among dictionaries and human taggers, such as the distinction between prepositional phrases and prepositions, the distinction between adverbial phrases and adverbs, and the categorization of participles, modals, quantifiers, existentials, and copulas. As a first step, our main motivation is to design a tagset and formalize tagging guidelines that assist human taggers in achieving a high level of agreement. The annotation process should be applied iteratively, revisiting the tagset definition and the guidelines with reference to evidence found in the corpus.

Tagging Model As noted earlier (section 2.8.3), the number of tags we extracted from a corpus of 42M tokens was 3,561 – about 30 times larger than the most comprehensive English tag set. The large size of such a tag set is problematic in terms of data sparseness: each morphological combination appears rarely, and more samples are required in order to learn the probabilistic model. This large tagset is caused by the affixational property of the Hebrew language, which combines several words into one token (see section 2.4). We address this data sparseness problem by formalizing a word-based model which covers, compactly, all essential morphological features. While we do develop annotated datasets, we are interested in a model that supports unsupervised learning methods, since there is still not enough data available for supervised training. Unsupervised methods can also handle the dynamic nature of Modern Hebrew, as it evolves over time.

Unknown Words Morphological analyzers rely on a dictionary, and their performance is, therefore, impacted by the occurrence of unknowns – tokens that cannot be analyzed by the morphological analyzer. These tokens fall into two classes of missing information: tokens which are not recognized at all by the analyzer, and tokens for which the set of analyses proposed by the analyzer does not contain the correct analysis. The analysis of the

unknowns must be investigated, since they cover more than 7.5% of the corpus.
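The two classes of unknowns can be made concrete with a small sketch; `analyze` stands in for any morphological analyzer that returns a set of candidate analyses, and the toy lexicon and analysis strings are invented for illustration.

```python
# Categorizing "unknowns" relative to a gold analysis. The interface and
# the toy lexicon are hypothetical, not the actual KC analyzer API.

def classify_token(token, gold_analysis, analyze):
    candidates = analyze(token)
    if not candidates:
        return "unknown token"     # not recognized at all by the analyzer
    if gold_analysis not in candidates:
        return "unknown analysis"  # correct analysis missing from the set
    return "covered"

TOY_LEXICON = {
    "bbit": {"b+bit/NOUN", "bbit/NOUN"},
    "hlk": {"hlk/VERB"},
}

def analyze(token):
    return TOY_LEXICON.get(token, set())

print(classify_token("bbit", "b+bit/NOUN", analyze))
print(classify_token("xyz", "xyz/NOUN", analyze))
```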

Quality Criteria The system should be evaluated by its accuracy over the wide-coverage test corpus. In addition, we would like to evaluate the contribution of the system to other applications which use the tagged text. For that purpose we should implement some of these applications, such as an NP chunker, a Named Entity Recognizer, and a Word Prediction system.

Chapter 4

Resources

The art of the jigsaw puzzle begins with wooden puzzles cut by hand...

4.1 Corpora

This section describes the various corpora we used as part of our work. The main part of the corpus consists of raw text extracted from the stories of three daily newspapers, as published on the Web, together with the session protocols of the Israeli parliament. In addition, we developed two sets of morphologically annotated sentences and one set of named entity annotations, as well as making use of the available Hebrew treebank.

A7 News and articles from the Arutz7 site1 (2001–2006). Most of the corpus is comprised of short news stories of between 200 and 1,000 tokens each. The articles are written in a relatively simple style, with a high token/word ratio. The corpus contains about 15M tokens, and can be downloaded from the Hebrew Knowledge Center site.2

1 http://www.inn.co.il/
2 http://www.mila.cs.technion.ac.il/english/resources/corpora/a7corpus/

HR Parts of news and articles from the Ha'aretz newspaper (1991). The corpus contains about 11M tokens, and can be downloaded from the Hebrew Knowledge

Center site.3

TM Financial articles from The Marker newspaper (2002). The corpus contains about 700K tokens, and can be downloaded from the Hebrew Knowledge Center site.4

KN Session protocols of the Israeli parliament (the Knesset). The corpus contains about 15M tokens, and can be downloaded from the Hebrew Knowledge Center site.5

A7-T A sample of articles, comprising altogether 110,000 tokens, assembled at random from the A7 corpus. The articles were manually tagged by four taggers according to our tagging guidelines [40] (see section 5.1 for a description of the tagging procedure).

HR-T A manually tagged version of the raw text which is used in the Hebrew Treebank. The corpus, consisting of about 90K tokens, was tagged by four taggers according to our tagging guidelines [40] (see section 5.1 for a description of the tagging procedure).

TB The Hebrew Treebank of about 4,800 sentences (90K words) of news items, selected from the HR corpus, with full segmentation into morphemes and morpho-syntactic analysis. Morphological features that are not directly relevant for syntactic structures, such as roots, templates, and patterns, are not analyzed. The corpus is described in [113], and can be downloaded from the Hebrew Knowledge Center site.6

NE A set of newspaper articles in different fields – news, economy, fashion, and gossip – taken from Hebrew newspapers. The corpus consists of about 57,000 words and 4,700 name expressions, which were manually classified into six categories according to the Named Entity tagging guidelines [15] (see section 9.2). The corpus can be found at: http://www.cs.bgu.ac.il/∼nlpproj/naama/tagged corpus.txt.
3 http://www.mila.cs.technion.ac.il/english/resources/corpora/haaretz/
4 http://www.mila.cs.technion.ac.il/english/resources/corpora/themarker/
5 http://www.mila.cs.technion.ac.il/english/resources/corpora//
6 http://www.mila.cs.technion.ac.il/english/resources/corpora/treebank/

Corpus   TInst        TTp      LTp      TTags  WTags  Ambg
A7       15,084,842   315,699  133,395  3,250  359    2.7
HR       11,098,678   300,213  123,678  3,011  352    2.67
TM       695,306      60,550   21,028   2,017  323    2.75
KN       15,092,312   196,702  49,532   3,034  345    2.69
Total    41,971,422   493,455  233,395  3,561  362    2.7

Table 4.1: Statistics of the raw-text corpora used for morphological analysis.

Corpus   TInst    TTp     LTp     TTags  WTags  Ambg
A7-T     105,241  21,778  10,226  1,329  293    2.73
HR-T     89,337   23,638  11,394  1,429  300    2.69

Table 4.2: Statistics of the annotated corpora used for morphological analysis.

The statistics of these corpora are summarized in Tables 4.1 and 4.2: TInst – number of token instances; TTp – number of token types; LTp – number of lemmas; TTags – number of token-based tags; WTags – number of word-based tags; Ambg – average number of analyses per token instance. We will refer to these sets later, as part of our work.
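As an illustration, statistics of this kind can be computed from an analyzed corpus along the following lines; the corpus representation below is a hypothetical toy, not the actual format of our data.

```python
# Sketch of the statistics behind Tables 4.1-4.2, over a toy corpus of
# (token, analyses) pairs; each analysis carries a lemma and a tag.

from collections import namedtuple

Analysis = namedtuple("Analysis", "lemma tag")

def corpus_stats(corpus):
    tokens, lemmas, tags = set(), set(), set()
    n_inst, n_analyses = 0, 0
    for token, analyses in corpus:
        n_inst += 1
        n_analyses += len(analyses)
        tokens.add(token)
        for a in analyses:
            lemmas.add(a.lemma)
            tags.add(a.tag)
    return {
        "TInst": n_inst,              # token instances
        "TTp": len(tokens),           # token types
        "LTp": len(lemmas),           # lemmas
        "TTags": len(tags),           # distinct token-based tags
        "Ambg": n_analyses / n_inst,  # avg analyses per token instance
    }

corpus = [
    ("bbit", [Analysis("bit", "PREP+NOUN"), Analysis("bbit", "NOUN")]),
    ("hlk",  [Analysis("hlk", "VERB")]),
    ("bbit", [Analysis("bit", "PREP+NOUN"), Analysis("bbit", "NOUN")]),
]
stats = corpus_stats(corpus)
print(stats)
```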

4.2 Morphological Analyzer

In this work, we use the official analyzer of MILA – Knowledge Center for Processing Hebrew7 (hereinafter, the KC analyzer). It is a synthetic analyzer, designed in the spirit of Har'el and Kenigsberg [61]. The analyzer is composed of two data resources – a lexicon and a set of generation rules – and two programming tools – a word generator and an implementation of a map interface. The word generator applies the generation rules over the lexeme entries of the lexicon in order to generate all possible words of the language.
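This generate-then-look-up design can be sketched as follows; the lexemes, the toy pluralization rule, and the function names are invented for illustration and do not reflect the actual KC analyzer code.

```python
# "Generate, then look up": generation rules expand every lexeme into its
# possible surface forms offline; analysis then reduces to a map lookup.

from collections import defaultdict

def build_analysis_map(lexicon, generate):
    """Apply the generation rules to every lexeme, indexing surface forms."""
    surface_to_analyses = defaultdict(set)
    for lexeme in lexicon:
        for surface, features in generate(lexeme):
            surface_to_analyses[surface].add((lexeme, features))
    return surface_to_analyses

def generate(lexeme):
    # Toy rule: a singular base form and an -im plural.
    yield lexeme, "sg"
    yield lexeme + "im", "pl"

analyses = build_analysis_map(["spr", "sir"], generate)
# Analysis of a token is now a constant-time map lookup:
print(sorted(analyses))
```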

Lexicon The Hebrew lexicon is based on the lexicon designed by Yona and Wintner [124, 125], and is implemented in XML according to an XSD schema.8

7http://yeda.cs.technion.ac.il:8088/XMLMorphologicalAnalyzer/ XMLOutputAnalyzer.html 8http://www.mila.cs.technion.ac.il/hebrew/resources/standards/hebrew lexicon

Part of Speech   #Lexemes
Adjective        2,518
Adverb           449
Conjunction      92
Copula           60
Existential      15
Interjection     54
Interrogative    18
Modal            44
Negation         8
Noun             11,389
Numeral          59
Pronoun          86
Preposition      117
Proper name      3,688
Quantifier       37
Title            30
Verb             4,750
Prefix           26
Total            23,350

Table 4.3: POS distribution of the lexicon entries.

Lexicon items are represented by their dotted and undotted Hebrew scripts9 as well as their Latin transliteration. The other attributes of the lexicon items are determined by their POS type, and can be classified into inflection-pattern features and lexical sub-categorization (different types of numerals, conjunctions, pronouns, etc.). The POS distribution of the lexicon entries is summarized in Table 4.3.

Generation Rules The generation rules define the various ways of producing words from lexemes, by applying all possible inflections according to the general characteristics of their lexical category, with reference to the specific attributes of each given lexeme. Documentation of the generation rules is given in [91].

9Script variations – formal, typo, colloquial, and slang – are supported.

Chapter 5

Tagset Design

the object in view – whether it be a perceptual act, a piece of learning, a physiological system, or, in the case that concerns us, a wooden puzzle – is not a sum of elements that would first have to be isolated and analyzed, but a whole, that is to say a form, a structure.

The main morphological property of a word is its lexical category – its part of speech. The central parts of speech (verb, noun, and adjective) are part of the basic linguistic intuitions of all speakers. However, as we worked on the annotation of Hebrew text, we realized that a complete list of parts of speech is not well established, and that there is no agreement, among dictionaries and automatic tools, on the part-of-speech set for Hebrew. Beyond verb, noun, and adjective, many other lexical units appear in text, and each raises potential questions as to what is meant by part of speech, the best way to label every unit in a document, and how to distinguish among the various labels.

In contrast to a lexicographer who writes dictionaries, an annotator must assign a part of speech to all the words of a given text. The annotator cannot ignore foreign words, URLs, abbreviations, and misspelled words. The tagging process

may require identification of the words of the text, dealing with inter-token and multi-token words. Non-standard categories – such as titular, modal, participle, existential, and copula – are required for computational taggers in order to support the requirements of natural language processing systems (parsers, semantic analyzers). The tags we suggest for Hebrew are: adjective, adverb, conjunction, copula, existential, interjection, interrogative, modal, negation, noun, numeral, prefix, preposition, pronoun, proper name, quantifier, title, verb. The full description of these tags is given in our tagging guidelines [40]. Our main conclusion is that a tagset definition and tagging criteria cannot be imported from one language to another, nor can we rely on existing dictionaries. Instead, a tagset should be specifically defined over large-scale corpora of a given language, in order to tag all words with a high level of agreement. In this chapter, we present some of the issues we faced while designing a tagset for Hebrew, and illustrate the methodology through which we designed it.1

5.1 Methodology

We used the following scheme while designing a tagset:

1. Use dictionaries of the language and/or import a standard tagset and tagging guidelines of another language (such as the English treebank).

2. Select a corpus.

3. Assign a group of annotators to tag the words in the corpus according to the guidelines.

4. Measure the agreement among the annotators; if high agreement is achieved, stop.

5. Identify the main disagreement factors.

1This chapter reports on joint work with Yael Netzer and David Gabay. Related publications include [40], [88], and [3].

6. Redefine the tagset and the tagging criteria.

7. Go to step 3.

We employed four students for the tagging task. An initial set of guidelines was first composed, relying on the categories found in several dictionaries and on the Penn Treebank POS guidelines [104]. As many words from the corpus were either missing or tagged in a non-uniform manner in the lexicons, we recommended looking up missing words in traditional dictionaries. Disagreement was also found among copyrighted dictionaries, both for open- and closed-set categories. Given the lack of a reliable lexicon, the taggers were not given a list of options to choose from, but were free to tag with whatever tag they found suitable. The process, although slower and bound to produce unintentional mistakes, was used to build a lexicon, and to refine the guidelines and, on occasion, modify the POS tagset.

When constructing and then amending the guidelines, we sought the best trade-off between the accuracy and meaningfulness of the categorization and the simplicity of the guidelines, which is important for consistent tagging. Initially, each text was tagged by four different people, and the guidelines were revised according to questions or disagreements that were raised. As the guidelines became more stable, the disagreement rate decreased; each text was then tagged by only three people, and eventually by two taggers and a referee who reviewed the disagreements between the two. The disagreement rate between any two taggers was initially as high as 20%, and dropped to 1% after a few rounds of tagging and revising the guidelines. Major sources of disagreement identified include:

• Prepositional phrases vs. prepositions: In Hebrew, formative letters – b, c, l, m – can be attached to a noun to create a short prepositional phrase. In some cases such phrases function as a preposition, and the original meaning of the noun is no longer clearly felt. Some taggers would tag the word as a prepositional prefix + noun, while others tagged it as a preposition, e.g., b‘iqbot (following), which can be tagged as b-iqbot (in the footsteps of).

• Adverbial phrases vs. adverbs: The problem is similar to the one above, e.g., bdiyuq (exactly), which can be tagged as b-diyuq (with accuracy).

• Participles vs. adjectives: As both categories can modify nouns, it is hard to distinguish between them, e.g., mabat. m’ayem (a threatening stare) – the category of m’ayem is unclear.

• Participles vs. nouns: e.g., the word hamictaynim in the following:

(5.1) hamictaynim balimudim yeqablu mlagot
the-excellent in-the-studies will-receive scholarships
'Those who do well in their studies will receive scholarships.'

• Modality: A set of words that expresses modality and commonly appears before verbs in the infinitive. Such words were tagged as adjectives or adverbs, and the taggers were systematically uncertain about them.
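The pairwise disagreement rate discussed above can be computed as in the following sketch; the tag sequences are toy data and the function is ours.

```python
# Pairwise inter-annotator agreement: the fraction of tokens on which two
# taggers chose the same tag, averaged over all pairs of taggers.

from itertools import combinations

def pairwise_agreement(tag_sequences):
    rates = []
    for a, b in combinations(tag_sequences, 2):
        same = sum(x == y for x, y in zip(a, b))
        rates.append(same / len(a))
    return sum(rates) / len(rates)

t1 = ["NOUN", "VERB", "MODAL", "NOUN"]
t2 = ["NOUN", "VERB", "ADJ", "NOUN"]
t3 = ["NOUN", "VERB", "MODAL", "ADV"]
print(f"agreement: {pairwise_agreement([t1, t2, t3]):.2f}")
```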

Besides the disagreement among taggers, there was also significant disagreement among the Modern Hebrew dictionaries we examined, as well as computational analyzers and annotated corpora. For example, in Table 5.1 the various selected POS tags for modal words are listed, as determined by: (1) Rav Milim [30], (2) Sapir [9], (3) Even-Shoshan [42], (4) Knaani [66], (5) HMA [26], (6) Segal [108], (7) Yona [124], (8) Hebrew Treebank [113]. As can be seen, eight different POS tags were suggested by these dictionaries: adJective (29.6%), adveRb (25.9%), Verb (22.2%), Auxiliary verb (8.2%), Noun (4.4%), parTicle (3.7%), Preposition (1.5%), and Unknown (4.5%). The average number of options per word is about 3.3, which corresponds to about 60% agreement. For none of the words was there full agreement, and the POS of only seven words (43.75%) can be determined by voting (i.e., there is one major option). The tagset design methodology we present is corpus-based and application-oriented. For each tag, the criteria and tests that determine whether a token should be tagged come from corpus observations. Our main objective is to ensure consistency, coverage, and tagging agreement. The tag criteria are basically


Word (gloss): example – translation – tags from sources 1–8

yeˇs (should): yeˇs la´sim leb lanisuh. – Attention should be paid to the wording – R N N R N A R V
’ein (shouldn't): ’ein la´sim leb lanisuh. – Attention should not be paid to the wording – N R U U P P R V R
h.ayab (must): hacibur h.ayab lhabin ’et ha‘inyan – The public should be made aware of this issue – J J J J J J J V
mutar (allowed): mutar lah lacet lt.iyul – She is allowed to go on a trip – R N J R J A V J
’asur (forbidden): ’asur lah lacet ltiyul byom riˇson – She is not allowed to go on a trip on Sunday – J R R R R J A J V
’epˇsr (may): ’epˇsr lirmoz rmazim – Giving hints is allowed – U R R R T A R V
’amur (supposed): naˇsim ’amurot lilboˇs r‘alot – Women are supposed to wear veils – J A J J J A J V
carik (should): bmw”m carik la‘amod ‘al ˇselka – In negotiation you should keep strong – J J R J J A J V
nitan (can): nitan liptor b‘ayah mak’ibah zo – This troublesome problem can be solved – U V V V V V V V
‘alul (may): hakeleb ‘alul linˇsok – The dog may bite – J J J J J A J V N
kda’y (worthwhile): kda’y liˇs’ol ha’im hadelet ‘a´suyah heit.eb – It is worth asking whether the door is well built – R R R R J A R J
mutab (better): mutab lihyot beˇseqet. wulhnot – Better to keep quiet and enjoy – R R R R T T V V
msugal (able): hu’ hayah msugal lir’oto babait halaban – He could envision him sitting in the White House – J J R J J A J V V
yakol (can): ’anaˇsim ykolim litrom trumot – People can make contributions – V V V J V A V V
’ikpat (care/mind): ’ikpat lka laleket? – Do you mind going? – V V U U T T R V R R

Table 5.1: Parts of speech of selected modals in various dictionaries. (In rows listing more than eight tag letters, the letters are reproduced in printed order; their column alignment is uncertain.)

morpho-syntactic, with semantic considerations and practical deliberation. Each tag is measured by its usage impact on the disambiguation process, and on applications which make use of the morphological properties. In the following sections we demonstrate the use of our methodology for the definition of four categories in Modern Hebrew – modals and participles, prepositions, and adverbs.
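The dictionary-agreement figures reported for Table 5.1 (about 3.3 tag options per word, and a POS determinable by majority voting for only some of the words) can be reproduced with a small sketch; the rows below are toy examples typed from the table, using the same one-letter tag codes.

```python
# Majority voting over per-dictionary POS tags, as in the discussion of
# Table 5.1 (tag letters as in the table; rows here are illustrative).

from collections import Counter

def n_options(tags):
    """Number of distinct tags proposed for a word."""
    return len(set(tags))

def vote(tags):
    """Majority tag, or None when no single tag is strictly most frequent."""
    counts = Counter(tags).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

hayab = list("JJJJJJJV")  # 'must': clear majority (J)
yes_ = list("RNNRNARV")   # 'should': R and N tie, no majority

print(n_options(hayab), vote(hayab))
print(n_options(yes_), vote(yes_))
```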

5.2 Modality

5.2.1 Modality in Hebrew

We investigate the existence of a modal category in Modern Hebrew by analyzing the characteristics of a class of words from morphological, syntactic, semantic, and practical points of view. The decision as to whether to introduce a modal tag in a Hebrew tagset has practical consequences: we counted that over 3% of the tokens in our 27M-token corpus can potentially be tagged as modals. Beyond this practical impact, the decision process illustrates the method through which our tagset can be derived and fine-tuned. Semantically, the Modus is considered to be the attitude on the part of the speaking subject with regard to the content of the utterance [37], as opposed to the Dictum, which is the linguistic realization of a predicate. While a predicate is most commonly represented with a verb, modality can be uttered in various manners: with adjectives and adverbs (definitely, probable), with thought/belief verbs, through mood or intonation, or with modal verbs. The latter are recognized as a grammatical category in many languages (modals), e.g., can, should, and must in English. From the semantic perspective, modality is coarsely divided into epistemic modality (the amount of confidence the speaker holds with reference to the truth of the proposition) and deontic modality (the degree of force exerted on the subject of the sentence to perform the action) [33].

Modal expressions do not constitute a syntactic class in Modern Hebrew [69]. In her work, Kopelovich reviews traditional descriptive publications on the syntax of Hebrew and claims that these works (Ornan, Rubinstein, Azar, Rosen, Aronson-Berman, and Maschler)2 do not provide a satisfactory description or explanation of the matter. In this section we review three major approaches to modality in Hebrew: the first is semantic (Kopelovich), the second is semantico-syntactic (Zadka), and the third is purely morphologico-syntactic (Rosen).

Kopelovich provides three binary dimensions that describe the modal system in Hebrew: Personal–Impersonal, Modality–Modulation, and Objective–Subjective. The Personal–Impersonal system is connected to the absence or presence

of a surface subject in the clause. A personal modal has a grammatical subject:

(5.2) dawid carik lhasi‘ ’et ’imo
David should to-drive ACC mother-POSS
'David should drive his mother'

An impersonal modal has no grammatical subject, and modality predicates

the entire clause.

(5.3) carik lhasi‘ ’et ’imo la‘abodah
should to-drive ACC mother-POSS to-the-work
'His mother should be driven to work'

Kopelovich makes no distinction between the various syntactic categories that the words may belong to, and interchangeably uses examples of words like

mutar, yeˇs, ’epˇsar [adverb, existential, participle respectively]. The Modality-Modulation plane, according to the functional school of Halliday [59], refers to the interpersonal and ideational functions of language: Modality

expresses the speaker's own mind (epistemic modality – possibility, probability, and certainty): ‘alul laredet geˇsem mah.ar (it may rain tomorrow).

Modulation participates in the content clause expressing external conditions in the

world (deontic modality – permission, obligation, ability, and inclination): ’ata yakol lhath.il ‘akˇsaw (you can start now). Modality does not carry
2 For reference see [69]; see below for Rosen's analysis.

tense and cannot be negated, while modulation can be modified by tense and can be negated. The Objective–Subjective plane is what Kopelovich calls the perception of the world. Objectivity is achieved using different tenses of to-be in conjunction with the modal (including the tense of the modal if it is a verb), and their order subjective vs. objective:

(5.4) dawid haya carik lisw‘ ltel ’abib
David was have to-drive to-Tel Aviv
'David had to drive to Tel Aviv'

(5.5) kdei lha‘abir ’et hahah.lata, carik haya lkanes ’et kol ha‘obdim
In-order to-pass ACC the-decision, should to-assemble ACC all the-employees
'In order to obtain a favorable vote on this decision, all of the employees had to be assembled.'

Zadka [128] defines the class of single-argument bridge verbs,3 i.e., any verb or pro-verb that can have an infinitive as a subject or object, and that does not accept a subordinate clause as subject or object:

(5.6) [subject] ’sur l‘aˇsen
forbidden to-smoke
'It is forbidden to smoke'

(5.7) [object] hua racah/hth.il lˇsah.eq
he wanted/started to-play
'He wanted/started to play'

3 'Ride Verb' in Zadka's terminology.

(5.8) Yosep hith.il/’amad/msugal liqro’ ’et hado”h. bimlo’o.
Yosef began/is-about/is-capable to-read ACC the-report entirely.
'Yosef began/is about/(is capable) to read the report entirely.'

(5.9) *Yosep hith.il/’amad/msugal ˇsiqra’ ’et hado”h. bimlo’o.
*Yosef started/was-about/is-capable that-he-read ACC the-report entirely.

Zadka classifies these verbs into seven semantic categories: (1) Will, (2) Manner, (3) Aspect, (4) Ability, (5) Possibility/Certainty, (6) Necessity/Obligation, and (7) Negation. Categories 1, 4, 5, 6, and 7 are considered by Zadka to include pure modal verbs, e.g., alethic and deontic verbs. In his paper, Zadka defines classification criteria that refer to syntactic-semantic properties, such as: whether the infinitive verb can be realized as a relative clause, whether the subject of the verb and its complement are the same, whether the infinitive can be converted to a gerund, and the animacy of the subject; deep semantic properties – argument structure and selectional restrictions, the ability to drop a common subject of the verb and its complement, and factuality level (factual, non-factual, counter-factual); and morphological properties.

modals by Kopelovich since they can be inflected by tense (with the exceptions

of ’amur (supposed), ‘atid (should)). Ability verbs are yakol (can)

and msugal (can,capable) [participle]. They have both an animate actor as a subject and an infinitive as a complement, with the same deep subject. These

verbs are counter-factual.

Certainty verbs include mukrak (must), carik (should), ne’elac

(be forced to), yakol (can), hekreh. i (necessary), ‘a´suy (may),

‘alul (might), capuy (expected). They represent the alethic and epistemic necessity or possibility of the process realized in the clause. All of them cannot be inflected morphologically. The modal predicates the whole situation in the

proposition, and may be subjective (epistemic) or objective (alethic). The subject of these verbs is coreferent with the subject of the modal:

(5.10) ’ani mukrak liqnot mkonit
I must to-buy car
'I must buy a car'

Necessity/Obligation includes adjectives – e.g., h.ayab (must), raˇsa’y (allowed) – gerunds – mukrah. (must), ’asur (forbidden), mutar (allowed)4 – and the verb yakwl (can). Necessity verbs/pro-verbs present deontic modality, and all clauses share – in Zadka's view – a causing participant that is not always realized in the surface form. It appears that Zadka's classification relies primarily on semantic criteria.

From the morphological point of view, one may characterize impersonals by their non-inflectional behavior. The words yeˇs, ’ein, mutar, ’asur, ’epˇsar, and ’ikpat do not inflect in number and gender with their argument. But this criterion leaves out words that are classified as modals by Zadka and do have gender-number inflections, e.g., ra’uy, msugal, ‘alul, ’amur, carik, yakol. On the other hand, extending the modal class to all the gender-number inflected words which are complemented by an infinitive or a relative clause would include clear adjectives (musmak (certified)), nouns (zkut (credit)), and participles (nimna‘ (avoid)) as well.

Rosen [102, pp. 113–115] defines a syntactic category he calls impersonals. Words in this category occur only as the predicative constituent of a sentence with an infinitive or a subordinate-clause argument. Past and future tense is

marked with the auxiliary verb hayah (to-be). In addition, impersonals cannot function as predicative adjectives: kda’i (worthwhile), mutab (better),

’ikpat (care/mind). Personal reference can be added to the clause (governed

by the infinitive) with the l dative preposition:

4 As well as nouns and prepositions – among them yeˇs and ’ein – according to Zadka.


(5.11) kda’y li liˇstot
worthwhile to-me to-drink
'It is worthwhile for me to drink'

We have reviewed three major approaches to categorizing modals in Hebrew. Semantic – represented mostly in Kopelovich's work; modality is categorized by three dimensions of semantic attributes. Since her claim is that there is no syntactic category of modality at all, this approach 'over-generates' modals and includes words that from any other syntactic or morphological view fall into other parts of speech. Syntactic-semantic – Zadka classifies seven sets of verbs and pro-verbs following syntactic and semantic criteria. His claim is that modality actually is marked by syntactic characteristics, which can be identified by structural criteria. However, his evaluation mixes semantic with syntactic attributes. Morphological-syntactic – Rosen's definition of impersonals is strictly syntactic-morphological and does not try to characterize words with modality. Consequently, words that are usually considered modals are not included in his definition, such as ’asur (forbidden), mutar (allowed), and yakol (can).

5.2.2 Proposed Modal Guidelines

The variety of criteria proposed by linguists reflects the disagreements we identified in lexicographic work about modal-like words in Hebrew. For a computational application, all words in a corpus must be tagged. Given the complex nature of modality in Hebrew, should we introduce a modal tag in our tagset or, instead, rely on other existing tags? We have decided to introduce a modal tag in our Hebrew tagset. Although there is no distinct syntactic category for modals in Hebrew, we propose the following criteria: (i) They have an infinitive complement or a clausal

complement introduced by the binder ˇs. (ii) They are NOT adjectives. (iii) They have irregular inflections in the past tense, i.e., raciti lada‘at (I wanted to know) is not a modal usage.


The tests to distinguish modal from non-modal usages are:

• yeˇs and ’ein, which can also be existential, are used as modals if they can be replaced with …

• Adjectives are gradable and can be modified by m’od (very) or yoter (more).

• Adjectives can become modifiers of the nominalized verb: qal laharos ⇒ haharisah qala m’od (easy to destroy ⇒ the destruction is easy).

• In all other cases where a verb serves to convey modality, it is still tagged as a verb, e.g., muban ˇsyosi hu’ hamnaceh. (it is clear that Yossi is the winner).

We first review how these guidelines help us address some of the most difficult tagging decisions we had to face while building the corpus; we then indicate quantitatively the impact of the modal tag on the practical problem of tagging.

Care

One of the words tagged as a modal in our corpus – the word ’ikpat – is traditionally not considered to be a modal. However, at least in some of its instances, it fits our definition of modal, and it can also be interpreted as modality according to its sense. The only definition that is consistent with our observation is Rosen's impersonals.

Looking back at its origins, we checked the Historical Lexicon of the Hebrew Language;5 in the Talmud and the Mishna the word appears only in the following construction:

(5.12) mah ’ikpat lk
what care to-you
'What do you care?'

5 http://hebrew-treasures.huji.ac.il/, an enterprise conducted by the Israeli Academy of the Hebrew Language.

Similarly, in the Ben Yehuda Project – an Israeli version of the Gutenberg project6, which includes texts from the Middle Ages to the beginning of the 20th century – we found 28 instances of the word, with the very same usage as in older times.

While trying to determine its part of speech, we do not identify ’ikpat as a NOUN – as it cannot take a definite marker7 – and it is not an adjective.8

Traditional Hebrew dictionaries consider it to be an intransitive verb [9, 42, 68] or an adverb. Some dictionaries from the middle of the 20th century [54, 66], as well as recent ones [30], did not give it a part of speech at all. In our corpora ([A7], [HR], and [TM] – 27M tokens of newspaper text) we found

130 occurrences of the word, of which 55 have an infinitive/relative-clause complement, 35 have a null complement, and 40 have a m- PP complement: ’ikpat lo mehamdina (he cares for the country). The latter has no modal interpretation, and we claim that in this case the word should be tagged as a participle. The test to distinguish modal from participle is the possibility of rephrasing the clause into an adjective phrase. In contrast to adjectival participles, modals do not

modify the noun or the subject of the activity, but refer to the whole action.

(5.13) [Modal]
’ikpat lo liˇstop kelim ⇒ *hu’ ’ikpati klapei ˇst.ipat kelim
mind him to-wash dishes ⇒ *he concerned for washing dishes
'He minds washing dishes' ⇒ *'He is concerned about washing dishes'

(5.14) [Participle]
’ikpat lo meha‘aniyim ⇒ hu’ ikpati klapei ha‘aniyim
care him of-the-poor-people ⇒ he caring for the-poor-people
'He cares for the poor people' ⇒ 'He is caring for the poor people'

6 http://www.benyehuda.org, http://www.gutenberg.org
7 Although we found on the Internet clauses such as nistmu li naqbubiyot ha’ikpat (my caring pores got blocked).
8 Only its derivatives ’ikpati, ’ikpatiyut (caring, care) allow adjectival usage.

As for the occurrences with infinitive/relative-clause complements, all the other tests for modality hold: (1) an infinitive/relative-clause complement, (2) not an adjective, (3) irregular inflection (no inflection at all). To conclude this section, our proposed definition of modals allows us to tag this word in a systematic and complete manner and to avoid the confusion that characterizes this word.

Hard

Some of the words tagged as modals are commonly referred to as adjectives, such as ’asur and mutar (forbidden, allowed). However, the question arises of

how to tell apart such modals from adjectives that show very similar properties:

qaˇse li laleket (it is hard for me to walk). Ambar [5] analyzes the usage

of adjectives in modal contexts, especially of ability and possibility. In sentences

such as qaˇse lanu lhistagel lara‘aˇs (it is hard for us to get used to the noise), the adjective is used in a modal and not an adverbial meaning, in the sense that the meaning of the adverbial bqwˇsi (with difficulty) and the modal yakwl (can) are unified into the single word qaˇse. Similarly, the possibility sense of … is unified with the modal ’epˇsar. In any usage of the adjective as a modal, it is not possible to rephrase the clause in a way that the adjective modifies the noun, i.e., its scope is the action itself and not its subject.

(5.15) qaˇse lbace‘ ’et haheskem
hard to-perform ACC the-agreement
'It is hard to perform the agreement'

(5.16) *haheskem kaˇse
*the-agreement hard
*'The agreement is hard'

However, following Ambar, there are cases where the usage of qaˇse le- is not modal, but an emotional adjective:


(5.17) qaˇse/na‘im lˇsoh.eh. ’ito
hard/pleasant to-chat with-him
'It is hard/pleasant to chat with him'

Berman [18] classifies subjectless constructions in Modern Hebrew, and distinguishes what she calls dative-marked experientials, where (mostly) an adjective serves as a predicate followed by a dative-marked nominal:

(5.18) qaˇse le-rinah bah.ayim
hard for-Rina in-the-life
'It is hard for Rina in life'

Adjectives that allow this construction are circumstantial and do not describe an inner state: rina ‘acuba (Rina is sad) vs. ‘acub lrina (it is sad for Rina). Another recognized construction is the modal expression, with dative marking on the individual to whom the modality is imputed: ’asur lanu ldaber kakah (forbidden to-us to-talk like-this – we are not allowed to talk like this). Berman suggests that the similarity is due to the perception of the experiencer as recipient in both cases; this suggestion implies that Berman does not categorize the modals (’asur, mutar) as adjectives.

Another possible criterion to allow these words to be tagged as modals (following Zadka) is the fact that for Necessity/Obligation modals there exists an 'outside force' which is the agent of the modal situation. Therefore, if ’asur lanu ldaber kakah (we are not allowed to talk like this), this is because someone forbids us from talking, while if qaˇse lrinah bah.ayim (it is hard for Rina in life), no 'outside force' is obliged to be the agent which makes her life hard. To conclude, we suggest tagging both ’asur and mutar as modals, and we recommend allowing modal tagging for other adjectives in this syntactic structure.

5.2.3 The Importance of Tagging Modals

We recommend the introduction of a modal POS tag in Hebrew, despite the fact that the set of criteria for identifying modal usage is a complex combination of syntactic and morphological constraints. The lexemes of modal type found in our corpus are:

’ikpat, ’ein, ’epˇsar, ’asur, day, yeˇs, kdai, h. abal, h. obah, mutab, ’amur, mukan, msugal, nitan, ‘alul, ‘a´suy, raˇsay, zakay, h. ayab, h. aˇsub, carik, ra’uy, zaquq, mukrah. , racuy, yakol. This class covers as many as 3% of the tokens observed in our corpus.

Our main motivation in introducing this tag in our tagset is that the alternative (no modal tag) creates confusion and disagreement: we have shown that both traditional dictionaries and previous computational resources had a high level of disagreement over the class of words we tag as modals. We have confirmed that our guidelines can be applied consistently by human taggers, with agreement level similar to the rest of the tokens (over 99% pairwise). We have checked that our guidelines stand the test of the most difficult disagreement types identified by taggers, such as ‘care to’ and ‘difficult for’.

The immediate context of modals includes a high proportion of infinitive words. Infinitive words in Hebrew are particularly ambiguous morphologically, because they begin with the letter l, which is a formative letter, and often admit an analysis as le + participle; a single surface form can be interpreted, depending on context, as liˇsmwr (to guard), le-ˇsamur (to a guarded), or la-ˇsamur (to the guarded). Other ambiguities may occur too; another surface form can be interpreted as laˇsir (to sing), le-ˇsir (to a song), or as la-ˇsir (to the song). We have measured that, on average, infinitive verbs in our expanded corpus can be analyzed in 4.9 distinct manners, whereas the overall average for all word tokens is 2.7. The identification of modals can serve as an anchor which helps disambiguate neighboring infinitive words.
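The ambiguity figures just cited come from averaging the number of distinct analyses per token; the computation itself is trivial and can be sketched as follows (the token names and counts below are invented for illustration, not taken from the actual corpus or the analyzer's API):

```python
# Sketch: average number of morphological analyses per token class.
# The analyzer interface is abstracted away: `analyses` maps each token
# to its number of distinct analyses. All counts are illustrative only.

def average_ambiguity(analyses, tokens):
    """Mean number of analyses over the given tokens."""
    return sum(analyses[t] for t in tokens) / len(tokens)

# Toy data: two l-initial infinitive-like tokens and three other tokens.
analyses = {"lsmwr": 5, "lsyr": 4, "hlk": 3, "byt": 2, "smr": 3}
infinitives = ["lsmwr", "lsyr"]

print(average_ambiguity(analyses, infinitives))      # infinitive average
print(average_ambiguity(analyses, list(analyses)))   # overall average
```

Run over the real corpus, the same computation yields the 4.9 vs. 2.7 contrast reported above.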

The impact of the modal tag should be measured by two criteria: accuracy of the morphological disambiguation, and its usage for other applications, such as noun phrase chunking. We intend to perform this detailed analysis in the future.

5.3 Beinoni

As noted by Gesenius [46, p. 355], beinoni forms occupy a middle place between the noun and the verb. Morphologically, they are simple nouns, i.e., they carry gender, number, and status inflections, definiteness, and affixation, and bear no person or tense/mood marking. From the semantic point of view, according to traditional linguists such as Gesenius [46], Hebrew participles do not represent a fixed state but rather some sort of action or activity, in contrast to nouns and adjectives (a claim which is not supported by nominalization). In the following sections, we discuss the need for a special lexical category for participles, its characteristics, and its implementation.

5.3.1 Beinoni in Hebrew

We use the term beinoni to denote various forms of Hebrew tokens:

1. ‘Present-verb-like’ forms, with optional w, ˇs, h prefixes:

• h. olmim (dreaming/dreams)

• ˇsamur (being guarded/is guarded)

• ˇseniˇsmarot (that are being guarded/that are guarded)

• hah. olemet (that is dreaming/that dreams)

• mˇsumar (is being conserved/is conserved/a conserve)

• weˇsomrot (and are guarding/and guard/and guards)

2. ‘Present-verb-like’ forms, with b, k, l, m prefixes:

• bˇsomrim (at guards/at (those that) guard)

• bˇsamur (at (those that) are guarded)

• laniˇsmarot (to (those that) are guarded)

• kmˇsameret (as preserves/like a sifter)

• mhamˇsumar (of the (one that) is conserved/of the conserve)

• wlaˇsomrim (and for (those that) are guarding/and for the guards)

3. Construct state forms of nouns and adjectives, including those which are not part of the lexicon:

• ˇsomrei (the guards/(those that) guard)

• ˇsomeret (the guard/the (one that) guards)

4. Noun and adjective forms, including those which are not part of the lexicon, with pronominal suffix:

• ˇsomrab (his guards/(those that) guard him)

There are four possible tags for these forms: present verb, participle, noun, and adjective. Each form may be tagged by different subsets of these tags:

1. Verb in present tense (with relativizer/subordinate conjunction), fixed noun or adjective – in case there is a lexicon entry for such noun or adjective – (with definite article), participle (with definite article or relativizer/subordinate conjunction).

2. Fixed noun or adjective (in case there is a lexicon entry for such noun/adjective), participle.

3. Noun or adjective in construct state (in case there is a lexicon entry for such noun/adjective), construct state of participle.

4. Noun or adjective with possessive pronoun (in case there is a lexicon entry for such noun/adjective), participle with accusative pronoun suffix.

5.3.2 A Lexical Category for Beinoni

For the purpose of this work, we are interested in the classification of the above beinoni forms. Rosen [102, pp. 106–107] argues for a participle category, which covers the participle and present verb forms. Blao [19, p. 186], on the other hand, treats as verbs those participle forms which the tests do not disambiguate to either noun or adjective.

A similar argument can be found in modern analyzers: Rav Milim and Yona have no participle category, i.e., all the verbal interpretations are classified as verbs with a beinoni tense, which is the tense of the present forms as well, e.g., ˇsomrim, mgulgalim (are guarding, are being rolled). The participles are classified in the lexicon into three categories: (1) ‘exclusively’ nouns/adjectives, with no possible verbal analysis, e.g., tapur, mlumad, soper (sewn, scholarly, writer); (2) nouns and adjectives which have a verbal interpretation as well, e.g., mgulgal, ˇsomer (rolled/is rolled, guard/guards, terrorist/sabotages); (3) exclusively verbal forms, e.g., is broadcast, counts, curses. The KC analyzer (see section 4.2) defines a participle category, generally composed of the beinoni tense verbs of Rav Milim and Yona.

This categorization decision is related to several issues: can the list of nouns and adjectives in the lexicon be extended by all participle forms? What is the correct reading of the h prefix of the above beinoni forms: definite article or relativizer/subordinate conjunction (see Rosen [102, p. 107, footnote 92])? Is there a conceptual difference between participle tense and present tense? Is there a hidden person mark for present and participle verbs? How do participles relate to the formation of generics in Hebrew?

In the following sections, we analyze this question from the lexical, morphological, syntactic, semantic, and practical points of view. We conclude that a distinct participle category should be defined; in contrast to the KC analyzer, we propose that present verbs be assigned to the verb category and distinguished from participles.

Lexicon

We compare the sets of analyses proposed by the same eight dictionaries (see above) and find significant differences, as shown in Table 5.2. The differences are mainly lexical, i.e., the noun and adjective lists differ from one lexicon to another. For a given lexicon, we suggest applying the verb-noun and verb-adjective tests presented in Appendix B.1 to every verb entry, in order to conclude whether the given lexeme forms an additional noun and/or adjective entry.

Morphology

From the morphological point of view, the beinoni forms are inflected and affixed as nouns, as shown in Table 5.3.9 According to Rav Milim, which has no beinoni category, the verb category therefore contains tokens with a status property, as well as possessive pronoun suffixation. The KC analyzer, on the other hand, combines participles and present verbs, which have different affixation mechanisms and status marking, under the same participle category.

Syntax

From a syntactic point of view, certain noun/adjective beinoni forms cannot be considered as verbs, nor as nouns/adjectives.

Noun/adjective usages that cannot be considered as verbs

Tense Affinity Noun/adjective usages have no tense affinity, in contrast to present verbs [19, p. 186]. The same surface form (beinoni) can be used as a noun/adjective or as a present verb. How can we distinguish between these two usage types? Present verb usages are bound to present tense, while noun/adjective usages can occur in any tense context. Aspect is not relevant to

9One might distinguish morphologically between nouns and adjectives by the existence of a possessive pronoun suffix, which is somewhat awkward for adjectives, e.g., tpuraw, ‘acubeyah (his sewns, her sadness), but we decided to consider such constructions as adjectives and not as nouns – see Appendix B.1.3.


[Table 5.2 appears here in the original. Its per-dictionary POS assignments (noun/adjective/verb, columns 1–8) for the selected participle forms – ’ahub, ’amur, ’aˇsem, btelah, bimˇsutap, yaˇsub, maziqim, hamukˇsarim, hanimna‘, mˇsulal, pcu‘ah, ˇsobeh, yadu‘ – do not survive plain-text extraction; only the example sentences and glosses remain legible.]

Table 5.2: Suggested POS lists for selected participle forms in various dictionaries.

              Gender  Number  Status  w,ˇs,h  bklm  suffix
Noun            V       V       V       V      V      V
Adjective       V       V       V       V      V      X
Present Verb    V       V       X       V      X      V
Beinoni         V       V       V       V      V      V

Table 5.3: Morphological classification of participle forms.
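The classification of Table 5.3 amounts to a small feature matrix, and can be encoded directly; the sketch below (the category and feature names are our own labels, not the analyzer's) checks whether a morphological property is licensed for a category:

```python
# Morphological classification of participle forms (after Table 5.3),
# encoded as a feature matrix: True = the category licenses the feature.
FEATURES = ("gender", "number", "status", "wsh_prefix", "bklm_prefix", "suffix")

MATRIX = {
    "noun":         (True, True, True,  True, True,  True),
    "adjective":    (True, True, True,  True, True,  False),
    "present_verb": (True, True, False, True, False, True),
    "beinoni":      (True, True, True,  True, True,  True),
}

def allows(category, feature):
    """Does `category` license `feature` according to the table?"""
    return MATRIX[category][FEATURES.index(feature)]
```

For example, a present-tense verb takes no status inflection or bklm prefix, while a beinoni form allows the full nominal paradigm.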

this distinction – beinoni in verbal usage can denote both progressive and simple tenses (in contrast to the English present participle, which is bound to the progressive aspect).

The following examples indicate simple syntactic tests that distinguish between verbal and noun/adjective usages:

[verbal usage of beinoni form: present progressive]

(5.19) hah. ayalim mit’amnim ‘akˇsaw ⇒ *hah. ayalim mit’amnim ’etmol
the-soldiers that-are-training now ⇒ *the-soldiers that-are-training yesterday
the soldiers that are training now ⇒ *the soldiers that are training yesterday

[verbal usage of beinoni form: present simple]

(5.20) hah. ayalim mit’amnim byamim ’elu ⇒ *hah. ayalim mit’amnim bayamim hahem
the-soldiers train at-days these ⇒ *the-soldiers train at-days those
the soldiers train these days ⇒ *the soldiers train those days

[nominal usage of beinoni form]

(5.21) mit’amnim magi‘im ⇒ mit’amnim higi‘u
training are-arriving ⇒ training arrived
trainees are arriving ⇒ trainees arrived

Shlonsky [112, chapters 2–5] claims that verbal beinoni is a participle, and that Hebrew has a null auxiliary. Shlonsky employs Chomsky's government and binding approach in order to present tense sentences on a par with compound tense constructions – beinoni is a hybrid form, a verb whose agreement features are participial but which raises to T0. In spite of his elegant word-order and clause-structure analysis, we prefer, for our purpose, to avoid modeling syntactic movements, and formalize a definition which is based on the tokens as they appear in the text.

Explicit Subject Noun/adjective usages do not require an explicit subject. Beinoni in verbal usage requires an explicit subject, which can be absent from noun/adjective constructions; i.e., the token mt.apsim (climbers) in the phrase mt.apsim huˇsqu (climbers were given water) can only be interpreted as a noun usage and not as a present verb.

Adjective modifier An adjective modifier is possible for noun usages of beinoni, in contrast to present verbs [112, pp. 27–28].

[noun]

(5.22) hi’ manhigah dgulah
she a-leader outstanding
she is an outstanding leader

[beinoni]

(5.22) *hu’ noheg hagun
*he acts decent
*he decent acts

Noun usages that cannot be considered as nouns

Complement A complement is not necessarily required for nouns, in contrast to noun usages of the beinoni form of transitive verbs [112, pp. 27–28].

[verb]

(5.23) hi’ manhigah ’et haqbucah
she is-leading ACC the-group
she is leading the group

[noun]

(5.24) hi’ manhigah
she a-leader
she is a leader

[verb]

(5.25) hu’ loked nh. aˇsim
he traps/is-trapping snakes
he traps/is trapping snakes

[beinoni]

(5.26) *hu’ loked
*he traps/is-trapping
*he traps/is trapping

Genitive ˇsel Noun usages of beinoni cannot take the genitive ˇsel nor a possessive pronoun suffix, in contrast to regular nouns [112, pp. 27–28].

[noun]

(5.27) hi’ manhigah ˇsel qbucah
she a-leader POSS a-group
she is a leader of a group

[noun]

(5.28) hi’ manhigatam
she a-leader-POSS
she is their leader

[beinoni]

(5.29) *hu’ loked ˇsel mi´srad hah. aqla’ut
*he traps POSS ministry the-agriculture
*he traps of the agriculture ministry

[beinoni]

(5.30) *hu’ lokddam
*he traps-POSS
*he traps of them

On the other hand, an accusative pronoun suffix, and/or accusative modification by the preposition ˇsel, is possible for noun usages of the beinoni form.

[beinoni, accusative pronoun]

(5.31) hu’ lokddam
he traps-ACC
he traps them

[beinoni, accusative pronoun]

(5.32) hu’ lokddam ˇsel nh. aˇsim
he traps-ACC snakes
he is a snake trapper

Construct state The construct state of regular nouns can be either possessive (as for nouns) or accusative (as for beinoni), in contrast to the construct state of beinoni, which is always accusative.

[noun]

(5.33) ˇsomrei hamip‘alim ⇒ haˇsomrim ˇsel hamip‘alim ⇒ haˇsomrim ’et hamip‘alim
guards factories ⇒ the-guards POSS the-factories ⇒ that-guard PREP the-factories
the factories guards ⇒ the guards of the factories ⇒ that guard the factories

[beinoni]

(5.34) lokdei hanh. aˇsim ⇒ *halokdim ˇsel hanh. aˇsim ⇒ halokdim ’et hanh. aˇsim
trap snakes ⇒ *the-trappers POSS the-snakes ⇒ that-trap PREP the-snakes
the snakes trappers ⇒ *the trappers of the snakes ⇒ that trap the snakes

Definite article, Relativizer The prefix h represents a definite article for regular nouns, and a relativizer for beinoni usages (i.e., it can be replaced by a ˇs relativizer).

(5.35) haˇsomer ˇsel hamip‘alim
the-guard POSS the-factories
the guard of the factories

haˇsomer ‘al hamip‘alim ⇒ ˇseˇsomer ‘al hamip‘alim
that-guards PREP the-factories ⇒ that-guards PREP the-factories
that guards the factories ⇒ that guards the factories

Note that, for this construction, quantification is possible with a relativizer but not with a definite article.

(5.37) *kol haˇsomer ˇsel hamip‘alim yode‘ ’et tapqido
*all the-guard POSS the-factories knows duty-his
*all the guard of the factories knows his duty

kol haˇsomer ‘al hamip‘alim yode‘ ’et tapqido
all that-guards ACC the-factories knows duty-his
whoever guards the factories knows his duty

Adjective usages that cannot be considered as adjectives Certain adjective usages of beinoni forms do not pass the adjective tests suggested by Doron [36].

Negation The negation prefix bilti modifies adjectives, in contrast to adjective usages of beinoni.


[adjective]

(5.38) bilti musmak
un certified
uncertified

[beinoni]

(5.39) *bilti batel
*not unemployed
*not unemployed

Complement of verbs Adjectives can appear as complements of the verbs nir’e and notar, in contrast to adjective usages of beinoni.

[adjective]

(5.40) hu’ notar ‘ayep
he remains tired
he remains tired

[beinoni]

(5.41) *hu’ notar bat.el
*he remains unemployed
*he remains unemployed

Gradability Adjectives are gradable and can be modified by words such as yoter, haki (more, most), in contrast to adjective usages of beinoni.

[adjective]

(5.42) hu’ mnahel maclih. ⇒ hu’ hamnahel haki maclih.
he a-manager successful ⇒ he the-manager the-most successful
he is a successful manager ⇒ he is the most successful manager

[beinoni]

(5.43) hu’ po‘el bat.el ⇒ *hu’ hapo‘el haki bat.el
he a-worker unemployed ⇒ *he the-worker the-most unemployed
he is an unemployed worker ⇒ *he is the most unemployed worker

Semantics

As mentioned above, according to Gesenius, in contrast to nouns and adjectives, participles and verbs are connected with an action or activity. This claim does not hold for nominalizations. In any case, in contrast to present tense verbs, a participle can be the agent of a predicate, e.g., hamh. arpim hitnaclu (the cursers apologized), kotbim yac’u lh. upˇsa (writers took a vacation).

Summary

In summary, we recommend introducing a distinct tag for beinoni forms, specifically to avoid the systematic confusion that would otherwise occur between the noun, adjective, and verb tags. Our main motivation is that beinoni forms have specific syntactic features, which overlap only partially with each of the major categories.

5.3.3 Conclusion

We suggest that the morphological analyzer offer four POS tags to cover the various forms of beinoni:

• Noun – a possible noun analysis should be suggested by the analyzer for any form which is listed as a noun in the lexicon. The noun list should be extended by any participle form of the verbs in the lexicon, if the corpus contains instances of these forms in the role of a noun according to the noun phrase construction tests listed in Appendix B.1.1.

• Adjective – a possible adjective analysis should be suggested by the analyzer for any form which is listed as an adjective in the lexicon. The adjective list should be extended by any participle form of the verbs in the lexicon, if the corpus contains instances of these forms in the role of an adjective according to the adjective phrase construction tests listed in Appendix B.1.2.

• Participle – the participle option should be suggested for any of the beinoni forms.

• Verb – an option for a verb in present tense should be suggested only for absolute state forms,10 which have no suffix or bklm prefixes.
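The four-way proposal above can be phrased as a decision procedure for the analyzer. The sketch below is our own abstraction (the flag names are not the MILA analyzer's API): lexicon membership covers both listed entries and corpus-attested extensions per the tests of Appendix B.1.

```python
# Sketch of the proposed analysis options for a beinoni form.
#   in_noun_lexicon / in_adj_lexicon: form is listed in the lexicon
#     (or corpus-attested per the tests of Appendix B.1);
#   absolute_state, has_suffix, has_bklm_prefix: surface-form properties.
def beinoni_options(in_noun_lexicon, in_adj_lexicon,
                    absolute_state, has_suffix, has_bklm_prefix):
    options = []
    if in_noun_lexicon:
        options.append("noun")
    if in_adj_lexicon:
        options.append("adjective")
    options.append("participle")  # suggested for every beinoni form
    # Present-tense verb: only absolute-state forms with no suffix
    # and no bklm prefix.
    if absolute_state and not has_suffix and not has_bklm_prefix:
        options.append("verb")
    return options

print(beinoni_options(True, False, True, False, False))
# → ['noun', 'participle', 'verb']
```

A bklm-prefixed form thus never receives a present-verb option, while the participle option is always available.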

Our suggestion, which is supported by morphological, syntactic, and semantic evidence, is different from the Rav Milim approach – which has no beinoni category – and from the KC analyzer – which classifies present verbs as beinoni.

Practical Issues

As reported for the modal tags, an agreement of about 99% was reached among our human taggers with respect to the definition of participle, verb, noun, and adjective categories. The ambiguity level of the analyzer, i.e., the average number of analyses per token, was not significantly changed. Can this approach improve the overall tagging accuracy? Will it better support other applications which make use of the tagger? We intend to review the implications of this decision in the future.

10As mentioned in section 2.4.4, appending a pronoun suffix (accusative, nominative) to a verb corresponds to a formal register and is rather uncommon. We adopt the policy of MILA, which limits such suffixation to a closed list of verbs.

5.4 Adverbs

5.4.1 Adverbs in Modern Hebrew

According to Nir's discussion [90, chapter 18], adverbs in Modern Hebrew are the most heterogeneous set of all parts of speech.11 Adverbs are used to modify verbs of all forms, adjectives, quantifiers (especially those expressing inexact quantities), and full phrases, as follows:

• Verb: hah. oleh munˇsam mlakutit (the patient is being artificially ventilated).

• Adjective: hu’ mnuseh polit.it (he is politically experienced).

• Quantifiers: yah. asit harbeh yladim (relatively many kids).

• Full phrases: h. ad maˇsma‘it, ’ein nezeq briuti bet.elepon selolari (definitely, no health damage is caused by cellular phones).

In contrast to other languages, such as English (-ly) and French (-ment), there is no single typical or consistent method of adverb formation in Modern Hebrew. One can identify various types of adverb derivation:12

• Closed list of ‘pure’ adverbs, mostly from the Bible [8, p. 594], e.g., pit’om (suddenly), day (sufficiently), h. inam (free).

• Conversion or null derivation (see section 2.3.2) of adjectives and nouns to adverbs:

– Singular masculine adjectives: ˇsloˇsa pcu‘im qal (three lightly wounded), carik lah.ˇsob b’open ycirati (must think creatively), liktob nakon (to write correctly).

11For an overview of the evolution of the adverb definition in MH, see [13, pp. 41–43].
12See also [90, chapter 18], [8, pp. 593–601], [13, pp. 41–43].


– Plural feminine adjectives: hu’ diber ’iti gluyot, ‘aniti lo qcarot (he talked to me frankly, I answered him shortly).

– Nouns:13 ’etmol (yesterday), mah. ar (tomorrow), stam (simply), klum (nothing).

• Suffixation of it to singular nouns – paniti ’eilab. teleponit (I spoke to him telephonically)14 – or t to singular masculine adjectives – hu’ lah. ac ’oto ’iˇsit (he pushed him personally).

• Prefixation:

– b with nouns:15 bimhirut (quickly), bhaclah. ah (successfully), bich. oq (jokingly), biktab (in writing).

– b with beinoni, mostly in the pu‘al template, which is semantically close to an adjective: bimrumaz (implicitly), bimyuh. ad (especially).

– b with adjectives, mostly in the Bible: beh. azaq yabo’ (he will strongly come), and in slang: nicah. nu bgadol (we won in a big way).

– klm with nouns: libri’ut (for health), and adjectives: kmuban (of course). As noted by Nir, this type of prefixation is used for the derivation of sentence adverbs, with a pronoun suffix – lda‘ati (in my opinion), lda’abonenu (regretfully) – and in parentheticals – kanir’eh (apparently), lik’ora (theoretically).

• Collocations:

– Lexically bound collocations, mostly from the canonic literature:16 b‘al peh (by heart), qab wnaqi (clearly), as well as some loan translations: m‘al wume‘eber (above and beyond), dam qar (cold blood), and phoneticisms: ciq caq (very quickly), h. ap lap (carelessly).

– Free syntactic collocations, mostly based on the pattern b’open/bcurah/bderek/bmidah + adjective (in + adjective + technique/method/way/extent), e.g., bmidah ˇsaba (to the same extent).

– Binominals: yom yom (daily), parah parah (one at a time).

13Definite nouns which denote a period of time, e.g., hayom (today), haqayic (this summer), can be considered as an expansion of such nouns.
14These adverbs can be viewed as adjectives in singular feminine form, with an omitted noun.
15As noted by Nir, this is the most common type of adverb formation.

5.4.2 Distinguishing Criteria

In this section, we present some of the issues raised during the manual tagging process with respect to adverbs.

Adverbs vs. Prepositional phrases The tagging criterion for tokens consisting of a preposition prefix and a noun that modifies a verb is based on the observation that adverbs cannot be suffixed by a pronoun. The token bneh. iˇsut (with decisiveness), for instance, should be tagged as a noun with a preposition prefix, since it can be suffixed with a pronoun: hakoh. na‘ bnh. iˇsuto ha’opyanit (the force advanced with its typical decisiveness). The token breciput (continuously), in contrast, will be tagged as an adverb, since no suffixation is possible: hakoh. na‘ zeh hayom hah. amiˇsi breciput (the force advanced continuously for five days), vs. *hakoh. na‘ brciputo ha’opyanit (*the force advanced in its typical continuity).

Adverbs vs. Adjectives An adjective that describes the state of someone (or something) while performing an action is still an adjective, and not an adverb: hem hitkadmu ro‘adim (they advanced shaking), tinoqet nimc’ah mˇsot.et.et myubeˇset (a baby was found wandering around, dehydrated). Adjectives, unlike adverbs, must agree in gender and number with the noun or pronoun they modify: if changing the gender or number of the subject of the verb calls for a change in the modifier, it is an adjective and not an adverb. In the following sentence, maher (quickly) is an adverb – hu’ tiyeg maher (he tagged quickly) – while in the next sentence, matuh. (nervous) is an adjective – hu’ h. ikah matuh. latoca’ot (he waited nervously for the results) – since it agrees in gender and number with the subject of the verb: hen h. iku matuh. ot latoca’ot (they waited nervously for the results).

16Collocation modeling is supported by our representation – see section 6.3.1.
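The agreement criterion can be operationalized once inflected forms are available: a modifier whose form must change with the subject's gender/number is an adjective, and an invariant one is an adverb. A toy sketch, with ASCII stand-ins for the transliterated paradigms (the paradigm entries are illustrative, not lexicon data):

```python
# Adverb vs. adjective: adjectives agree with the subject in gender and
# number; adverbs are invariant. Keys: ms/fs/mp/fp = masc/fem sg/pl.
# The paradigms below are toy illustrations of the two test words.
PARADIGMS = {
    "maher": {"ms": "maher", "fs": "maher", "mp": "maher", "fp": "maher"},
    "matuh": {"ms": "matuh", "fs": "mtuha", "mp": "mtuhim", "fp": "mtuhot"},
}

def is_adjective(modifier):
    """If changing the subject's gender/number forces a change in the
    modifier's form, tag it as an adjective; otherwise as an adverb."""
    forms = set(PARADIGMS[modifier].values())
    return len(forms) > 1

print(is_adjective("maher"))  # invariant form -> adverb
print(is_adjective("matuh"))  # agreeing forms -> adjective
```

In practice the paradigm lookup would come from the morphological analyzer rather than a hand-written table.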

Adverbs vs. Verbs In the following constructions, the verb fills the role of a modal adverb. It is still tagged as a regular verb or as a participle, since it is inflected for tense:17

(5.44) muban ˇseyosi hu’ hamnaceh. ⇒ yihyeh muban ˇseyosi hu’ hamnaceh.
It is clear that Yosi is the winner ⇒ It will be clear that Yosi is the winner

(5.45) yosi, mistaber, hu’ hannaceh. ⇒ yosi, histaber, hu’ hannaceh.
Yosi, probably, is the winner ⇒ Yosi, probably, was the winner

(5.46) dome ˇshu’ ’ah. ron ha‘anaqim ⇒ hayah dome ˇshu’ ’ah. ron ha‘anaqim
It seems he is the last giant ⇒ It seemed he was the last giant

Only lexicalized adverbs are tagged as adverbs instead of prefixed verbs, e.g., kanir’eh (as seen – apparently), kmuban (as understood – of course). In the following section, we discuss selected test cases.

17Note that some of these forms have an adverb translation in English.


Selected Test Cases

• hem nas‘u bah. azarah (they drove back)

Sapir, HMA, Segal, and the Treebank tag the token bh. azarah (back) as a prefixed noun. Our test supports the decision of Even-Shoshan, Knaani, Rav Milim, and Yona to tag it as an adverb – it cannot be modified by an adjective: *hem h. azru bh. azarah mrubah (*they drove with a lot of back).

• ˇsliˇsit, lo’ nitan kayom lada‘at (third, we cannot conclude today)

The word ˇsliˇsit (third) is usually tagged as an adjective (Even-Shoshan, Sapir, Knaani, HMA, and Segal), as an ordinal number (Rav Milim), or as a quantifier (Yona). The Treebank tags it as an adverb. The adjective tag can be selected only if we consider the sentence to contain a hidden omitted noun (the third point); our judgment is based on the given tokens as they appear in the sentence. Moreover, no gender inflection, e.g., ˇsliˇsi, or number inflection, e.g., ˇsliˇsiyim, was found for this construction. According to our tagging guidelines, all types of numbers (ordinal, cardinal, distributive) in any role (noun, adjective, and adverb – as in this case) should be tagged as numerals.

• hamemˇsalah tˇsalem 500 ˇsqalim, nosap la´sakar haragil (the government will pay 500 NIS, in addition to the usual salary)

The word nosap (more) is commonly used in Hebrew as an adjective (additional), as suggested by Rav Milim, Even-Shoshan, Sapir, Knaani, Yona, and HMA. The Treebank tags such cases as a preposition. Under the Treebank view, the preposition prefix l (to) follows another preposition, nosap, which is against the preposition rules (see section 5.5). We suggest modeling the inter-token word nosap l (more than)18 with a preposition tag, and the same for the popular variation bnosap l (with more than). Note that the token bnosap (in addition) is used to fill the role of an adverb in constructions such as bnosap, hiˇska‘nu kesep b’oraˇsmut‘eh (in addition, we invested money in a wrong way). For these cases, an adverb tag should be selected, as suggested by the Treebank, in contrast to the other dictionaries, which tag it as a prefixed adjective.

18See section 6.3.1 below.

• hayebu’ neto histakem b 21 milyard dolar (the import amounted to 21 billion dollars, net)

Even-Shoshan, Sapir, Knaani, HMA, and Segal tag the word neto as a noun, in contrast to the common adjective tag of the English word net. Rav Milim does not attach any tag to this word. In our sentence, the token can be considered, syntactically, as an adverb of the verb histakem (amounted), or as an adjective that modifies the noun hayebu’ (the import) or the numeral 21 milyard (21 billion). The Treebank uses the special MOD19 tag. We decided to adopt the first approach and tag it as an adverb, due to the absence of gender and number inflections for this loan word.

• siper bhumor ‘al h. avayotab (he told us with humor about his experiences)

The token can be modified – bhumor rab (with a lot of humor) – so it should be tagged as a prefixed noun. The same holds for bkeip, e.g., hu’ siper bkeip ‘al h. avayotab (he told us about his experiences with pleasure), although adjective modification is somewhat awkward: hu’ siper bkeip rab ‘al h. avayotab (he told us about his experiences with a lot of pleasure).

• hi’ hudrkah bimˇsutap ‘al ydei kamah meha’irgunim hah. aˇs’iyim (she was guided together by some of the secret organizations), hu’ tamak bgaluy bmu‘amaduto ˇsel perec (he openly supported Perets's candidacy)

Most of the dictionaries tag the tokens bimˇsutap and bgaluy as prefixed adjectives. Knaani and the Treebank analyze bgaluy as an adverb, and the Treebank somehow defines bimˇsutap as a prefixed noun. We found that such constructions – a prefixed adjective in the role of an adverb – are restricted to participle adjectives, e.g., bgaluy (obviously), bkapup (subject to), cf. *bh. akam (*with smart), *brcini (*with serious). On the other hand, there are also lexical considerations, e.g., *bkapuy (*with forced), *bˇsamur (*with reserved). We suggest tagging this closed set of prefixed adjectives of participle form, in the role of adverb, as adverbs.

19see: http://mila.cs.technion.ac.il/treebank/Decisions-Corpus1-5001.v1.0.ps

• bmabat. miqarob mitbarer ˇsehi’ nimntah ‘im mkimei gandas del viyah (taking a closer look, it turns out that she belonged to the founders of Gandas del Via)

Even-Shoshan, HMA, Segal, and the Treebank tag the token miqarob (of close) as a prefixed adjective, which seems to describe a hidden noun: bmabat. m(imaqom) qarob (from a close (point of) view). We follow Sapir, Knaani, Rav Milim, and Yona, who tag it as an adverb, since it exhibits no gender or number inflections – mqrobah and miqrobim were not found in the corpus. The presence of adverbs in a noun phrase may indicate that nominal gerunds play a verbal role in Modern Hebrew constructions.

• ’ibadnu ’otam kol kak qarob habaytah (we lost them so close to our home), ‘adayin h. aserim 40 meter ˇsehem qarob lk-80 milyon qub (40 meters are still missing, which are about 80 million cubic meters)

Even-Shoshan, HMA, Segal, Yona, and the Treebank tag the token qarob (close) as an adjective. We follow Sapir, Knaani, and Rav Milim, who tag it as an adverb. Syntactically, the plural form is not allowed in such constructions, e.g., *kol kak qrobim habaytah, *qrobim lk-80 milyon qub. Semantically, in the first case it describes the location of the event (their loss), and in the second case it is equivalent to ‘approximately’.

• hu’ hipsid bimrucat haˇsnatayim ha’ah. ronot (he lost during the last two years)

We follow most of the dictionaries by tagging the token bimrucat
(during) as a prefixed noun, rejecting Knaani’s adverb suggestion, due

to the possibility of pronoun suffix modification:

hu’ hipsid bimrucatam ˇsel haˇsnatayim ha’ah. ronot.


• hamemˇsalah qiblah ’et hahlat.ah

lip‘ol peh ’eh.ad (the government decided to act unanimously) All dictionaries analyze the collocation peh ’eh.ad (one mouth) as a noun and a numeral. We suggest modeling it as an adverb collocation (see section 6.3.1), since it expresses manner and does not exhibit any morphological inflections.

5.4.3 Summary

We reviewed the definition of the adverb category in Modern Hebrew, and defined tests to constructively identify adverbs and distinguish them from other categories. These tests can be used by annotators for corpus tagging and by lexicographers while building a lexicon and morphological analyzer. A partial list of the adverbs found in a sample of 100K tokens is given below:


5.5 Prepositions

5.5.1 Prepositions in Modern Hebrew

Definition

Ben-Asher [14] discusses the definition and the syntactic role of prepositions in Modern Hebrew. He argues against definitions based only on semantic criteria, such as those of Goshen et al. [51, p. 4], Nahir [85, p. 7], and Livny & Kochvah [75, p. 97], as well as against Yo’eli [122], Blao [19], and Zadka [127], who involve syntactic considerations such as word ordering and exclude morphological criteria. Ben-Asher’s preposition definition is based mainly on morphological criteria, with syntactic and semantic considerations:


1. In contrast to nouns, only pronominal pronouns can be attached to prepositions, but not independent pronouns, e.g., ’eceli / *’ecel ’ani (by-me / *by I), biglalka / *biglal ’ata (because-of-you / *because you), vs. ‘wdeni / ‘wd ’ani (still-I / still I).

2. There is no plural form for prepositions.

3. The attachment of a pronominal pronoun to a given preposition is based on either the singular or the plural baseform, but not on both: ’elai ’eleika ’eilab (to-me to-you to-him), not *’eli *’elka *’elo; negdi negdka negdo (against-me against-you against-him), not *negdai *negdeika *negdab. As noted by Ben-Asher, this rule does not apply to the third-person suffixation of the preposition bein (between), which has both the singular-based beinam (between-them) and the plural-based beinehem (between-them) suffixations.

4. Prepositions can precede only a noun or a pronoun. Some prepositions – ’ah.arei, lipnei, min, ‘ad (after, before, from, until) – may come before adverbs, e.g., ‘ad ˇsilˇsom, lipnei ’etmol, ’ah.arei mah.ar, min hayom (until the day before yesterday, before yesterday, after tomorrow, from today).

5. Semantic considerations, such as the relation between words, may be taken into account in addition to the above criteria.

As for the syntactic role of the prepositions, Ben-Asher follows Brockelmann [24] in considering prepositions as nouns in a descriptive role which starts an adverbial phrase, and where the preposition is the head of the construct state and the noun is the modifier. In some cases, prepositions start indirect object

or modifier phrases. Most of the prepositions can start subordinate clauses by adding the ˇs (that) morpheme, e.g., lipnei ˇs (before that). Some of the prepositions may fill the role of conjunction with no additional morpheme – lma‘an, b‘od, ‘eqeb, ba‘abur, t.erem, me’az (in order


Criteria   Type
Form       Simple / Complex
Source     Fundamental / Derived
Origin     Bible / Mishna and Medieval / Modern Hebrew

Table 5.4: Classification of Hebrew prepositions.

to, while, because of, for, before, since) – or by adding the ˇs morpheme – biglal ˇs, ’al ’ap ˇs, lamrot ˇs (because of, in spite of, despite).

Classification

The Hebrew prepositions can be classified by their form (simple/complex) or by their source (fundamental/derived) [109, p. 162], or by their origin (Bible/Mishna and Middle Ages/Modern Hebrew), as listed in Table 5.4. Complex forms are usually composed of a simple preposition and a noun. Derived prepositions are particles that used to have some other meaning, and became prepositions at a later stage. Ben-Asher argues for a closed set of prepositions.

5.5.2 Distinguishing Criteria

Suffixed Preposition vs. Prefixed Pronoun

Tokens such as bam (in them) could be tagged either as a preposition baseform b (at) followed by a pronominal suffix hem (them), or as a preposition prefix b (in) followed by a pronominal suffix hem (they), with no baseform stem. We chose the first option, since it is consistent with the case of prepositions which cannot be agglutinated, e.g., ‘alab (on him) – there is no prefix ‘al (on).

Preposition vs. Conjunction

1. Conjunctions of subordination are generally followed by the binder ˇs (that) and a clause, in contrast to prepositions, which are followed by a noun phrase and are not normally followed by ˇs:

• nicah. nu ‘al ’ap haˇsiput. habeiti (we won despite

the home referee) – the collocation is a preposition.

• nicah. nu ’ap ‘al pi ˇsehaˇsiput. hayah beiti

(we won although the judgment was local) – the collocation is

a conjunction.

2. Conjunctions of coordination – v (and), ’o (or) – can bind any type of syntactic category or phrase. Prepositions are always followed by a noun

phrase:

• hu’ h. alah ‘ekeb haznah. at bri’uto (he got sick

after neglecting his health) – the word is a preposition.

• hu’ h. alah, keiban ˇsehiznih. ’et bri’uto (he

got sick, since he neglected his health) – the word is a conjunction.

Prefixed Noun vs. Preposition

Some preposition words could be interpreted as a noun with an agglutinated preposition, e.g., bizkut, b‘iqbot, btom, bmeˇsek, bimlot, mita‘am, meh.amat, mipnei, micad (thanks to, as a result of, at the end of, during, as it ended, on behalf of, because of, because of, on the part of). The test to determine whether the token is used as a preposition (single word) or as a noun prefixed by a preposition is the following:

1. If the word ˇsel (of) can be inserted, the word is used as a noun:

• mita‘am roˇs hamemˇsalah ⇒ *mita‘am ˇsel roˇs hamemˇsalah (on behalf of the prime minister ⇒ *on behalf of the prime minister) – ˇsel cannot be inserted, so the token is a preposition.

2. If a quantifier can be inserted before the noun, the word is used as a noun:


• brah.abei ha‘olam ⇒ bkol rah.abei ha‘olam (over the world ⇒ all over the world) – the quantifier can be inserted, so the token is a prefixed noun.

• b‘iqbot ‘aliyat mh.irei hanept. ⇒ *bkol ‘iqbot ‘aliyat mh.irei hanept. (as a result of the rise in kerosene prices ⇒ *as all a result of the rise in kerosene prices) – the token is a preposition.

Selected Test Cases

• mni‘av notru bgeder mistorin ‘ad hayom (his motives remain within the realms of mystery until today)

Even Shoshan and Knaani tag the token bgeder (within the realms of) as an adverb, while Rav Milim, Yona, and the Treebank consider it a preposition. We follow Sapir, HMA, and Segal, considering it a prefixed noun b-geder (with a border). It is not an adverb, since it modifies the noun mistorin (mystery). It is not a preposition, since the genitive ˇsel (of) can be inserted: bgeder ˇsel mistorin (in the range of mystery).

• hatbi‘ah higi‘ah lheskem ‘im haneˇsam beqeˇser lhemˇsek hahalik hamiˇspat.i (the prosecution reached an agreement with the accused on the continuation of the judicial procedure)

Even Shoshan, Sapir, Knaani, HMA, and Segal analyze the token bqeˇser (with respect to) as a prefixed noun b-qeˇser (with-connection), while Rav Milim and Yona consider it an adverb. The Treebank defines it as a preposition. We do not think it is an adverb, since it modifies the noun heskem (agreement) and not the verb higi‘ah (got to). According to our tests, it seems to be a preposition – *bkol qeˇser lhemˇsek hahalik hamiˇspat.i (*with all respect to the continuation of the judicial procedure) – but in order to avoid two consecutive prepositions, we suggest modeling the option of the inter-token word bqeˇser l with a preposition tag (see section 6.3.1).20

• biqˇsu kesep tmurat tmikah mu‘amaduto (they asked for money for their support of his candidacy)

The token tmurat (for) is interpreted by most of the dictionaries (Even Shoshan, Sapir, Knaani, Rav Milim, HMA, and Yona) as a construct state noun, while Segal tags it as a conjunction. It is clearly not a conjunction – the binder ˇs (that) cannot be attached as a prefix, e.g., *biqˇsu kesep ˇsetmurat tmikah mu‘amaduto (*they asked for money that for their support in his candidacy). Should we tag it as a preposition, as suggested by the Treebank? Even though, semantically, it can be read as a preposition, as indicated by its translation, we prefer to keep the traditional ‘prefixed noun’ interpretation, since it does not fit any of the preposition derivation formations listed above.

• mitgayes riˇson mikereb qhilat ha‘ib- riyim bdimonah (first recruit of the Hebraic community in Dimona)

Most of the dictionaries consider the token miqereb (of the) as a prefixed noun m-qereb. According to our tests, it should be tagged as a preposition, as suggested by Rav Milim: *mkol qereb qhilat ha‘ibriyim (*of all the Hebraic community), *miqereb ˇsel qhilat ha‘ibriyim (*of of the Hebraic community).

• hu’ hayah mnaqeh ’et hazuhamah ˇsenigrma lahem bmo yadab (he was cleaning the dirt, caused by his own hands)

The token is tagged as a conjunction by Even Shoshan and Sapir, as a preposition by Knaani, Rav Milim, Yona, and the Treebank, and as a particle by HMA and Segal. Since the token bmo (by) cannot be followed by the binder ˇs, the conjunction option is rejected. For this case, the preposition tag should be selected (there is no ‘particle’ tag in our tagset).

20 In addition to the option of the prefixed noun bqeˇser.


• habanqim ha’ah.erim, lma‘et. tpah.ot wdisqont. (the other banks, excluding Tfachot and Discount)

The token lma‘et. (excluding) is tagged by Rav Milim and Yona as a conjunction, by HMA, Segal, and the Treebank as a preposition, and by Knaani and Even Shoshan as an adverb (there is an entry with no POS for this token in Sapir). Since the binder ˇs cannot follow this token, e.g., *lma‘et. ˇse (*excluding that), the conjunction option is rejected. It is not an adverb, since it starts a modifier of a noun phrase – habanqim ha’ah.erim (the other banks). The preposition option is preferred over that of an infinitive verb, due to semantic considerations.

• tel abib ˇsel h. op hayam whaderek bo’akah yapo (Tel Aviv of the sea coast, and the way toward Jaffa)

Knaani, HMA, and Segal tag bo’akah as an adverb. The existence of an adverb in a nominal sentence is somewhat strange; they may argue for an omitted hidden verb: haderek (hamagi‘ah) bo’akah yapo (the way (which gets) to Jaffa). We follow the Treebank, considering bo’akah as a preposition. It seems to modify the noun yapo, and it can be replaced with prepositions such as lyad (near).21

• bin laylah hukah. ki nitan liptor b‘ayah k’uba bat ˇsanim (within a night, it was shown that a problem that had been hurting us for many years could be solved)

Even Shoshan analyzed the word bin (within) as the construct state form22 of the noun ben (son), with a meaning of bemeˇsek (during). Sapir and Knaani tag the word as an adverb, even though it modifies a noun

21 In addition, according to Rabbi Nechemia’s rule [117, Yebamot p. 13b] – (in the Bible, the preposition prefix l can be alternatively expressed by the special suffix h) – such a form should be interpreted as lbo’ (near), cf. (Num. 13:21), which is a clear preposition.

22 This is the approach of the classic etymological dictionaries of the Old Testament, see [67, vol. 1, p. 138], [22, vol. 2, p. 153].

phrase laylah (a night). Moreover, the word bin can be replaced by a clear preposition, such as tok. The Treebank tags the word as a preposition. It seems that the adverb tag is motivated by the syntactic role of the collocation bin laylah, which modifies the verb phrase ... hukah. ... that follows it. We suggest the preposition tag for the word bin, or modeling the collocation bin laylah as one token with an adverb tag, according to our text representation method

(section 6.3.1).

• ’arcot habrit tsapeq ’et hacrakim hamiyadiyim ˇsel ha‘am ha‘iraqi, kolel

hath. alah miyadit btoknit ˇsiqum (The United States will supply the immediate needs of the Iraqi people, including immediate beginning of a rehabilitation program)

Most of the dictionaries tag the token kolel (including) as an adjective, excluding the Treebank, which tags it as a preposition. We consider it a case of participle, since the definite form of the nominal phrase requires insertion of the ’et preposition: (including PREP the immediate beginning of the rehabilitation program).

• lbad mehagermanim acmam hu’ ’eh.ad me‘edei har’iyah halo-yehudim hme‘at.im ˇs´sardu (beside the Germans themselves, he is one of the few non-Jewish witnesses that survived); hapo‘el yeruˇsalyim hiclih.ah lhapgin ‘elyonut, prat. lim‘idah babayit mul ramat gan (Hapoel Yerushalayim succeeded in showing supremacy, except for a stumble in the home game against Ramat Gan); h.uc mizeh, mikuh. ma‘amado kesenat.or hiclih. l’esop trumot bsak 7 milyon dolar (beside that, as a result of his position as a senator, he managed to raise contributions amounting to 7 million dollars)

The token lbad (beside) is tagged as an adverb by Even Shoshan, Sapir, Knaani, Rav Milim, and Yona, as a noun by Segal, and as a preposition by the Treebank. The token prat. (excluding) is tagged as a prefixed noun by most of the dictionaries, excluding the Treebank, which suggests a preposition tag. The token h.uc (beside), in the third sentence, is tagged as an adverb by Even Shoshan, Knaani, and Rav Milim, as a noun by Sapir, HMA, Segal, and Yona, and as a preposition by the Treebank. As recommended above for bqeˇser l and benosap l, we suggest modeling the inter-token expressions lbad m, prat. l, and h.uc m with a preposition tag, in addition to the option of adverb for lbad, and noun for prat. and h.uc.

5.5.3 Summary

We reviewed the definition of the preposition category of Modern Hebrew and defined tests to identify prepositions in a given corpus. These tests can be used by annotators while tagging a corpus, and by lexicographers for building a lexicon and morphological analyzer. A partial list of the prepositions found in a sample of 100K tokens is given below:

5.6 Conclusion

Morphological disambiguators and corpus annotators must tag all tokens in a text with high agreement. Careful tagset design is required when performing

this task on a large-scale corpus: we cannot rely on a foreign tagset, and we cannot rely on existing dictionaries – since, as we have shown, they often disagree. The criteria we used for tag definition are mostly morpho-syntactic, with some semantic considerations. The quality of the tag definitions should be tested with respect to the disambiguation process (agreement among annotators and accuracy of the disambiguator) and with respect to their impact on applications which make use of the tagged text. We developed tagging guidelines which bring the annotators into agreement on 99% of the tokens in a corpus of 200K words. We intend to test our tagset design on the disambiguation process and with respect to noun-phrase chunking and named-entity recognition applications.

Chapter 6

Computational Model

Quand il revint de ses voyages, en décembre mille neuf cent cinquante-quatre, Bartlebooth chercha un procédé qui lui permettrait, une fois reconstitués les puzzles, de récupérer les marines initiales.

(When he came back from his travels, in December nineteen fifty-four, Bartlebooth looked for a process that would allow him, once the puzzles were put back together, to recover the original seascapes.)

One of the objectives of this work is to design a model which supports unsupervised learning based on words rather than tokens. The need for a word-based model has two aspects. From a complexity point of view, the huge number of parameters in token-based models makes the training process hard in terms of parameter estimation and of time and memory complexity. In addition, word modeling is motivated by linguistic considerations, such as the representation of inter-token and multi-token words. In this chapter, we propose a text representation and algorithms for unsupervised learning of a word-based Hidden Markov Model.1

6.1 Token-Based HMM

The common formulation of the task of an unsupervised tagger takes the form of a Hidden Markov Model (HMM). In this generative model, a given sequence

1 This chapter is based on [2].

of observed events is considered to be the emitted output of a stochastic process over a set of states. Formally, an HMM is defined by a triplet (K, S, µ) where:

K = {k_1, ..., k_M} – the output alphabet.

S = {s_i, 1 ≤ i ≤ N} – the set of states.

µ = (Π, A, B) – the probabilistic model:

  Π = {π_i} – initial state probabilities, where π_i is the probability to start at state i, Σ_i π_i = 1.

  A = {a_{i,j}} – state transition probabilities, where a_{i,j} is the probability to move from state i to state j, Σ_j a_{i,j} = 1.

  B = {b_{i,k}} – symbol emission probabilities, where b_{i,k} is the probability to emit symbol k while visiting state i, Σ_k b_{i,k} = 1.

Algorithm 6.1.1 describes the Markov process for this model.

Algorithm 6.1.1 A program for a Markov process

t := 1
Start with state i_1 with probability π_{i_1}
while forever do
  Emit observation symbol o_t = k with probability b_{i_t,k}
  Move from state i_t to state i_{t+1} with probability a_{i_t,i_{t+1}}
  t := t + 1
end while
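As an illustration, the process of Algorithm 6.1.1 can be sketched in Python. The dictionary encoding of Π, A, and B below is our own choice for the sketch (not the dissertation's implementation), and the unbounded loop is truncated to a fixed number of emissions:

```python
import random

def generate(pi, A, B, length):
    """Run the Markov process of Algorithm 6.1.1 for `length` steps.

    pi maps state -> start probability; A maps (i, j) -> transition
    probability; B maps (state, symbol) -> emission probability.
    """
    def draw(dist):
        # Sample one outcome from a {outcome: probability} distribution.
        r, acc = random.random(), 0.0
        for outcome, p in dist.items():
            acc += p
            if r < acc:
                return outcome
        return outcome  # guard against floating-point rounding

    state = draw(pi)                       # start with state i_1 ~ pi
    observed = []
    for _ in range(length):
        # emit o_t = k with probability b_{i_t,k}
        observed.append(draw({k: p for (s, k), p in B.items() if s == state}))
        # move to i_{t+1} with probability a_{i_t,i_{t+1}}
        state = draw({j: p for (i, j), p in A.items() if i == state})
    return observed
```

With the tagging model used as an example below, this produces token sequences over {start, drinking}.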

For the case of tagging, the states correspond to tags, and the tokens are emitted each time a tag is visited. For example, let us consider a tagset S = {nn2, vb3}, a set of word types K = {drinking, start}, and a probabilistic model µ:

Π = {π_nn = 0.5, π_vb = 0.5},
A = {a_{nn,nn} = 0.4, a_{nn,vb} = 0.6, a_{vb,vb} = 0.8, a_{vb,nn} = 0.2},
B = {b_{nn,drinking} = 0.7, b_{nn,start} = 0.3, b_{vb,start} = 0.6, b_{vb,drinking} = 0.4}.

The Markov process for a sequence of two observations, according to this model, is illustrated in Figure 6.1.

2 noun
3 verb

Figure 6.1: Markov process.

Given an output sequence O, the search algorithm looks for the best state sequence, according to the probabilistic model, for this set of observations, i.e., a state sequence X that maximizes P(X|O, µ) (for a fixed O, this is equivalent to maximizing the joint probability P(X, O|µ)). For example, if the sentence start drinking is observed, the state sequence (vb, vb) is the most probable, according to the probabilistic model (see Figure 6.2), among the possible state sequences {(nn, nn), (nn, vb), (vb, nn), (vb, vb)}:

p((vb, vb)|(start, drinking), µ) ∝ π_vb · b_{vb,start} · a_{vb,vb} · b_{vb,drinking} = 0.5 · 0.6 · 0.8 · 0.4 = 0.096
p((vb, nn)|(start, drinking), µ) ∝ π_vb · b_{vb,start} · a_{vb,nn} · b_{nn,drinking} = 0.5 · 0.6 · 0.2 · 0.7 = 0.042
p((nn, vb)|(start, drinking), µ) ∝ π_nn · b_{nn,start} · a_{nn,vb} · b_{vb,drinking} = 0.5 · 0.3 · 0.6 · 0.4 = 0.036
p((nn, nn)|(start, drinking), µ) ∝ π_nn · b_{nn,start} · a_{nn,nn} · b_{nn,drinking} = 0.5 · 0.3 · 0.4 · 0.7 = 0.042
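These four products can be reproduced by brute-force enumeration of all state sequences; the sketch below (which encodes the example model as plain Python dictionaries, an assumption of ours) confirms that (vb, vb) maximizes the joint probability:

```python
from itertools import product

def sequence_prob(states, obs, pi, A, B):
    """Joint probability of a state sequence and an observation sequence."""
    p = pi[states[0]] * B[(states[0], obs[0])]
    for prev, cur, sym in zip(states, states[1:], obs[1:]):
        p *= A[(prev, cur)] * B[(cur, sym)]
    return p

# The example model of section 6.1, encoded as dictionaries.
pi = {'nn': 0.5, 'vb': 0.5}
A = {('nn', 'nn'): 0.4, ('nn', 'vb'): 0.6, ('vb', 'vb'): 0.8, ('vb', 'nn'): 0.2}
B = {('nn', 'drinking'): 0.7, ('nn', 'start'): 0.3,
     ('vb', 'start'): 0.6, ('vb', 'drinking'): 0.4}

obs = ('start', 'drinking')
scores = {seq: sequence_prob(seq, obs, pi, A, B)
          for seq in product(('nn', 'vb'), repeat=len(obs))}
best = max(scores, key=scores.get)  # ('vb', 'vb'), with score 0.096
```

Enumeration is exponential in the sentence length, which is exactly why the dynamic-programming search below is needed.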

The search for the most probable state sequence for a given output sequence can be done in O(N²T) time (where N is the number of states/tags and T the length of the sequence), by applying Viterbi’s algorithm [77, 9.3.2], as described in Figure 6.3. The algorithm is based on dynamic programming, where δ_i(t) denotes the probability of the best

Figure 6.2: Markov process for output sequence: start drinking.

state sequence that leads to state i at time t, and ψi(t) denotes the index of the state, at time t − 1, that leads to it.

In order to estimate the parameters of the model, the Expectation-Maximization (EM) algorithm of Baum-Welch [12] is applied over the observed emissions (EM refers to a family of algorithms [77, 14.2.2], and Baum-Welch is a specific instance of this general approach). The algorithm starts with a probabilistic model µ (which can be chosen randomly or obtained from good initial conditions), and at each iteration, a new model µ̂ is derived to better explain the given output observations. The expectation and the maximization steps of the learning algorithm for a first-order HMM are described in Fig. 6.4. α_i(t) denotes the probability to reach state i at time t, and β_i(t) denotes the probability to emit o_{t+1}...o_T from state i at time t. The algorithm works in O(N²T) time, where N is the number of states/tags and T the length of the sequence.

Initialization

δ_i(1) = π_i · b_{i,o_1}    (6.1)

Induction

δ_i(t) = max_j δ_j(t−1) · a_{j,i} · b_{i,o_t}    (6.2)

ψ_i(t) = argmax_j δ_j(t−1) · a_{j,i} · b_{i,o_t}    (6.3)

Termination and path readout

X̂_T = argmax_i δ_i(T)    (6.4)
X̂_t = ψ_{X̂_{t+1}}(t+1)

P(X̂) = max_i δ_i(T)    (6.5)

Figure 6.3: The search algorithm for a first-order token-based model.

Expectation

α_i(1) = π_i · b_{i,o_1}    (6.6)
α_i(t) = b_{i,o_t} · Σ_j α_j(t−1) · a_{j,i}

β_i(T) = 1    (6.7)
β_i(t) = Σ_j a_{i,j} · b_{j,o_{t+1}} · β_j(t+1)

Maximization

π̂_i = α_i(1)β_i(1) / Σ_j α_j(1)β_j(1)    (6.8)

â_{i,j} = [Σ_{t=2}^{T} α_i(t−1) · a_{i,j} · b_{j,o_t} · β_j(t)] / [Σ_{t=1}^{T−1} α_i(t)β_i(t)]    (6.9)

b̂_{i,k} = [Σ_{t: o_t=k} α_i(t)β_i(t)] / [Σ_{t=1}^{T} α_i(t)β_i(t)]    (6.10)

Figure 6.4: The learning algorithm for a first-order token-based model.
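The expectation step of Figure 6.4 computes the forward probabilities α and backward probabilities β. The sketch below (the dictionary encoding of the model is our assumption) also exposes the standard sanity check that Σ_i α_i(t)β_i(t) equals P(O|µ) at every time step:

```python
def forward_backward(obs, states, pi, A, B):
    """Forward (alpha) and backward (beta) passes of eqs. (6.6)-(6.7)."""
    T = len(obs)
    # forward pass: alpha[t][i] = probability to reach state i at time t+1
    alpha = [{i: pi[i] * B[(i, obs[0])] for i in states}]
    for t in range(1, T):
        alpha.append({i: B[(i, obs[t])] *
                         sum(alpha[t - 1][j] * A[(j, i)] for j in states)
                      for i in states})
    # backward pass: beta[t][i] = probability to emit the rest from state i
    beta = [None] * T
    beta[T - 1] = {i: 1.0 for i in states}
    for t in range(T - 2, -1, -1):
        beta[t] = {i: sum(A[(i, j)] * B[(j, obs[t + 1])] * beta[t + 1][j]
                          for j in states)
                   for i in states}
    return alpha, beta
```

The maximization step (6.8)–(6.10) then re-estimates π̂, â, and b̂ from these quantities on each EM iteration.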

     States   Π       A        A2          B          B2
T    3,561    2,014   855,216  40,834,319  3,261,723  39,662,645
W    362      289     54,196   2,556,836   2,318,450  16,028,955

Table 6.1: Model sizes.

6.2 Word-Based HMM

The lexical items of token-based models are the tokens of the language. The implication of this decision is that both the lexical and the syntagmatic relations of the model are based on a token-oriented tagset. With such a tagset, it must be possible to tag any token of the language with at least one tag. Let us consider, for instance, the Hebrew phrase bclm hn‘im, which contains two tokens. In a token-based HMM, we consider such a phrase to be generated by a Markov process, based on the token-oriented tagset of N = 3,561 tags/states and about M = 500K token types. Line T of Table 6.1 describes the size of a first-order token-based HMM, built over our [A7], [HR], [TB] and [KN] corpora. In this model, we found 2,014 entries for the Π vector (which models the distribution of tags in the first position of sentences) out of a possible N = 3,561, about 850K entries for the A matrix (which models the transition probabilities from tag to tag) out of a possible N² ≈ 12.5M, and about 3.2M entries for the B matrix (which models the emission probabilities from tag to token) out of a possible M·N ≈ 1.8G. For the case of a second-order HMM, the size of the A2 matrix (which models the transition probabilities from two tags to a third) grows to about 40M entries, while the size of the B2 matrix (which models the emission probabilities from two tags to a token) is also about 40M. Despite the sparseness of these matrices, the number of their entries is still high, since we model the whole set of features of the complex token forms. Let us assume, in contrast, that the right segmentation of the words in the sentence is provided to us – for example: b clm hn‘im – as is the case for English text. In this case, the observation is composed of words, generated by a Markov process based on a word-based tagset. The size of such a tagset for Hebrew is 362, and the sizes of the Π, A, A2, B, and B2 matrices are reduced to 289, 54K, 2.5M,

2.3M, and 16M, respectively, as described in line W of Table 6.1 – a reduction of about 95% for the A and A2 matrices, about 30% for the B matrix, and about 60% for the B2 matrix, when compared with the size of a token-based model. The problem in this approach is that ‘someone’ along the way agglutinates the words of each token, leaving the observed words uncertain. For example, the token bclm can be segmented in four different ways in Table 1.1, as indicated by the placement of the ‘-’ in the Segmentation column, while the token hn‘im can be segmented in two different ways. In the next section, we adapt the parameter estimation and search algorithms for such uncertain output observations.

6.3 Learning and Searching Algorithms for Uncertain Output Observation

In contrast to a standard HMM, the output observations of the above word-based HMM are ambiguous. We adapt the Baum-Welch [12] and Viterbi [77, 9.3.2] algorithms for such uncertain observations. We first formalize the output representation and then describe the algorithms.

6.3.1 Output Representation

Formation

The learning and searching algorithms of HMM are based on the output sequence of the underlying Markov process. For the case of a word-based model, the output sequence is uncertain – we do not see the emitted words but the tokens they form. If, for instance, the Markov process emitted the words b clm h n‘im, we would see two tokens (bclm hn‘im) instead. In order to handle the output ambiguity, we use static knowledge of how words are combined into a token, such as the four known combinations of the token bclm, the two possible combinations of the token hn‘im, and their possible tags within the original tokens. Based on this information, we encode the sentence into a structure that represents all the possible ‘readings’ of the sentence, according to the possible word combinations of the tokens and their possible tags.

The representation consists of a sequence of vectors, each vector containing the possible words and their tags for each specific ‘time’ (sequential position within the word expansion of the tokens of the sentence). A word is represented by a tuple [id, symbol, state, prev, next], where id is the index of the word in the vector, symbol denotes a word, state is one possible tag for this word, and prev and next are sets of indices denoting the words (of the previous and the next vectors) that precede and follow the current word in the overall lattice representing the sentence. The representation of the sentence bclm hn‘im is described in Fig. 6.5. An emission is denoted in this figure by its symbol, its state index, directed edges from its previous emissions, and directed edges to its next emissions. The states are listed in Table 6.2, and a vector representation of the first three time slots is given in Fig. 6.6. In order to meet the condition of the Baum-Eagon inequality [12] that the polynomial P(O|µ) – which represents the probability of an observed sequence O given a model µ – be homogeneous, we must add a sequence of special EOS (end of sentence) symbols at the end of each path up to the last vector, so that all the paths reach the same length.
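As an illustration, the [id, symbol, state, prev, next] tuples can be encoded as a small Python dataclass; the vector below sketches the four t = 1 emissions of the token bclm (state ids per Table 6.2; the prev/next index sets follow the lattice of Fig. 6.5, and are illustrative rather than exhaustive):

```python
from dataclasses import dataclass, field

@dataclass
class Emission:
    """One cell of a lattice vector: [id, symbol, state, prev, next]."""
    id: int
    sym: str
    state: int
    prev: set = field(default_factory=set)   # indices into the previous vector
    next: set = field(default_factory=set)   # indices into the next vector

# Time slot t = 1 for the token bclm: the token may be read as bcl+m,
# as the whole token bclm (with two possible tags), or as b+clm / b+cl+m.
t1 = [
    Emission(0, 'bcl', 5, set(), {0}),          # construct noun bcl, then m
    Emission(1, 'bclm', 3, set(), {5, 6, 7}),   # bclm as a proper name
    Emission(2, 'b', 0, set(), {1, 2, 3, 4}),   # preposition prefix b
    Emission(3, 'bclm', 8, set(), {5, 6, 7}),   # bclm as an infinitive verb
]
```

At t = 1 every prev set is empty; the EOS padding described above guarantees that all lattice paths end in the same final vector.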

Additional Usage

The above text representation can also be used to model multi-word expressions (MWEs) and inter-token words. Consider the Hebrew sentence hw’ ‘wrk dyn gdwl, which can be interpreted as composed of 3 units (he lawyer great / he is a great lawyer) or as 4 units (he edits law big / he is editing an important legal decision). In order to select the correct interpretation, we must determine whether ‘wrk dyn is an MWE. This is another case of uncertain output observation, which can be represented by our text encoding, as done in Figure 6.7.

As mentioned in section 5.4.2, the token nosap in the sentence nosap las.akar haragil can be interpreted as an adjective or as part of the inter-token preposition nosap l, which is composed of the token nosap and the prefix l of the next token las.akar. The representation of such an inter-token word is illustrated in Fig. 6.8.

Figure 6.5: Representation of the sentence: bclm hn‘im.

Figure 6.6: Vector representation of the first three time slots.

Figure 6.7: Representation of the sentence: hw’ ‘wrk dyn gdwl.

Figure 6.8: Representation of the sentence: nwsp lskr hrgl.

ID   Description
0    preposition prefix
1    relativizer
2    pronoun suffix, singular, masculine, third
3    proper name
4    noun, singular, masculine, absolute
5    noun, singular, masculine, construct
6    noun, singular, masculine, absolute, definite
7    adjective, singular, masculine, absolute, definite
8    verb, infinitive
9    verb, singular, masculine, third, past
10   participle, plural, masculine
11   end of sentence
12   pronoun, singular, masculine, third
13   participle, singular, masculine
14   adjective, singular, masculine, absolute
15   adjective, singular, masculine, absolute, definite
16   preposition

Table 6.2: State list.

Complexity

This representation seems to be expensive in terms of the number of emissions per sentence. However, we observe in our data that most of the tokens have only one or two possible segmentations, and most of the segmentations consist of at most one affix. In practice, we found the average number of emissions per sentence in our corpus (where each symbol is counted as the number of its predecessor emissions) to be 455, while the average number of tokens per sentence is about 18. That is, operating over an ambiguous sentence representation increases the size of the sentence (from 18 to 455), but, on the other hand, it reduces the probabilistic model by a factor of 10 (as discussed above).

6.3.2 Parameter Estimation

We present a variation of the Baum-Welch algorithm [12] which operates over the lattice representation we have defined above. The algorithm starts with a probabilistic model µ (which can be chosen randomly or obtained from good initial conditions), and at each iteration, a new model µ̂ is derived to better explain

Expectation

α(1, l) = π_{o_1^l.state} · b_{o_1^l.state, o_1^l.sym}    (6.11)
α(t, l) = b_{o_t^l.state, o_t^l.sym} · Σ_{l′ ∈ o_t^l.prev} α(t−1, l′) · a_{o_{t−1}^{l′}.state, o_t^l.state}

β(T̄, l) = 1    (6.12)
β(t, l) = Σ_{l′ ∈ o_t^l.next} a_{o_t^l.state, o_{t+1}^{l′}.state} · b_{o_{t+1}^{l′}.state, o_{t+1}^{l′}.sym} · β(t+1, l′)

Maximization

π̂_i = [Σ_{l: o_1^l.state = i} α(1, l)β(1, l)] / [Σ_l α(1, l)β(1, l)]    (6.13)

â_{i,j} = [Σ_{t=2}^{T̄} Σ_{l: o_t^l.state = j} Σ_{l′ ∈ o_t^l.prev: o_{t−1}^{l′}.state = i} α(t−1, l′) · a_{i,j} · b_{j, o_t^l.sym} · β(t, l)] / [Σ_{t=1}^{T̄−1} Σ_{l: o_t^l.state = i} α(t, l)β(t, l)]    (6.14)

b̂_{i,k} = [Σ_{t=1}^{T̄} Σ_{l: o_t^l.sym = k, o_t^l.state = i} α(t, l)β(t, l)] / [Σ_{t=1}^{T̄} Σ_{l: o_t^l.state = i} α(t, l)β(t, l)]    (6.15)

Figure 6.9: The learning algorithm for a first-order word-based model.

Initialization

$$
\delta(1,l) = \pi_{o_1^l.state}\; b_{o_1^l.state,\,o_1^l.sym} \qquad (6.16)
$$

Induction

$$
\begin{aligned}
\delta(t,l) &= \max_{l' \in o_t^l.prev} \delta(t-1,l')\; a_{o_{t-1}^{l'}.state,\,o_t^l.state}\; b_{o_t^l.state,\,o_t^l.sym} && (6.17)\\
\psi(t,l) &= \operatorname*{argmax}_{l' \in o_t^l.prev} \delta(t-1,l')\; a_{o_{t-1}^{l'}.state,\,o_t^l.state}\; b_{o_t^l.state,\,o_t^l.sym} && (6.18)
\end{aligned}
$$

Termination and path readout

$$
\begin{aligned}
\hat{X}_{\bar{T}} &= \operatorname*{argmax}_{1 \le l \le |o_{\bar{T}}|} \delta(\bar{T}, l) && (6.19)\\
\hat{X}_t &= \psi(t+1, \hat{X}_{t+1})\\
P(\hat{X}) &= \max_{1 \le l \le |o_{\bar{T}}|} \delta(\bar{T}, l) && (6.20)
\end{aligned}
$$

Figure 6.10: The searching algorithm for first-order word-based model.
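The search of Fig. 6.10 can be sketched in the same illustrative lattice format (emissions carrying (sym, state, prev); function names are our own assumptions, not the dissertation's code):

```python
def viterbi_lattice(lattice, pi, a, b):
    """Viterbi adapted to the lattice encoding (Eqs. 6.16-6.20, sketch):
    delta holds the best-path probability per emission, psi the backpointer."""
    T = len(lattice)
    delta = [dict() for _ in range(T)]
    psi = [dict() for _ in range(T)]
    for l, (sym, state, _) in enumerate(lattice[0]):
        delta[0][l] = pi[state] * b[(state, sym)]
        psi[0][l] = None
    for t in range(1, T):
        for l, (sym, state, prev) in enumerate(lattice[t]):
            # best predecessor among the emissions listed in prev
            best_lp = max(prev, key=lambda lp:
                          delta[t-1][lp] * a[(lattice[t-1][lp][1], state)])
            delta[t][l] = (delta[t-1][best_lp]
                           * a[(lattice[t-1][best_lp][1], state)]
                           * b[(state, sym)])
            psi[t][l] = best_lp
    # termination and path readout
    last = max(delta[T-1], key=delta[T-1].get)
    path, l = [], last
    for t in range(T - 1, -1, -1):
        path.append(lattice[t][l][:2])   # keep (sym, state)
        l = psi[t][l]
    return list(reversed(path)), delta[T-1][last]
```

Because the max at each emission only inspects its prev set, the total work is again proportional to Ṫ, the size of the encoded sequence.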

the given output observations. For a given sentence, we define T as the number of tokens in the sentence, and T̄ as the number of vectors of the output representation O = {o_t}, 1 ≤ t ≤ T̄, where each item in the output is denoted by o_t^l = (sym, state, prev, next), 1 ≤ t ≤ T̄, 1 ≤ l ≤ |o_t|. We define α(t, l) as the probability of reaching o_t^l at time t, and β(t, l) as the probability of ending the sequence from o_t^l. The expectation and maximization steps of the learning algorithm for a first-order HMM are described in Fig. 6.9. The algorithm works in O(Ṫ) time, where Ṫ is the total number of symbols in the output sequence encoding, with each symbol counted as the size of its prev set.

6.3.3 Searching for the Best-state Sequence

The searching algorithm takes an observation sequence O and a probabilistic model µ, and looks for the best state sequence that generates the observation. We define δ(t, l) as the probability of the best state sequence that leads to emission o_t^l, and ψ(t, l) as the index of the emission at time t − 1 that precedes o_t^l in the best state sequence that leads to it. The adaptation of the Viterbi algorithm [77, 9.3.2] to our text representation for a first-order HMM, which works in O(Ṫ) time, is described in Fig. 6.10.

6.3.4 Similar Work

Chinese and Japanese Word Segmentation Many Asian languages exhibit morphological systems that turn word segmentation from orthographic tokens into a difficult and ambiguous task. As a consequence, a large body of work addresses the issue of learning word segmentation rules in Asian languages, in particular Chinese and Japanese. The approach most similar to ours was suggested by Nakagawa [86], who uses a lattice encoding of all possible segmentations given a sequence of orthographic tokens. The main difference is that Nakagawa uses a supervised method, starting from a fully segmented and annotated training corpus. Learning uses regular maximum likelihood estimation to learn the statistical HMM model from the segmented training data. For decoding, Nakagawa introduces the lattice encoding of all possible segmentations and applies a variation of Viterbi's algorithm to this data structure. Nakagawa does not report on the relative improvement that the lattice delivers with respect to a token-based decoding method. We are not aware of work on Asian languages that reports on unsupervised learning of segmentation or morphological analysis.

Unsupervised Induction of Morpheme Segmentation Creutz and Lagus [31] present an unsupervised model for morphology induction. Given a raw text corpus in a single language, the training procedure learns a morpheme4 lexicon and a grammar. The grammar is based on the notions of prefix, stem, suffix and non-morpheme items. The learning procedure optimizes the lexicon and the grammar according to various morpheme properties: (1) their form; (2) their usage – frequency, length, intra-word right and left perplexity; (3) the probabilities of category (prefix, stem, suffix, none) membership. The learned model is used by the system (Morfessor) for segmenting a given word into a sequence of morphemes. Evaluation on standard Finnish and English datasets (1.4M word forms in Finnish and 120,000 word forms in English) shows an F-measure of about 70% on the task of morpheme segmentation. Morphology induction is not necessary for Hebrew, since we have access to a good Hebrew lexicon (and a morphological analyzer). Basically, we are addressing a different task than Morfessor and similar morpheme induction systems. Even in the component of our system where we explore learning of unknown words, we rely on full knowledge of a small number of prefixes and suffixes in Hebrew, and this knowledge allows us to generate all possible segmentations of an orthographic token without assuming that the lemma exists in the lexicon. In Hebrew, the set of prefixes and suffixes forms a small closed class. This is not the case in Finnish, and therefore the motivation to learn morphemes in a completely unsupervised manner is much stronger there.

Morfessor does not assign one sequence of morphemes to a given token according to its context. In addition, Morfessor does not learn the morphological attributes associated with each morpheme (such as gender and number). Our output includes full morphological disambiguation, including all the properties of all the morphemes, and therefore must rely on context for proper disambiguation. Morpheme segmentation does not provide sufficient information for Hebrew processing applications, as we discuss in Chapter 9.

4For a discussion on the definition of morpheme, see Section 2.2.

Automatic Speech Recognition (ASR) Morphological disambiguation over a sequence of vectors of uncertain words is similar to token extraction in automatic speech recognition (ASR) [65, Chapters 5, 7]. The states of the ASR model are phones, and each observation is a vector of spectral features. Given a sequence of observations for a sentence, the decoding – based on the lattice formed by the phone distributions of the observations and the language model – searches for the set of tokens, made of phones, which maximizes the acoustic likelihood and the language model probabilities. In a similar manner, the supervised training of a speech recognizer combines a training corpus of speech wave files, together with token transcriptions and language model probabilities, in order to learn the phone model. There are two main differences between the typical ASR model and ours: (1) an ASR decoder deals with one aspect – segmentation of the observations into a set of tokens – where this segmentation can be modeled at several levels: subphones, phones, and tokens. These levels can be trained individually (for example, training a language model from a written corpus, and training the phone model for each token type given a transcribed wave file), and then combined together (in a hierarchical model). Morphological disambiguation over uncertain words, on the other hand, deals with both word segmentation and the tagging of each word with its morphological features. Modeling word segmentation within a given token without its morphological features would be insufficient. (2) The supervised resources of ASR are not available for morphological disambiguation: we have neither a model of morphological feature sequences (equivalent to the language model of ASR) nor a tagged corpus (equivalent to the transcribed wave files of ASR). These two differences require a design which combines the two dimensions of the problem in order to support unsupervised learning (and searching) of word sequences and their morphological features simultaneously.

6.4 Conclusions

We presented a text-encoding method for languages with affixational morphology in which the knowledge of word formation rules (which are quite restricted in Hebrew) helps in the disambiguation. We adapt the HMM algorithms for learning and searching over this text representation in such a way that segmentation and tagging can be learned in parallel, in one step. In Section 8.5 we experiment with the proposed model on the problem of Hebrew morphological disambiguation and compare it with the token-oriented model.

Chapter 7

Unknown Words and Analyses

The problem was difficult, for though there existed at that time, on the market, various synthetic resins and coatings used by toy merchants to display model puzzles in their shop windows, the trace of the cuts was always too visible.

The term unknowns denotes tokens that the morphological analyzer fails to analyze correctly. These tokens can be categorized into two classes of missing information: unknown tokens, which are not recognized at all by the analyzer, and unknown analyses, where the set of analyses proposed by the analyzer does not contain the correct analysis for a given token. In this chapter we investigate the characteristics of unknowns in Hebrew, and present methods to handle this unavoidable lack of information.1
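The distinction between the two classes of missing information can be made concrete with a small helper (a sketch; the function and tag names are ours, not from the dissertation's code):

```python
def classify_unknown(token, analyses, gold):
    """Categorize a token with respect to a morphological analyzer.
    `analyses` is the (possibly empty) set of analyses the analyzer proposes
    for `token`; `gold` is the correct analysis from an annotated corpus."""
    if not analyses:
        return 'unknown token'     # the analyzer does not recognize the token at all
    if gold not in analyses:
        return 'unknown analysis'  # the candidate set misses the correct analysis
    return 'known'
```

Both classes force the disambiguator to choose among candidates that may not include the truth, which is why they are measured separately below.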

7.1 Motivation

In order to estimate the importance of unknowns, we examine tokens, with respect to a given morphological analyzer, from several aspects: (1) the quantity of unknown tokens, as observed on a corpus of 27M tokens, and the classification of a sample of 10K unknown token types (out of 200K); (2) the quantity of unknown analyses, based on an annotated corpus of 200K tokens, and their classification; (3) the portion of unknowns among all other morphological disambiguation errors, and their distribution.

About 4.5% of the 27M token instances in the training corpus were unknown tokens (45% of the 450K token types). For less edited text, such as random text sampled from the Web, the percentage is much higher – about 7.5%. In order to classify these unknown tokens, we sampled 10K unknown token types and examined them manually. The classification of these tokens, with their distribution, is shown in Table 7.1. As can be seen, there are two main classes of unknown token types: neologisms (32%) and proper nouns (48%), which together cover about 80% of the unknown token instances. In addition, we measured the POS distribution of the unknown tokens of our annotated corpus, as shown in Table 7.2. Regarding unknown analyses – based on our annotated corpus, for 3% of the 100K token instances there was a problem of unknown analysis (3.65% of the token types). The POS distribution of the unknown analyses is listed in Table 7.2. This evidence illustrates the need for resolution of unknowns. The naive policy of selecting 'proper name' for all unknowns will cover only half of the errors caused by unknown tokens, i.e., 30% of all unknown tokens and analyses. The other 70% of the unknowns (5.3% of the words in the text) will be assigned a wrong tag.

1I would like to thank Yoav Goldberg for his assistance in implementing the letters Maximum Entropy model and the proper-name SVM model.

7.2 Strategy

As a result of this observation, we decided, at this stage, to focus on full morphological analysis for unknown tokens and on identification of proper names for unknown analyses and unknown tokens, according to the following scheme:

Category              Examples                                        Types   Instances
Proper names          'asulin (family name); 'a'udi (Audi)            40%     48%
Neologisms            'agabi (incidental); tizmur (orchestration)     30%     32%
Abbreviation          mz"p (DIFS); kb"t (security officer)            2.4%    7.8%
Foreign               presentacyah (presentation); 'a'ut (out)        3.8%    5.8%
Wrong spelling        'abibba'ah.ronah (springatlast);
                      'idiqacyot (idication); ryuˇsalaim (Rejusalem)  1.2%    4%
Alternative spelling  'opyynim (typical); priwwilegyah (privilege)    3.5%    3%
Tokenization          ha"sap (the"threshold); ‘al/17 (on/17)          8%      2%

Table 7.1: Unknown token categories and distribution.

• As an extension to the morphological analyzer, we built a context-free model, which assigns and ranks all possible analyses for a given unknown token.

• The disambiguator is trained on the output of the extended morphological analyzer (which does not contain any unknown token).

• At the third stage, a proper name classifier is trained on a small set of 50K tokens with ‘proper name’ annotation. The features selected for the classifier are based on the output of the disambiguator over this set.

• Given a sentence, the disambiguator assigns a tag to each token and then, as post-processing, the classifier selects the proper names of the sentence, with respect to its tagging.

This strategy accounts for all unknown tokens and for the proper names among the unknown analyses – about 80% of the unknowns. The remaining 20% are unknown analyses which are not predicted (1.5% of all the words in the text). In the following sections, we review previous work on unknown tagging, and then focus on neologism detection and proper-name identification.

Part of Speech   Unknown Tokens   Unknown Analyses   Total
Adjective        7.08%            1.68%              8.76%
Adverb           0.84%            0.92%              1.76%
Conjunction      0.12%            0.52%              0.64%
Negation         0.03%            0.6%               0.63%
Noun             12.6%            1.6%               14.2%
Numeral          1.08%            2.32%              3.4%
Preposition      0.3%             2.8%               3.1%
Pronoun          /                0.48%              0.48%
Proper name      31.8%            24.4%              56.2%
Verb             1.8%             0.4%               2.2%
Interrogative    0.102%           0.4%               0.502%
Quantifier       0.33%            0.4%               0.73%
Modal            0.258%           0.4%               0.658%
Prefix           0.288%           0.2%               0.488%
Foreign          0.21%            0.4%               0.61%
Junk             3%               1.32%              4.32%
Participle       0.42%            0.8%               1.22%
Copula           /                0.8%               0.8%
Total            60%              40%                100%

Table 7.2: Unknowns POS distribution.

7.3 Previous Work on Unknown Words Tagging

Most of the work that dealt with unknowns in the last decade focused on unknown tokens. A naive approach would assign all possible analyses to each unknown token with uniform probability, but the performance of a tagger under such a policy would be poor: there are dozens of tags in the tagset and only a few of them may match a given token. Several heuristics have been developed to reduce the possibility space and to assign a distribution over the remaining analyses.

Weischedel et al. [120] combine several heuristics in order to estimate the token generation probability according to various types of information – such as the characteristics of particular tags with respect to unknown tokens, the capitalization property of a given unknown token, and the information encoded by hyphens and specific suffixes. An accuracy of 85% in resolving unknown tokens was reported. Dermatas and Kokkinakis [34] suggested a method for guessing unknown tokens based on the distribution of hapax legomena, and reported an accuracy of 66% for English. Mikheev [82] suggested a guessing-rule technique, based on prefix morphological rules, suffix morphological rules, and ending-guessing rules. These rules are learned automatically from raw text; a tagging accuracy of about 88% was reported. As part of their work on second-order HMMs, Thede and Harper [118] extended their second-order HMM model with a matrix C = (c_{k,i}), which encodes the probability of a token with suffix s_k being generated by tag t_i. An accuracy of about 85% was reported.

Nakagawa [86] combines word-level and character-level information for Chinese and Japanese word segmentation. At the word level, a segmented word is attached to a POS, while the character model is based on the observed characters and their classification: beginning of a word (B), middle of a word (I), end of a word (E), or a character that is a word by itself (S). They apply Baum-Welch training over a segmented corpus, where the segmentation of each word and its character classification are observed, and the POS tagging is ambiguous. The segmentation (of all words in a given sentence) and the POS tagging (of the known words) are based on a Viterbi search over a lattice composed of all possible word segmentations and the possible classifications of all observed characters. Their experimental results show that the method achieves high accuracy compared with state-of-the-art methods for Chinese and Japanese word segmentation. Hebrew also suffers from ambiguous segmentation of agglutinated tokens into significant words, but its word formation rules seem to be quite different from those of Chinese and Japanese. We also could not rely on the existence of an annotated corpus of segmented word forms.

Habash and Rambow [57] used the root+pattern+features representation of Arabic tokens for the morphological analysis and generation of Arabic dialects, which have no lexicon. They report high recall (95%–98%) but low precision (37%–63%) for token types and token instances, against a gold-standard morphological analysis. We also exploit the morphological patterns characteristic of Semitic morphology, but extend the guessing of morphological features by using character-level features.

Mansour et al. [78] combine a lexicon-based tagger (such as MorphTagger [11]) and a character-based tagger (such as the data-driven ArabicSVM [35]), which includes character features as part of its classification model, in order to extend the set of analyses suggested by the analyzer. For a given sentence, the lexicon-based tagger is applied first, selecting one tag per token. In case the ranking of the tagged sentence is lower than a threshold, the character-based tagger is applied in order to produce new possible analyses. They report a very slight improvement over Hebrew and Arabic supervised POS taggers.

The resolution of Hebrew unknown tokens, over a large number of tags in the tagset (about 3,100), requires a much richer model than the heuristics used for English (for example, the capitalization feature, which is dominant in English, does not exist in Hebrew). Unlike Nakagawa, our model does not use any segmented text, and, on the other hand, it aims to select a full morphological analysis for each token, including unknowns.

7.4 Neologism Detection

7.4.1 Method

As described in Section 2.3.2, word formation in Hebrew is based on root+pattern and affixation. These patterns can be used to identify the lexical category of unknowns, as well as other inflectional properties – gender, number, person, etc. Nir [90] investigated word formation in Modern Hebrew with a special focus on neologisms; the most common formations are summarized in Table 7.5. A naive approach to unknown resolution would add, for any given token, all analyses which fit any of these formations. As recently shown by Habash and Rambow [57] – who used the root+pattern+features representation of Arabic tokens for the morphological analysis and generation of Arabic dialects, which have no lexicon – the precision of such a strategy can be quite low: they report high recall (95%–98%) but low precision (37%–63%) for token types and token instances. We examined three models for constructing the distribution of tags for unknown words; that is, whenever the KC analyzer does not return any candidate analysis, we apply these models to produce possible tags p(t|w) for the token:

Letters A maximum entropy model is built for all unknown tokens in order to estimate their tag distribution. The model is trained on the known tokens that appear in the corpus. For each analysis of a known token, the following features are extracted: (1) unigram, bigram, and trigram letters of the base-word (for each analysis, the base-word is the token without prefixes), together with their index relative to the start and end of the word. For example, the n-gram features extracted for the word abc are { a:1 b:2 c:3 a:-3 b:-2 c:-1 ab:1 bc:2 ab:-2 bc:-1 abc:1 abc:-1 }; (2) the prefixes of the base-word (as a single feature); (3) the length of the base-word. The class assigned to this set of features is the analysis of the base-word. The model is trained on all the known tokens of the corpus; each token is observed with its possible POS tags once for each of its occurrences. When an unknown token is found, the model is applied as follows: all the possible linguistic prefixes are extracted from the token (one of the 76 prefix sequences that can occur in Hebrew); if more than one such prefix is found, the token is analyzed for each possible prefix. For each possible segmentation, the full feature vector is constructed and submitted to the maximum entropy model. We hypothesize a uniform distribution among the possible segmentations and aggregate a distribution of possible tags for the analysis. If the proposed tag of the base-word is never found in the corpus preceded by the identified prefix, we remove this possible analysis. The eventual outcome of the model application is a set of possible full morphological analyses for the token – in exactly the same format as the morphological analyzer provides.

Patterns Word formation in Hebrew is based on root+pattern and affixation. Patterns can be used to identify the lexical category of unknowns, as well as other inflectional properties. Nir [90] investigated word formation in Modern Hebrew with a special focus on neologisms; the most common word-formation patterns he identified are summarized in Table 7.5. A naive approach to unknown resolution would add all analyses that fit any of these patterns, for any given unknown token. As recently shown by Habash and Rambow [57], the precision of such a strategy can be quite low. To address this lack of precision, we learn a maximum entropy model on the basis of the following binary features: one feature for each pattern listed in the Formation column of Table 7.5 (40 distinct patterns) and one feature for 'no pattern'.

Pattern-Letters This maximum entropy model is learned by combining the features of the letters model and the patterns model.
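The feature extraction described for the letters and patterns models can be sketched as follows. This is an illustrative sketch only: the function names, and the reading of a template such as CiCCen as "C matches any letter, other letters are literal", are our own assumptions, not the dissertation's implementation.

```python
import re

def letter_features(base_word, prefix=''):
    """Letter n-gram features of the base-word, indexed relative to both the
    start (1, 2, ...) and the end (-1, -2, ...), plus the prefix as a single
    feature and the base-word length."""
    n, feats = len(base_word), []
    for size in (1, 2, 3):
        for i in range(n - size + 1):
            ng = base_word[i:i + size]
            feats.append(f'{ng}:{i + 1}')               # index from the start
            feats.append(f'{ng}:{i - (n - size + 1)}')  # index from the end
    feats.append(f'prefix={prefix}')
    feats.append(f'len={n}')
    return feats

def pattern_features(base_word, templates):
    """Binary features for the patterns model: one feature per matching
    formation template, plus a 'no pattern' fallback."""
    feats = [f'pattern={t}' for t in templates
             if re.fullmatch(''.join('.' if c == 'C' else re.escape(c)
                                     for c in t), base_word)]
    return feats or ['no-pattern']
```

For example, `letter_features('abc')` reproduces the twelve n-gram features listed above, and `pattern_features('timren', ['CiCCen', 'CiCCet'])` fires only the CiCCen template of Table 7.5.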

The following example illustrates this analysis process. The token blystyym has two possible segmentations: blystyym and b+lystyym. The model gives a tag distribution for each possible base-word:

• blystyym

– adjective.masculine.plural.absolute 0.7

– proper name 0.2

– verb.infinitive 0.1

• lystyym

– adjective.masculine.plural.absolute 0.6

– proper name 0.4

Assuming a uniform distribution over the segmentations, the above two distributions are combined into one distribution:

• adjective.masculine.plural.absolute 0.35

• proper name 0.1

• verb.infinitive 0.05

• preposition+adjective.masculine.plural.absolute 0.3

• preposition+proper name 0.2
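The combination step of this example can be sketched directly (an illustrative sketch; the function name and the `prefix -> distribution` input format are ours):

```python
def combine_segmentations(seg_dists):
    """Combine per-base-word tag distributions under a uniform distribution
    over segmentations. `seg_dists` maps a prefix ('' for no prefix) to the
    tag distribution of the corresponding base-word; prefixed analyses are
    reported as 'prefix+tag'."""
    w = 1.0 / len(seg_dists)          # uniform weight per segmentation
    combined = {}
    for prefix, dist in seg_dists.items():
        for tag, p in dist.items():
            full = tag if not prefix else f'{prefix}+{tag}'
            combined[full] = combined.get(full, 0.0) + w * p
    return combined
```

Running it on the two distributions above reproduces the combined distribution listed in the text (0.35, 0.1, 0.05, 0.3, 0.2).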

7.4.2 Evaluation

For testing, we manually tagged the text used in the Hebrew Treebank (consisting of about 90K tokens) according to our tagging guidelines [40]. We measured the effectiveness of the three models with respect to the tags that were assigned to the unknown tokens in our test corpus (the 'correct tag'), according to three parameters: (1) the coverage of the model, i.e., we count cases where p(t|w) contains the correct tag with a probability larger than 0.01; (2) the ambiguity level of the model, i.e., the average number of analyses suggested for each token; (3) the average probability of the 'correct tag', according to the predicted p(t|w). In addition, for each experiment, we ran the full morphological disambiguation system, where unknowns are analyzed according to the model. Our baseline proposes the most frequent tag (proper name) for all possible segmentations of the token, with a uniform distribution. We compare the following

                 Analysis Set                        Morphological
Model            Coverage  Ambiguity  Probability   Disambiguation
Baseline         50.8%     1.5        0.48          57.3%
Pattern          82.8%     20.4       0.10          66.8%
Letter           76.7%     5.9        0.32          69.1%
Pattern-Letter   84.1%     10.4       0.25          69.8%

Table 7.3: Evaluation of unknown token full morphological analysis.

                 Analysis Set
Model            Coverage  Ambiguity  Probability   POS Tagging
Baseline         52.9%     1.5        0.52          60.6%
Pattern          87.4%     8.7        0.19          76.0%
Letter           80%       4.0        0.39          77.6%
Pattern-Letter   86.7%     6.2        0.32          78.5%

Table 7.4: Evaluation of unknown token POS tagging.

models: the three context-free models (patterns, letters, and the combined patterns and letters). The highest coverage is obtained for the combined model (pattern-letter), at 84.1%. We first show the results for full morphological disambiguation, over 3,600 distinct tags, in Table 7.3. The highest coverage is obtained for the model combining the patterns and letters models. As expected, our simple baseline has the highest precision, since the most frequent proper name tag covers about 60% of the unknown words. The eventual effectiveness of the method is measured by its impact on the eventual disambiguation of the unknown words. For full morphological disambiguation, our method achieves an error reduction of 30% (57% to 70%). Overall, with the level of 4.5% of unknown words observed in our corpus, the algorithm we have developed contributes an error reduction of 8.3% for full morphological disambiguation. While the disambiguation level of 70% is lower than the rate of 85% achieved in English, it must be noted that the task of full morphological disambiguation in Hebrew is much harder – we must select one tag out of 3,561 for unknown words, as opposed to one out of 46 in English. Table 7.4 shows the result of the disambiguation when we only take into account the POS tag of the unknown tokens. The same models reach the best results in this case as well (Pattern+Letters). The

best disambiguation result is 78.5% – still much lower than the 85% achieved in English. The main reason for this lower level is that the task in Hebrew includes segmentation of prefixes and suffixes in addition to POS classification. We are currently investigating models that take into account the specific nature of prefixes in Hebrew (which encode conjunctions, definite articles, and prepositions) in order to better predict the segmentation of unknown words.

7.5 Proper Noun Identification

As shown in Table 7.2, about 56.2% of the unknowns are proper names (31.8% unknown tokens and 24.4% unknown analyses), so it is worth dealing with proper name identification. We built an SVM classifier based on the proper name markings of the [NER] corpus. After testing several features, the properties selected for the model are: POS tags and words in a window of ±3 words. This classifier is applied as a post-processor, after the disambiguator has been applied to the sentence. For each word, it predicts whether it is a proper name, and we override the decision of the disambiguator in the two cases of disagreement: the disambiguator predicted a proper noun and the SVM predicts a non-proper noun, or the disambiguator predicted a non-proper noun and the SVM predicts a proper noun.
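The override logic of this post-processing step can be sketched as follows. The function name and the placeholder tags 'propername'/'noun' are ours; in particular, the dissertation does not specify which tag replaces a demoted proper noun, so the fallback here is an assumption.

```python
def apply_proper_name_override(tagged, svm_is_proper):
    """Post-process the disambiguator's output with the SVM's proper-name
    decision: whenever the two disagree, the SVM wins. `tagged` is a list of
    (word, tag) pairs; `svm_is_proper` is a parallel list of booleans."""
    out = []
    for (word, tag), is_proper in zip(tagged, svm_is_proper):
        if is_proper and tag != 'propername':
            tag = 'propername'   # SVM says proper noun, the tagger did not
        elif not is_proper and tag == 'propername':
            tag = 'noun'         # tagger said proper noun, SVM disagrees
                                 # (fallback tag is a placeholder assumption)
        out.append((word, tag))
    return out
```

When the two classifiers agree, the disambiguator's tag passes through unchanged.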

7.6 Results

Overall, the combination of the pattern-letters model and the proper noun classifier reached the following results: 70% of the unknowns were analyzed correctly with all morphological features, and 79% of the unknowns were tagged correctly with their POS. The baseline consists of tagging all unknowns with the most likely tag – proper name. This baseline would never propose segmenting an unknown token. It would miss all tokens which are proper names but have another possible analysis (about 3% of the tokens), and on the actual tokens which have no analysis, it would fail on all tokens which are not proper names (about 1.8% of the tokens). That is, out of the 7.5% of the tokens with no analysis provided by the analyzer, the baseline is wrong on 4.8% – that is, the baseline is wrong in 64% of the unknown cases. This is in contrast with the 31% error rate of our method. Overall, the unknown resolution method we have developed reduces the error rate of the disambiguator by 2.8% for POS tagging and segmentation – an error reduction of 23%.
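The error accounting in this comparison can be checked directly (all percentages are fractions of the tokens in the text, taken from the figures above):

```python
# Unknowns amount to 7.5% of the tokens: 4.5% unknown tokens + 3% unknown analyses.
unknown_total = 4.5 + 3.0
# The proper-name baseline fails on the non-proper-name unknown tokens (1.8%)
# and on the tokens whose proper-name reading competes with another analysis (3%).
baseline_wrong = 1.8 + 3.0
print(round(baseline_wrong / unknown_total * 100))  # prints 64
```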

Category     Formation          Example
---------------------------------------------------------------------
Verb         Template
             'iCCeC             'ibh.en (diagnosed)
             miCCeC             mih.zer (recycled)
             CiCCen             timren (manipulated)
             CiCCet             tiknet (programmed)
             tiCCeC             ti'arek (dated)
Participle   Template
             meCuCaC (a)        mˇswh.zar (reconstructed)
             muCCaC             muqlat. (recorded)
             maCCiC             malbin (whitening)
Noun         Suffixation
             ut                 h.aluciyut (pioneership)
             ay                 yomanay (duty officer)
             an                 'egropan (boxer)
             on                 pah.on (shack)
             iya                marakiyah (soup tureen)
             it                 .tiyulit (open touring vehicle)
             a                  lomdah (courseware)
             Template
             maCCeC             maˇsneq (choke)
             maCCeCa            madgera (incubator)
             miCCaC             mis‘ap (branching)
             miCCaCa            mignana (defensive fighting)
             CeCeC              pelet. (output)
             tiCCoCet           tiproset (distribution)
             taCCiC             tah.rit. (engraving)
             taCCuCa            tabru'ah (sanitation)
             miCCeCet           micrepet (leotard)
             CCiC               crir (dissonance)
             CaCCan             balˇsan (linguist)
             CaCeCet            ˇsah.emet (cirrhosis)
             CiCul              .tibu‘ (ringing)
             haCCaCa            hanpaˇsa (animation)
             heCCeC             het'em (agreement)
Adjective    Suffixation
             i                  nora'i (awful)
             ani (b)            yeh.idani (individual)
             oni (c)            .telewizyoni (televisional)
             a'i                yed.ida'i (unique)
             ali                st.udentiali (student)
             Template
             C1C2aC3C2aC3 (d)   metaqtaq (sweetish)
             CaCuC              rapus (flaccid)
Adverb       Suffixation
             ot                 qcarot (briefly)
             it                 miyadit (immediately)
             Prefixation
             b                  bekeip (with fun)

(a) CoCeC variation: ‘wyeq (a copy).
(b) The feminine form is made by the t and iya suffixes: yeh.idanit (individual), nwcriya (Christian).
(c) In the feminine form, the last h of the original noun is omitted.
(d) C1C2aC3C2oC3 variation: qt.ant.wn (tiny).

Table 7.5: Common neologism formations.

Chapter 8

Evaluation

Another time, in nineteen sixty-six, he assembled in the first three hours more than two thirds of the fortnight's puzzle... Then, during the two weeks that followed, he tried in vain to finish it.

In this chapter, we evaluate the system we developed in terms of disambiguation accuracy for full morphological analysis, and for word segmentation and POS tagging. The system was trained on the raw text of the [A7], [HR], [TM] and [KN] corpora (42M tokens). The test set is composed of the 90K-token [HR-T] corpus. Overall, our best result for full morphological analysis is 90% accuracy, and for word segmentation and POS tagging, 93%.

As part of the evaluation, we investigated the contribution of the initial conditions, the impact of the structure of the stochastic model (dependency relations), the order of the model (bigrams or trigrams), and the size of the training set. In addition, we compared the token-based model with the word-based model, together with an error analysis.

8.1 Influence of Initial Conditions

As shown by Elworthy [41] and Merialdo [81] for English tagging (see Section 2.8.3), good initial conditions for the probabilistic model can significantly improve the training process of unsupervised methods. For Hebrew, we investigated the impact of two types of initial conditions.

Morpho-lexical A context-free approximation of the morpho-lexical probabilities of the tokens in the training corpus. We implemented the algorithm of Levinger et al. [74] (see Section 2.8.3) and adapted it to our tagset (cf. Ch. 5).

Syntagmatic Initialization of syntagmatic conditions:

• Pair Constraints: Hand-crafted syntagmatic constraints on pairs of sequential tags, as suggested by Shacham and Wintner [111] (see Section 2.8.3). We define four syntagmatic constraints: (1) a construct state form cannot be followed by a verb, preposition, punctuation, existential, modal, or copula; (2) a verb cannot be followed by the preposition ˇsel (of); (3) a copula or existential cannot be followed by a verb; and (4) a verb cannot be followed by another verb, unless one of them has a prefix, or the second verb is an infinitive, or the first verb is imperative and the second verb is in future tense.1 We did not apply three of the rules of Shacham and Wintner, as they did not match the definition of tags implemented by the KC morphological analyzer, or they conflicted with too many sequences observed in our corpus.

• Initial Transitions: The initial uniform distribution of the state transitions was skewed according to a small seed of randomly selected sentences (10K annotated tokens). We initialize the p(t|t−2, t−1) distribution with smoothed ML estimates based on tag trigram and bigram counts (ignoring the tag-word annotations).

1This rule was taken from Shacham and Wintner.
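The transition initialization from the small annotated seed can be sketched as follows. This is an illustrative sketch: the interpolation of trigram, bigram and unigram ML estimates is one standard way to smooth the counts, and the interpolation weights (and function names) are our own assumptions, not values from the dissertation.

```python
from collections import Counter

def init_transitions(seed_tag_sequences, lambdas=(0.6, 0.3, 0.1)):
    """Build a smoothed estimate of p(t | t-2, t-1) from a small seed of
    tagged sentences, interpolating trigram, bigram and unigram ML counts."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for tags in seed_tag_sequences:
        uni.update(tags)
        bi.update(zip(tags, tags[1:]))
        tri.update(zip(tags, tags[1:], tags[2:]))
    total = sum(uni.values())
    l3, l2, l1 = lambdas

    def p(t, t2, t1):
        # p(t | t2, t1) ~ l3 * ML trigram + l2 * ML bigram + l1 * ML unigram
        p3 = tri[(t2, t1, t)] / bi[(t2, t1)] if bi[(t2, t1)] else 0.0
        p2 = bi[(t1, t)] / uni[t1] if uni[t1] else 0.0
        p1 = uni[t] / total
        return l3 * p3 + l2 * p2 + l1 * p1
    return p
```

The unigram back-off keeps every transition probability strictly positive, so EM-HMM training can still move mass toward transitions the seed never observed.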

                                             Context-free Tagger   EM-HMM Tagger
Morphlex          Syntagmatic          Dist   Full    Seg+Pos      Full    Seg+Pos
Uniform           Uniform              60     63.8    71.9         87.1    91.9
                  Pair constraints                                 87.8    92
                  Initial transitions                              89.8    92.8
Morphology-based  Uniform              76.8   76.4    83.1         88      92.1
                  Pair constraints                                 88.2    92.1
                  Initial transitions                              90      93

(Dist and the context-free tagger depend only on the morpho-lexical component, hence one value per group.)

Table 8.1: Initial conditions – scheme 1, model 2-.

For each of these, we first compare the computed p(t|w) against a gold-standard distribution taken from the test corpus (90K tokens), according to the measure used by [74] (Dist). On this measure, we confirm that our improved morpho-lexical approximation improves the results reported by Levinger et al. from 74% to about 76.8%, on a richer tagset and on a much larger test set (90K vs. 3,400 tokens). We then report on the effectiveness of p(t|w) as a context-free tagger that assigns to each word its most likely tag, both for full morphological analysis (3,561 tags) (Full) and for the simpler task of token segmentation and POS tag selection (36 tags) (Seg+Pos). The best results on this task, 76.4% and 83.1% respectively, are achieved with the morpho-lexical initial conditions. The tagging results for each of the following initial conditions are summarized in Table 8.1: no initial conditions, morpho-lexical approximation (Morphology-based), syntagmatic initial conditions (Pair constraints, Initial transitions), and the combination of the morpho-lexical approximation with each of the two initial syntagmatic condition types. Finally, we test the effectiveness of the initial conditions with EM-HMM learning. With no syntagmatic conditions, we reach 88% accuracy for full morphological analysis and 92.1% accuracy for POS tagging and word segmentation. As expected, EM-HMM improves the results of the context-free tagger from 76.4% to 88%. Strikingly, EM-HMM improves the uniform initial conditions from 64% to above 87%. However, better initial conditions carry us beyond this particular local maximum. A most interesting observation is the negligible contribution of the syntagmatic constraints we introduced. We found that 113,453 sentences of the corpus (about 5%) contradict these basic and apparently simple constraints. As an alternative to these

Initial Condition                         Scheme 1          Scheme 2
Morphlex          Syntagmatic             Full   Seg+Pos    Full   Seg+Pos
Uniform           Uniform                 87.1   91.9       87.4   91.7
                  Pair constraints        87.8   92         86.9   90.6
                  Initial transitions     89.5   92.6       88     91.1
Morphology-Based  Uniform                 88     92.1       88.2   92.1
                  Pair constraints        88.2   92.1       88.1   91.8
                  Initial transitions     90     93         89.6   92.7

Table 8.2: Dependency schemes – model 2-.

common-sense constraints, the small seed of transition initialization (InitTrans) has a great impact on accuracy. Overall, we reach 90% accuracy for full mor- phological analysis and 93% for word segmentation and PoS tagging – an error reduction of more than 15% from the uniform distribution, and about 60% error reduction from the context-free baseline.
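The context-free baseline against which these gains are measured can be sketched as follows; the p(t|w) table below is a toy stand-in for the morpho-lexical approximation, not real data.

```python
# Sketch of the context-free tagger: each word receives its most
# probable tag under an approximation of p(t|w), ignoring sentence
# context entirely. The distributions below are hypothetical.
p_t_given_w = {
    "spr": {"NN": 0.6, "VB": 0.4},
    "gdwl": {"JJ": 0.9, "NN": 0.1},
}

def context_free_tag(words, dist):
    """Assign each word the tag t maximizing p(t|w)."""
    return [max(dist[w], key=dist[w].get) for w in words]

print(context_free_tag(["spr", "gdwl"], p_t_given_w))  # ['NN', 'JJ']
```

EM-HMM training then refines this per-word choice using sentence context, which is why the table shows large gains over the context-free columns.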

8.2 Structure of the Stochastic Model: Depen- dency Scheme

In this experiment, we investigated two dependency schemes for the syntagmatic and the lexical information. According to Scheme 1, the probability of a tag is conditioned by the two tags that precede it, and the probability of an emitted word is conditioned by its tag and the tag that precedes it (see Fig. 8.2). We examined an alternative Scheme 2: the probability of a tag is conditioned by the tag that precedes it and by the one that follows it. The probability of an emitted word, according to this scheme, is conditioned by its tag and the tag that follows it, as illustrated in Fig. 8.5. The learning and searching algorithms were adapted for this scheme. The intuition motivating the Scheme 2 experiment is that the two closest neighbors of a word are more relevant to its classification than a word at distance two. This hypothesis was verified for English in [119] for graphical models and in [10] specifically for HMMs. For the case of Hebrew, Scheme 1 achieved better results than Scheme 2 for most types of initial conditions, as shown in Table 8.2.
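Written out, the two schemes factor the probability of a tagged word sequence as follows; this is our transcription of the definitions above and of Figures 8.2 and 8.5, not a formula taken from the text.

```latex
% Scheme 1: tag conditioned on the two preceding tags,
% emission on the tag and its predecessor
P(t_1^n, w_1^n) = \prod_{i=1}^{n} P(t_i \mid t_{i-2}, t_{i-1})\, P(w_i \mid t_{i-1}, t_i)

% Scheme 2: tag conditioned on the preceding and following tags,
% emission on the tag and its successor
P(t_1^n, w_1^n) = \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i+1})\, P(w_i \mid t_i, t_{i+1})
```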

        Scheme 1          Scheme 2
Order   Full   Seg+PoS    Full   Seg+PoS
1       89.5   92.7       89.2   92.5
2-      90     93         89.6   92.7
2       89.9   93         89.5   92.6

Table 8.3: Model order – initial conditions

8.3 Model Order

In this experiment, we examine the influence of the model order on accuracy. In the first-order model ‘1’, the probability of a tag is conditioned by the tag that precedes/follows it (according to the dependency scheme), and the probability of an emitted word depends only on its tag, as illustrated in Figures 8.1 and 8.4. In the partial second-order model ‘2-’, the probability of a tag is conditioned by the two tags that precede it, or by the tag that precedes it and the one that follows it (according to the dependency scheme), as illustrated in Figures 8.2 and 8.5. In the full second-order model ‘2’, the probability of an emitted word is additionally conditioned by its tag and the tag that precedes/follows it (according to the dependency scheme), as illustrated in Figures 8.3 and 8.6.
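For dependency Scheme 1, the three orders correspond to the following factorizations (our transcription of the definitions above; Scheme 2 substitutes the preceding/following tags accordingly):

```latex
% Model 1 (first order)
P(t_1^n, w_1^n) = \prod_i P(t_i \mid t_{i-1})\, P(w_i \mid t_i)

% Model 2- (partial second order): second-order transitions only
P(t_1^n, w_1^n) = \prod_i P(t_i \mid t_{i-2}, t_{i-1})\, P(w_i \mid t_i)

% Model 2 (full second order): second-order transitions and emissions
P(t_1^n, w_1^n) = \prod_i P(t_i \mid t_{i-2}, t_{i-1})\, P(w_i \mid t_{i-1}, t_i)
```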

We used the backoff smoothing method suggested by Thede and Harper [118], with an extension of additive smoothing [28, 2.2.1] for the lexical probabilities (B and B2 matrices). The results are listed in Table 8.3.
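The additive smoothing applied to the lexical matrices can be sketched as follows; the counts and the value of lambda are illustrative toy values, not the settings used in the experiments.

```python
# Sketch of add-lambda (additive) smoothing for one row of a lexical
# matrix, i.e. P(w | t) for a fixed tag t: every count is incremented
# by a small constant so that rare (tag, word) pairs keep non-zero
# probability. Counts and lam are toy values.
def additive_smooth(counts, vocab_size, lam):
    """Smooth count(t, w) values into a distribution P(w | t)."""
    total = sum(counts.values()) + lam * vocab_size
    return {w: (c + lam) / total for w, c in counts.items()}

row = additive_smooth({"spr": 8, "dbr": 2}, vocab_size=2, lam=0.5)
# row["spr"] = 8.5 / 11, row["dbr"] = 2.5 / 11
```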

As can be seen, Models 2- and 2 significantly improve the accuracy when compared to Model 1, but the difference between 2- and 2 is not significant. This means we can safely use Model 2-, which is much smaller than Model 2 (16M entries instead of 39M, see Section 6.1). We hypothesize that Models 2- and 2 give close results because our initial conditions are dominated by the context-free method and, therefore, the dependency between t_{i-2} and w_i is not initialized to any useful value when training Model 2.

Size    Initial Condition             Full   Seg+PoS
10.5M   Uniform                       86     91
        InitTrans                     89.3   92.4
        Morphology-Based              87     91.6
        InitTrans+Morphology-Based    89.6   92.7
21M     Uniform                       86.4   91.6
        InitTrans                     89.3   92.5
        Morphology-Based              87.7   92
        InitTrans+Morphology-Based    89.8   92.9
31.5M   Uniform                       86.9   91.9
        InitTrans                     89.5   92.6
        Morphology-Based              87.9   92
        InitTrans+Morphology-Based    89.8   93
42M     Uniform                       87.1   91.9
        InitTrans                     89.5   92.6
        Morphology-Based              88     92.1
        InitTrans+Morphology-Based    90     93

Table 8.4: Training set size – model 2-, scheme 1.

8.4 Influence of the Training Set Size

In the training process, the system iterates on the corpus in order to improve the probabilistic model. In this experiment, we examined the training of the word model on corpora of 10.5M, 21M, 31.5M and 42M tokens, as described in Table 8.4. As can be seen, the unsupervised learning keeps improving as more tokens are provided for training.

8.5 Token-oriented Model vs. Word-oriented Model

In this experiment, we compare learning with the word model (with 362 states) and with the token model (with 3,561 states). The results are listed in Table 8.5. The word model reaches better results (an error reduction of 5% for full morphology and of 13% for POS tagging and segmentation). In addition, the token model is so big that it is impractical: its size reaches over 4 Gigabytes, as opposed to 1 Gigabyte for the word model, and tagging takes at least twice as long with the token model as with the word model.

         Full   Seg+PoS
Words    90     93
Tokens   89.5   92

Table 8.5: Word model vs. Token model – scheme 1, model 2-, initial conditions

8.6 Error Analysis

The confusion matrix of the POS errors of the best model we achieved (word-based, dependency Scheme 1, trained on 42M tokens with combined initial conditions) is given in Table 8.6. The matrix is read as follows: the cell at row X and column Y indicates the percentage of the errors caused when the tagger decided on tag Y for a true tag X; e denotes an error rate below 0.1%, and / indicates that no errors of this type were detected. The key contributors to the confusion matrix are the entries that account for more than 2% of the errors. About 60% of the errors are caused by confusion between PN (proper name), NN (noun), and other categories. The key confusions are between Noun and Participle, Proper Name, Verb, and Adjective. We also identify specific errors between tags that were introduced ‘recently’ into the tagset: copula and pronoun account for over 5% of

the errors, concentrated in very few word forms (a systematic confusion over hu’, which can be read as the pronoun ‘he’ or as the copula ‘is’). It should be noted that the key confusions correspond to cases that are intrinsically ambiguous: the distinctions between present-tense verbs and participles, adjectives and nouns, and adjectives and participles are difficult to define, correspond to long and complex explanations in the tagging guidelines we prepared, and caused significant confusion among human taggers as well. To some extent, this indicates that the highly confusing pairs are ‘suspicious’ in the sense that they correspond to tags that are not well defined (either the ‘true tag’ or the tag selected by the tagger could be wrong, or more than one tag could be appropriate in the specific context).

      NN   VB   JJ   RB   PP   CC   NG   NU   PR   PN   QN   EX   CP   MD   PA   IN   PF   JC   TT   Total
NN    /    2.9  2.5  2.4  2.8  0.8  /    2    e    5.1  0.5  /    /    /    0.9  e    0.2  0.4  e    20.9
VB    2.1  /    0.1  0.1  e    e    /    /    /    0.5  0.1  /    /    e    0.4  /    /    /    /    3.6
JJ    3.3  0.9  /    0.7  e    0.1  e    e    e    0.8  /    /    /    0.4  1.9  /    /    e    /    8.5
RB    1.4  e    0.2  /    e    0.5  e    /    0.6  0.2  1.7  /    /    0.3  /    /    /    /    /    5.1
PP    1.5  2.8  0.2  0.4  /    2.2  0.6  /    0.9  1.9  0.1  /    /    /    e    /    0.1  /    /    11
CC    e    /    /    e    1.3  /    0.1  /    e    e    e    /    /    /    /    0.8  /    /    /    2.5
NG    /    /    /    /    /    0.2  /    /    /    e    /    0.9  /    /    /    /    e    /    /    1.2
NU    0.8  e    /    e    /    /    /    /    /    e    /    /    /    /    e    /    /    /    /    1
PR    0.2  /    /    e    e    0.3  /    /    /    /    0.1  /    2.2  /    /    0.3  /    /    /    3.3
PN    11.4 1.2  1.2  0.4  0.9  0.2  e    0.8  e    /    0.1  /    /    /    /    e    e    0.5  /    17.7
QN    1.7  /    /    1.3  /    0.6  e    /    /    /    /    /    /    e    e    /    /    /    /    3.8
EX    /    0.2  /    /    /    /    e    /    /    /    /    /    0.5  e    /    /    /    /    /    0.8
CP    0.3  /    /    /    e    0.2  /    /    4.2  /    /    0.2  /    /    /    /    /    /    /    4.9
MD    0.6  /    0.2  0.1  /    /    0.3  /    /    /    /    0.9  /    /    0.4  /    /    /    /    2.1
PA    5.4  2.8  1.9  0.1  e    /    /    /    e    0.2  /    /    /    0.1  /    /    /    /    /    10.5
IN    e    /    /    0.5  /    0.2  /    /    0.2  /    0.3  /    /    /    /    /    /    /    /    1.2
PF    0.6  e    0.2  e    0.2  /    0.3  /    /    0.2  e    e    /    /    /    /    /    /    /    1.6
JC    e    /    /    e    /    0.1  /    /    /    /    /    /    /    /    /    /    /    /    /    0.2
TT    /    e    /    /    /    /    /    /    /    /    /    /    /    /    /    /    /    /    /    e
Total 29.6 11   6.5  6.3  5.4  5.5  1.7  2.9  6.1  9.7  3    2.1  2.7  0.9  3.8  1.2  0.4  1    e    100

NN noun; VB verb; JJ adjective; RB adverb; PP preposition; CC conjunction; NG negation; NU numeral; PR pronoun; PN proper name; QN quantifier; EX existential; CP copula; MD modal; PA participle; IN interrogative; PF prefix; JC interjection; TT titular.

Table 8.6: Confusion matrix.

Figure 8.1: First order model – Dependency scheme 1 [P(t_i | t_{i-1}), P(w_i | t_i)].

Figure 8.2: Partial second order model – Dependency scheme 1 [P(t_i | t_{i-2}, t_{i-1}), P(w_i | t_i)].

Figure 8.3: Second order model – Dependency scheme 1 [P(t_i | t_{i-2}, t_{i-1}), P(w_i | t_{i-1}, t_i)].

Figure 8.4: First order model – Dependency scheme 2 [P(t_i | t_{i+1}), P(w_i | t_i)].

Figure 8.5: Partial second order model – Dependency scheme 2 [P(t_i | t_{i-1}, t_{i+1}), P(w_i | t_i)].

Figure 8.6: Second order model – Dependency scheme 2 [P(t_i | t_{i-1}, t_{i+1}), P(w_i | t_i, t_{i+1})].

Chapter 9

Applications

Bartlebooth is seated before his puzzle. He is a thin, almost emaciated old man, with a bald skull, a waxy complexion, extinguished eyes... his right hand grips the armrest of the chair while his left hand, resting on the table in a scarcely natural posture, almost at the limit of contortion, holds between thumb and forefinger the final piece of the puzzle.

In order to evaluate the disambiguation system, we measured its contribution to other applications that use the tagged text. In this chapter we describe two applications that were implemented for this purpose: noun-phrase chunking and named-entity recognition. We show that the morphological features given by the disambiguator improve the performance of these systems.

9.1 Noun-Phrase Chunking

NP chunking is the task of labeling noun phrases in natural language text. The input to this task is free text with part-of-speech tags. The output is the same text with brackets around base noun phrases. A base noun phrase is an NP that does not contain another NP (it is not recursive). NP chunking is the basis for many other NLP tasks such as shallow parsing, argument structure identification, and information extraction.
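Chunking output of this kind is commonly represented with per-token B/I/O labels (B-NP opens a chunk, I-NP continues it, O is outside); this encoding is an assumption here, since the chapter itself only specifies bracketed output. A minimal sketch of decoding such labels back into brackets:

```python
# Turn per-token BIO labels back into the bracketed output described
# above. The BIO encoding is an assumed representation, commonly used
# in chunking work; the English example sentence is illustrative.
def bio_to_brackets(tokens, labels):
    out, open_np = [], False
    for tok, lab in zip(tokens, labels):
        if lab == "B-NP":
            if open_np:          # close a chunk that abuts the next
                out.append("]")
            out.append("[")
            open_np = True
        elif lab == "O" and open_np:
            out.append("]")
            open_np = False
        out.append(tok)
    if open_np:                   # close a chunk ending the sentence
        out.append("]")
    return " ".join(out)

print(bio_to_brackets(["the", "big", "dog", "barked"],
                      ["B-NP", "I-NP", "I-NP", "O"]))
# [ the big dog ] barked
```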

9.1.1 Previous Work

Text chunking (and NP chunking in particular), first proposed by Abney [1], is a well-studied problem for English. The CoNLL-2000 shared task [129] addressed general chunking. The best result on the shared-task data was achieved by Zhang et al. [129]: NP-chunking results of 94.39% precision, 94.37% recall, and 94.38 F-measure, using a generalized Winnow algorithm and enhancing the feature set with the output of a dependency parser. Kudo and Matsumoto [70] used an SVM-based algorithm and achieved NP-chunking results of 93.72% precision, 94.02% recall, and 93.87 F-measure on the same shared-task data, using only the words and their POS tags as features. Similar results were obtained using Conditional Random Fields over similar features [110]. The NP chunks in the shared-task data are baseNP chunks – non-recursive NPs, a definition first proposed by [98]. This definition yields good NP chunks for English, but results in very short and uninformative chunks for Hebrew (and probably other Semitic languages). Recently, Diab et al. [35] used an SVM-based approach for Arabic text chunking. Their chunk data was derived from the LDC Arabic TreeBank using the same program that extracted the chunks for the shared task. They used the same features as Kudo and Matsumoto [70], and achieved overall chunking performance of 92.06% precision, 92.09% recall, and 92.08 F-measure (results for NP chunks alone were not reported). Since Arabic syntax is quite similar to that of Hebrew, we expect the issues reported below to apply to Arabic as well.

9.1.2 Hebrew Simple NPs

Goldberg et al. [50] have shown that due to syntactic features such as the construct state, the traditional definition of base NP chunks does not translate well to Hebrew, and probably not to other Semitic languages. Instead, they define the notion of Simple NP chunks. The definition of Simple NPs is pragmatic: we want to tag phrases that are complete in their syntactic structure, avoid tagging recursive structures that include full clauses (relative clauses, for example), and, in general, tag phrases that have a simple denotation. To establish their definition, they start with the most complex NPs and break them into smaller parts by stating what should not appear inside a Simple NP: (1) prepositional phrases;1 (2) relative clauses; (3) verb phrases; (4) apposition;2 (5) some conjunctions.3

9.1.3 Evaluation

Goldberg et al. [50] describe an SVM model for Hebrew Simple NP chunking that uses our morphological disambiguator. It was shown that using morphological features improves chunking accuracy. The key improvement is obtained by introducing lexical features: augmenting the POS tag with lexical information boosted the F-measure from 77.88 to 92.44. The addition of the extra morphological features of construct state and number yields another increase in performance, resulting in a final F-measure of 93.2% (a 0.5% improvement in precision and 1% in recall). However, the set of morphological features used should be chosen with care – the gender feature hurts performance, even though Hebrew has agreement on both number and gender. We do not have a good explanation for this observation. The results are given in Table 9.1. The features are denoted as follows: W – word, P – POS, G – gender, N – number, C – construct status, E – the features are based on the morphological disambiguator. In further analysis [49], it was shown that the set of lexical features can be reduced to a few words (less than 500) without affecting overall results – but morphological features remain critical to the stability of the results.

1 Except %-related PPs, e.g., 5% mehamkirot (5% of the sales). The preposition šel (of) is not considered a PP.
2 Apposition structure is not annotated in the TreeBank. As a heuristic, we consider every comma inside a non-conjunctive NP which is not followed by an adjective or an adjective phrase to be marking the beginning of an apposition.
3 As a special case, adjectival phrases and possessive conjunctions are considered to be inside the Simple NP.

Features   Acc     Prec    Rec     F
P          91.77   77.03   78.79   77.88
WP         97.49   92.54   92.35   92.44
WPE        94.87   89.14   87.69   88.41
WPG        97.41   92.41   92.22   92.32
ALL        96.68   90.21   90.60   90.40
WPNC       97.61   92.99   93.41   93.20
WPNCE      96.99   91.49   91.32   91.40

Table 9.1: Hebrew NP-chunking results.
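The feature sets of Table 9.1 can be pictured as per-token feature maps fed to the SVM; a minimal sketch, in which the field names and example values are hypothetical (the actual chunker's representation is not specified here):

```python
# Sketch of per-token feature extraction for the WPNC configuration of
# Table 9.1 (W word, P POS, N number, C construct status). Field names
# and the example token are hypothetical.
def wpnc_features(token):
    return {
        "W": token["word"],
        "P": token["pos"],
        "N": token.get("number", "NA"),
        "C": token.get("construct", "NA"),
    }

feats = wpnc_features({"word": "bty", "pos": "NN",
                       "number": "P", "construct": "C"})
```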

9.2 Named Entities Recognition

The Named Entity Recognition (NER) task involves identifying noun phrases that are names and assigning a class to each name. This task has great significance in the field of information extraction. A considerable amount of work has been done in recent years on named-entity taggers in many languages. The shared tasks of MUC-7, CoNLL-2002, and CoNLL-2003 concerned the NER problem. Recent work includes both knowledge-engineering and machine-learning approaches. Machine-learning approaches have the advantage of being dynamic (they adapt to new varieties of text), and their development does not require expensive professional linguistic knowledge. Machine-learning approaches include Maximum Entropy models (ME), Hidden Markov Models (HMM), and support vector machines (SVM). High accuracy has been achieved for English, where the problem is greatly simplified by the fact that most named entities start with a capital letter. In non-English languages, reported performance is significantly lower. Ben-Mordechai [15] investigated the NER task in Hebrew. Hebrew names originate from different sources; some function as nouns, verbs, and adjectives. Hebrew does not have the advantage of capital letters, and the order of the words in a Hebrew sentence is less structured than in English. These properties, as well as the prefixation mechanism (see Sections 2.4.3, 2.4.4) and other qualities unique to the Hebrew language and culture,4 make it hard to automatically determine the role of a word in a sentence, and complicate the

4 Such as the use of the Hebrew calendar and the use of Latin and foreign names.

NER problem.

9.2.1 Models

Ben-Mordechai focused on recognizing entity names (person, location, organization), temporal expressions (date, time), and number expressions (percent, money). The Hebrew NER system was trained with three different models:

Baseline The baseline system we created for Hebrew is based on regular expressions and a lexicon extracted from the training data. The regular expressions identify simple date, time, percent, and money expressions. The lexicon consists of named entities that appear in the training data. The system selects complete, unambiguous names that appear in the lexicon.

HMM States are defined as a product of the set of possible name classes and the set of possible POS tags; that is, there is a state for PERSON + NOUN, PERSON + VERB, etc. In addition, special states were defined for the beginning and end of a sentence. Overall, 212 states were extracted. The intuition for this state definition came from the fact that the syntactic structure of the sentence has a great impact on the prediction of name classes. The alphabet was defined by concatenating the following features for each word in the corpus: features representing regular expression matches; features indicating whether the word is within inverted commas; dictionary features for the current word and a window of ±2 words around it; and features indicating whether the current word is, or is part of, an expression in one of the preprocessed lists.
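The product construction of the state space can be sketched as follows; the class and POS lists are illustrative stand-ins, not the full sets that yield the 212 states reported above.

```python
# Sketch of the product state space: one HMM state per
# (name class, POS) pair, plus sentence-boundary states.
# The class and tag lists below are toy examples.
from itertools import product

name_classes = ["PERSON", "LOC", "ORG", "NONE"]   # illustrative
pos_tags = ["NN", "VB", "JJ"]                      # illustrative

states = [f"{c}+{p}" for c, p in product(name_classes, pos_tags)]
states += ["<S>", "</S>"]   # beginning/end of sentence
# 4 * 3 + 2 = 14 states in this toy version
```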

ME The Maximum Entropy probabilistic modeling technique has proved very powerful: it can handle a large quantity of statistical information (i.e., cope with a large number of features). As opposed to the HMM, an ME model treats each feature separately, giving each feature a weight according to its impact on the name-class prediction. An ME system constructs a statistical-probabilistic model that is able to evaluate the likelihood

          Precision   Recall   F-measure
PER       90.66%      73.82%   81.38%
LOC       83.09%      82.8%    82.94%
ORG       77.14%      62.03%   68.77%
DATE      90.2%       85.18%   87.62%
TIME      77.78%      87.5%    82.35%
MONEY     85.71%      85.71%   85.71%
PERCENT   97.83%      86.67%   91.91%
Overall   84.54%      74.31%   79.1%

Table 9.2: Named Entity Recognition – The combined model results.

of every word being in one of the above-mentioned categories. The system estimates probabilities based on the principle of making as few assumptions as possible. Constraints are derived from the training data, expressing relationships between features and outcomes. We look for the probability distribution that is uniform except as required by the derived constraints: the distribution with the highest entropy out of all the distributions that satisfy our constraints.
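Formally, the principle described here selects, among all distributions matching the empirical feature expectations, the one of maximum entropy; this is the standard formulation, which yields the familiar log-linear form:

```latex
p^{*} = \arg\max_{p} H(p)
\quad\text{s.t.}\quad E_{p}[f_j] = E_{\tilde{p}}[f_j] \ \ \forall j
\qquad\Longrightarrow\qquad
p^{*}(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_j \lambda_j f_j(x, y)\Big)
```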

9.2.2 Evaluation

Each of the above models was trained on the same training data. The best results were achieved by combining the three models, as follows. Given a new text, the first stage in the process was sentence detection and tokenization. In the second stage, each system tagged the text separately. In the third stage, the tagging results were merged. The main principle of the merging method was to use the ME prediction and the other predictions as backup; that is, if the ME system did not assign any name class to a word, then the other predictions were taken into consideration. The only exception to this principle concerned predictions of the name class TIME. It was found that local features are the most important ones; the most dominant features in the feature set were the dictionary features and POS tags. Experimental results for the combined system are presented in Table 9.2.
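The merging rule can be sketched as a simple priority cascade; this is a minimal sketch of the principle stated above, with the special handling of the TIME class omitted, and with None standing for "no name class assigned".

```python
# Merge predictions from the three systems: the ME prediction is
# primary; the HMM and baseline predictions serve as backup when ME
# assigns no name class (None). The TIME-class exception mentioned
# in the text is omitted here.
def merge(me_tag, hmm_tag, baseline_tag):
    for tag in (me_tag, hmm_tag, baseline_tag):
        if tag is not None:
            return tag
    return None  # no system proposed a name class
```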

Chapter 10

Contributions and Future Work

It would be tedious to draw up the list of the flaws and contradictions that came to light in Bartlebooth's project... It is difficult to say whether the project

10.1 Contributions

In this work, we investigated unsupervised methods for Hebrew morphological disambiguation. The main contributions of this work are:

Analysis system for Hebrew We have implemented a complete analysis system for Hebrew that combines all the algorithms and models described in this work. Given a Hebrew text, the system assigns a full set of morphological features to each word, extracts noun phrases, and recognizes entity names (persons, locations, organizations, temporal and number expressions). A fully operating version of the system is available online at: http://www.cs.bgu.ac.il/~nlpproj/demo. The system is implemented in Java, and operates at a rate of about 1,000 words analyzed per second on a standard 2007 PC (1GB of RAM).

Unsupervised learning model for an affixational language In contrast to English tagsets, whose sizes range from 48 to 195, the number of tags for our Hebrew corpus, based on all combinations of the morphological attributes, is about 3,600 (about 70 times larger). The large size of such a tagset is problematic in terms of data sparseness: each morphological combination appears rarely, and more samples are required in order to learn the probabilistic model.

In order to avoid this problem, we introduce a word-based model with only about 300 states, in which the size of the HMM matrices is reduced by close to 90%. We have defined a text-encoding method for languages with affixational morphology, in which knowledge of word-formation rules (which are quite restricted in Hebrew) helps in the disambiguation. We adapted the HMM algorithms for learning and searching over this text representation in such a way that segmentation and tagging can be learned in parallel, in one step.

Applying this model, as opposed to the traditional token-based model, improves accuracy, with an error reduction of over 13%.

Initial conditions Initial conditions are essential for high-quality unsupervised learning of HMM models. We investigated two methods for setting initial conditions: morpho-lexical approximations and syntagmatic conditions. Our main work was to adapt them to the comprehensive tagset we designed for this work. We have shown that good initial conditions improve model accuracy, with an error reduction of over 15%.

Unknown words analysis The term unknowns denotes tokens that cannot be analyzed by the morphological analyzer. These tokens can be categorized into two classes of missing information: unknown tokens which are not recog- nized at all by the analyzer, and unknown analyses, where the set of analyses proposed by the analyzer does not contain the correct analysis for a given token.

We investigated the characteristics of unknowns in Hebrew, and methods for

the resolution of unknowns. For the case of unknown tokens, we examined pattern-based and letter-based models. The combined letter+pattern model gives the best results in terms of coverage and accuracy. This model provides a distribution of possible tags for unknown tokens. Unknowns are then processed by the disambiguator according to the distribution the letter+pattern model provides, as if they had been present in the lexicon. In addition, we developed a post-processing model to recognize proper names within the output of the disambiguator.

Unknowns account for 7.5% of the tokens in our corpus. The patterns+letters-based model with the proper-name classifier properly classifies 79% of these instances, contributing an error reduction of 23% over the baseline of tagging unknowns with the most likely tag (Proper Name – which would correctly tag only a third of the unknowns).

Evaluation The system was evaluated according to two criteria: (1) the accuracy of the disambiguation process, for full morphological analysis and for word segmentation and POS tagging; (2) the contribution of the disambiguator to other applications that use the tagged text.

The disambiguator was tested on a wide-coverage test corpus of 90K tokens. We report an accuracy of 90% for full morphological disambiguation, and 93% for word segmentation and POS tagging. As part of the evaluation, we compared several graphical dependency schemes, model orders, and different sizes of training data.

In addition, we implemented two applications in order to estimate the impact of the morphological data given by the disambiguator: a noun-phrase chunker and a named-entity identifier. Both applications improved thanks to the morphological information provided by our disambiguator.

Construction of a high-quality large-scale annotated corpus We developed a tagged corpus of about 200K tokens, composed of articles from two daily newspapers – Ha'aretz and Arutz 7. We developed a detailed set of tagging guidelines over a period of three years to ensure that human taggers reach full agreement. Each article in our corpus was manually tagged by four taggers, and disagreements were systematically reviewed and resolved.

Tagset for Hebrew The main morphological property of words is their lexical category – their part of speech. The central parts of speech (verb, noun, and adjective) are part of the basic linguistic intuitions of all speakers. However, while working on the annotation of Hebrew text, we were surprised to realize that a complete list of parts of speech is not well established, and that there is no agreement, among dictionaries and automatic tools, on the part-of-speech set for Hebrew. Beyond verb, noun, and adjective, many other lexical units appear in text, and each raises potential questions as to what is meant by part of speech, what is the best way to label every unit in a document, and how to distinguish among the various labels. The tags we suggest for Hebrew are: adjective, adverb, conjunction, copula, existential, interjection, interrogative, modal, negation, noun, numeral, prefix, preposition, pronoun, proper name, quantifier, title, verb. Our main conclusion is that the tagset and tagging criteria for a given language cannot be imported from another, and existing dictionaries cannot be relied upon. Instead, the tagset should be defined specifically over large-scale corpora of the given language, so that all words can be tagged with high agreement. In this work, we have detailed the method we applied to design a comprehensive tagset for Hebrew and reported the remaining intrinsically difficult confusion cases.

10.2 Future work

Tagset The design of a comprehensive, well-defined tagset for Hebrew remains the most complex challenge in this area. To complete our work on the tagset, we should measure its effectiveness for morphological disambiguation and its impact on other applications that make use of the POS information. We are currently adapting the lexicon of the analyzer to support various kinds of tagsets, in order to compare their impact on a range of applications.

Other Hebrew Genres In order to justify the need for an unsupervised model, we should evaluate the system on various Hebrew text genres (ancient, literary, medical, chat), investigating the appropriate combination of different sources while learning the stochastic model.

Arabic and Amharic Other Semitic languages exhibit morphology quite similar to that of Hebrew. Several works have dealt with Arabic morphological disambiguation, supervised by the Arabic Treebank (ATB). Resources similar to those we used – a morphological analyzer and raw text corpora – are available for Arabic, and the unsupervised method we present can be applied to them. We intend to compare our method to existing morphological disambiguators for Arabic.

Applications We are in the process of developing two applications that make use of the morphological data given by the disambiguator: a multi-word expression identifier (inter-token expressions can be naturally represented in our text encoding) and a word-prediction system. Traditional word-prediction systems are based on a language model; we would like to combine syntagmatic information with lexical data in order to improve the prediction process.

Initial conditions Morpho-lexical approximation of a word in context can be very valuable as an initial condition for the model. However, we found that the context-free method suggested by Levinger et al. is hard to extend to words in context. In a naive approach, the similar-word set of a context is given by the Cartesian product of the similar-word sets of the words in that context. The problem with this approach is that, even for a large corpus and a small context of two words, occurrences of the similar contexts are very rare. Possible solutions may extend the corpus via the Internet (e.g., Google queries) or use sophisticated smoothing techniques. This issue remains to be investigated.
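The naive Cartesian-product construction, and the combinatorial growth that makes matching contexts rare, can be sketched as follows; the similar-word sets are toy examples.

```python
# Sketch of the naive approach: the similar-context set of a two-word
# context is the Cartesian product of the per-word similar-word sets.
# The sets below are toy examples; the point is that the number of
# candidate contexts grows multiplicatively, so any single candidate
# is unlikely to occur in a corpus.
from itertools import product

similar = {"w1": ["a", "b", "c"], "w2": ["x", "y"]}
contexts = list(product(similar["w1"], similar["w2"]))
# 3 * 2 = 6 candidate contexts for a two-word context
```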

DBN In this work we used HMMs for unsupervised learning. However, an HMM has disadvantages in modeling multi-feature states, such as a morphological tag composed of POS, gender, number, etc. Under this design, the common nature of two categories, such as nouns and participles, cannot be expressed. Moreover, two forms of the same category that differ in one attribute – such as two verbs with the same gender/number/person attributes but different tense – are considered totally distinct. This problem can be addressed by using a Dynamic Bayes Network, which enables modeling of dependencies between complex states.

A Dynamic Bayes Network (DBN) generalizes the HMM by allowing the state space to be represented in factored form, instead of as a single discrete random variable [84]. In the engineering community, DBNs have become the representation of choice because they embody a good tradeoff between expressiveness and tractability, and include the vast majority of models that have proved successful in practice.

Beyond the technical aspect, the difficulties we face in providing a non-ambiguous definition for certain pairs of tags (e.g., distinguishing nouns and participles) indicate that we could benefit from a hierarchical definition of tags, similar in spirit to the approach proposed by Rosen [102, chapter 5].

Other NLP problems The word-based model we suggest can help the learning process of other NLP problems which are hard to model with tokens.

Appendix A

Hebrew Morphology

A.1 Verb Inflections

In Table A.1, taken from [106, section 5.2.2.1], all the gender/number/person/tense inflections of verbs of all patterns (binyanim) are listed. In addition, each pattern has its own infinitive and bare-infinitive forms, which are not listed here. In the following representation, we ignore sound changes, morpho-phonemic rules, and irregular forms (which are listed in Hebrew grammar books).

A.2 Noun Inflections

• Feminine suffixes

– ah: talmid–talmidah (a student).

                        Singular                              Plural
             Masculine           Feminine            Masculine             Feminine
Person       1     2     3       1     2     3       1     2     3         1       2           3
Past         -tiy  -ta   -φ      -tiy  -t    -ah     -nuw  -tem  -uw       -nuw    -ten        -uw
Present      -φ (masc.), -h/-t (fem.)                -ym (masc.), -wt (fem.)
Future       ’-φ   t-φ   y-φ     ’-φ   t-iy  t-φ     n-φ   t-uw  y-uw      n-φ     t-nah/t-uw  t-nah/y-uw
Imperative         -φ                  -iy                 -uw                     -nah/-uw

Table A.1: Verb inflections.

                       Singular                                   Plural
Person          1      2 Masc.  2 Fem.  3 Masc.  3 Fem.   1        2 Masc.  2 Fem.   3 Masc.  3 Fem.
Singular stem   -iy    -ka      -ek     -ow      -ah      -enuw    -kem     -ken     -am      -an
Plural stem     -ay    -eika    -ayik   -ayw     -eiah    -einuw   -eikem   -eiken   -eihem   -eihen

Table A.2: Possessive pronoun suffixes.

– t: rcini–rcinit (serious).

– iyt: sapar–saparit (a barber).

– et: ḥayal–ḥayelet (a soldier).

– at: qereḥ–qeraḥat (a bald person).

• Number suffixes

– ym: talmid–talmidim (a student– students).

– wt: śimlah–śmalot (a dress – dresses).

– ayim: ciporen–cipornaim (nail).

• Construct state

– Singular feminine: Suffix change h–t susah–susat (horse)

– Masculine plural: ey affixation zaqen–ziqnei (an old person)

– Internal change within the stem davar–dvar (a thing–thing).

A.3 Short Formative Words

These short formative words and their functions, with some examples, are listed in Table A.3.

A.4 Pronomial Pronoun Suffixes

Table A.2, based on [106, section 5.2.2.2], lists the inflected possessive pronoun suffixes. Accusative and nominative pronouns differ only in singular first person


Word  Function                            Example
w     conjunction                         wubait (and a house)
      tense inversion                     wayah. alom (and he dreamed)
h     definite article                    hami´sh. aqim (the games)
      relativizer                         ham´sah. qim (that play)
      interrogative                       hat´sah. qw? (will you play?)
b     preposition                         bbait (in a house)
      definite preposition                babait (in the house)
k     preposition                         kh. alom (like a dream)
      definite preposition                kah. alom (like the dream)
      adverb                              kme’ah ’anaˇsim (about 100 people)
l     preposition                         lhacagah (to a show)
      definite preposition                lahacagah (to the show)
m     preposition                         meharca’ah (from a lecture)
ˇs     relativizer                         hatapuh ˇsenapal (the apple that fell down)
      subordinating conjunction           h. aˇsabti ˇseqar (I thought that it was cold)
kˇs    temporal subordinating conjunction  kˇseˇsama‘ati (when I heard)
lkˇs   temporal subordinating conjunction  likˇsetiˇsma‘ (when you will hear)
mˇs    temporal subordinating conjunction  miˇseniknas (from when it starts)
      subordinating conjunction           miˇsenimca’ (than exists)
mb    subordinating conjunction           mibanimca’ (of that exists)
ml    subordinating conjunction           mil’ah. iv (from of his brothers)

Table A.3: Short formative words.

suffix (niy instead of iy), and in singular masculine third person suffix (ehu and enuw, in addition to wo).

A.5 Inflection and Affixation according to Lexical Category

Table A.4 lists the inflection and affixation domain of each lexical category, as extracted from a corpus of 40M tokens.

Table A.4: Parts-of-speech inflections.

POS        Gen.      Num.             Per.  Stat.    Ten.  Def.  Pol.  Pre.  Suf.
Noun       M, F, MF  S, P, SP, D, DP  -     A, C, I  -     V     -     -     P
Adjective  M, F, MF  S, P             -     A, C, I  -     V     -     -     -
Numeral    M, F, MF  S, P, D          -     A, C     -     V     -     -     -

(M masculine; F feminine; MF masculine and feminine; S singular; P plural; SP singular and plural; D dual; DP dual and plural; A absolute; C construct; I irrelevant; suffix P: possessive pronoun; V: the property applies.)

Table A.4 – Continued

POS            Gen.      Num.      Per.       Stat.  Ten.           Def.  Pol.  Pre.  Suf.
Proper Name    M, F, MF  S, P, SP  -          I      -              V     -     -     -
Pronoun        M, F, MF  S, P, SP  1, 2, 3    -      -              V     -     -     -
Participle     M, F, MF  S, P      1, 3, any  A, C   -              V     -     -     P
Verb           M, F, MF  S, P      1, 2, 3    -      P, F, M, I, O  -     A, N  -     P
Preposition    -         -         -          -      -              -     -     -     R
Adverb         -         -         -          -      -              -     -     -     R
Conjunction    -         -         -          -      -              -     -     -     -
Interrogative  -         -         -          -      -              -     -     -     -

(Tense values: P past; F future; M infinitive; I imperative; O origin. Suffix values: P accusative or nominative pronoun; R pronominal.)

Table A.4 – Continued

POS           Gen.      Num.      Per.          Stat.    Ten.        Def.  Pol.  Pre.  Suf.
Quantifier    M, F, MF  S, P      -             A, C, I  -           V     -     -     -
Existential   M, F, MF  S, P      3, any        -        B, F        V     -     -     -
Negation      -         -         -             -        -           V     -     -     -
Modal         M, F, MF  S, P, SP  1, 2, 3, any  -        B, F, P     V     -     -     -
Copula        M, F      S, P      1, 2, 3       -        B, P, F, I  -     A, N  -     -
Title         M, F, MF  S, P      -             -        -           V     -     -     -
Prefix        -         -         -             -        -           V     -     -     -
Interjection  -         -         -             -        -           -     -     -     R

(B beinoni; A positive; N negative.)

Table A.4 – Continued

POS                              Gen.  Num.  Per.  Stat.  Ten.  Def.  Pol.  Pre.  Suf.
Punctuation                      -     -     -     -      -     -     -     -     -
Foreign                          -     -     -     -      -     -     -     -     -
URL                              -     -     -     -      -     -     -     -     -
Expression (date, time, number)  -     -     -     -      -     -     -     -     -

Appendix B

Selected Tagging Guidelines

B.1 Beinoni

B.1.1 Nouns vs. Verbs

Several tests were suggested to distinguish between nouns and participle verbs, as follows:1

1. Time-changing: a participle form which cannot be replaced by its past/future inflection is considered to be a noun or an adjective; otherwise, it is a present verb.

(5.45) cemah. mt.aps ‘al haqirot ⇒ cemah. .tipes ‘al haqirot (a plant is climbing on the wall ⇒ a plant climbed on the wall)

vs.

mt.apes haqirot parah. ⇒ *.tipes haqirot parah. (the wall climber has bloomed ⇒ *the wall climber bloomed)

1The first test was suggested by Blao [19, p. 186], and the rest by Shlonsky [112, pp. 27–28].


2. The genitive preposition ˇsel can only precede a complement of a noun,

e.g., hu’ ˇsomer ˇsel mip‘alim (he is a guard of factories).

3. Transitive verbs with no complement should be interpreted as nouns, e.g.

hi’ manhigah (she is a leader). On the other hand, a complement

preceded by the accusative marker ’et is not possible for nouns:

hi’ manhigah ’et haqbucah (she is leading the group).

4. An adjective modifier is only possible for nouns, e.g., hi’ manhigah dgulah (she is a great leader).

B.1.2 Adjectives vs. Verbs

Doron [36] presents several tests to distinguish between adjective and (passive)

participle verbs.

1. The negation prefix bilti modifies adjectives, e.g., bilti musmak (uncertified) vs. *bilti mˇsudar (unbroadcast).

2. Words that can appear as complements of the verbs nir’e, notar (looks, remains) are adjectives, e.g., hu’ nir’e msudar (he looks tidy) vs. *zeh nir’e muˇsar (it looks sung).

3. Participles of the form pa‘ul are usually adjectives (except for those that are listed above), e.g., na‘ul, ka’ub (locked, hurt).

4. When transforming the sentence to past or future, participle forms that function as adjectives do not change, and the auxiliary verbs hayah, yihyeh (was, will be) are added – ha’iˇs msupar qacar ⇒ ha’iˇs hayah msupar qacar (the hair of the man is cut short ⇒ the hair of the man was cut short) – while under the same transformation, participle forms that function as verbs change to the proper tense – hasipur msupar bkol ha‘ir ⇒ hasipur ysupar bkol ha‘ir (the story is recounted all over the town ⇒ the story will be recounted all over the town).


5. Inversion is possible for verbs, but not for adjectives, e.g., nigmar haproyeqt vs. *gamur haproyeqt (finished is the project).

6. The addition of ‘al yedei (by) is always possible for verbs but not for adjectives, e.g., mahalak msurbal ⇒ *mahalak msurbal ‘al yedei hamiplagah (a clumsy movement ⇒ *a clumsy movement by the party) vs. ha‘ir mutqepet (the city is under attack) ⇒ ha‘ir mutqepet ‘al yedei hacaba’ (the city is under attack by the army).2

7. Adjectives are gradable and can be modified by words such as yoter, haki (more, most), e.g., hu’ mnahel maclih. ⇒ hu’ hamnahel haki maclih. (he is a successful manager ⇒ he is the most successful manager) vs. hu’ maclih. lnaceh. bkol mi´sh. aq ⇒ *hu’ haki maclih. lnaceh. bkol mi´sh. aq (he manages to win at any game ⇒ *he most manages to win at any game).

B.1.3 Adjectives vs. Nouns

Adjectives can sometimes be used with an implicit nominal head and fill the role

of a noun in the sentence. In these cases, they are still tagged as adjectives.

• hah. akamim hikrizu (the sages announced). The token hah. akamim (the sages) should be tagged as a definite noun, since the lexeme h. akam (sage) is listed in the lexicon as a noun.

• hatmimim naplu qorban lahona’a (the naive were

victims of the fraud). There is no noun entry for the lexeme tamim

(naive) in the lexicon, so the token hatmimim (the naive) will be tagged as a definite adjective.

2With some exceptions, such as hah. alipah tpurah ‘al yedei h. ayat (the suit is sewn by a tailor).


• hapras no‘ad liptor ’et hamukˇsarim mib‘ayot parnasah (the prize was intended to free the talented people from livelihood hardship). The token mukˇsarim (talented people) should be tagged as an adjective, due to the absence of a noun entry for the lexeme mukˇsar (talented) in the lexicon.

Even when the adjective is used in a construct state (smih. ut), and the overall

phrase fills the slot of a noun phrase, the adjective is tagged as an adjective, e.g.,

re‘ulei panim h. at.pu ’ezrah. b‘iraq (the masked persons kidnapped a citizen in Iraq). In the case of a participle which is not listed as a noun or as an adjective, the

POS may be determined by the role it fills in the sentence:

• qaninu ˇsnei mt.apsim lamirpeset (we bought two

climbers for the patio) – the token mt.apsim (climbers) is a noun.

• cemah. mt.apes (a climbing plant) – the token mt.apes (climbing) is an adjective.

However, we found such guidance to be too complicated for the annotators, so the decision was to tag such forms as participles.

Bibliography

[1] Steven P. Abney. Parsing by chunks. In Robert C. Berwick, Steven P. Abney, and Carol Tenny, editors, Principle-Based Parsing: Compu- tation and Psycholinguistics, pages 257–278. Kluwer, Dordrecht, The Netherlands, 1991.

[2] Meni Adler and Michael Elhadad. An unsupervised morpheme-based HMM for Hebrew morphological disambiguation. In Proceedings of COLING-ACL-06, Sydney, Australia, 2006.

[3] Meni Adler, Yael Netzer, Yoav Goldberg, David Gabay, and Michael Elhadad. Tagging a Hebrew corpus: The case of participles. In Proceedings of the 6th Language Resources and Evaluation Conference (LREC), Marrakech, Morocco, 2008.

[4] Emmanuel Allon. Unvocalized Hebrew Writing. Ben-Gurion University Press, 1995. (in Hebrew).

[5] Ora Ambar. From modality to an emotional situation. Te‘udah, 9:235– 245, 1995. (in Hebrew).

[6] Mark Aronoff. Word Formation in Generative Grammar. MIT Press, 1976.

[7] Mark Aronoff and Kirsten Fudeman. What is Morphology? Blackwell, Malden, 2005.

[8] Isaac Avinery. YAD HALLASHON - Lexicon of Linguistic Problems in the Hebrew Language. Yizra’el, Tel Aviv, 1964. (in Hebrew).

[9] Eitan Avneyon, Raphael Nir, and Idit Yosef. Milon sapir: The Encyclopedic Sapphire Dictionary. Hed Artsi, Tel-Aviv, Israel, 2002. (in Hebrew).

[10] Michele Banko and Robert C. Moore. Part-of-speech tagging in con- text. In Proceedings of Coling 2004, pages 556–561, Geneva, Switzer- land, Aug 23–Aug 27 2004. COLING.

[11] Roy Bar-Haim, Khalil Sima’an, and Yoad Winter. Choosing an optimal architecture for segmentation and POS-tagging of Modern Hebrew. In Proceedings of ACL-05 Workshop on Computational Approaches to Semitic Languages, 2005.

[12] Leonard E. Baum. An inequality and associated maximization tech- nique in statistical estimation for probabilistic functions of a Markov process. Inequalities, 3:1–8, 1972.

[13] Mordechai Ben-Asher. The Consolidation of the Normative Grammar. Hakibbutz Hameuchad, Haifa, Israel, 1969. (in Hebrew).

[14] Mordechai Ben-Asher. On the prepositions in Modern Hebrew. Leˇsonenu, XXXVIII:285–294, 1974. (in Hebrew).

[15] Na’ama Ben-Mordechai. Named entities recognition in Hebrew. Mas- ter’s thesis, Ben-Gurion University of the Negev, Beer-Sheva, Israel, 2005. (in Hebrew).

[16] Julian Benello, Andrew W. Mackie, and James A. Anderson. Syntactic category disambiguation with neural networks. Computer Speech and Language, 3:203–217, 1989.

[17] R. A. Berman. Productivity in the lexicon: New-word formation in Modern Hebrew. Folia Linguistica, XXI/2-4:425–461, 1987.

[18] Ruth Berman. The case of (s)vo language: Subjectless constructions in Modern Hebrew. Language, 56:759–776, 1980.

[19] Yehoshua Blao. Syntax Fundamentals. Hebrew Institute for Written Education, 1966. (in Hebrew).

[20] Leonard Bloomfield. Language. Holt, New York, 1933.

[21] D. Bolinger. Aspects of Language. Harcourt Brace Jovanovich, New York, second edition, 1975.

[22] G. Johnson Botterweck and Helmer Ringgren. Theological Dictionary of the Old Testament. William B. Eerdmans Publishing Company, Grand Rapids Michigan, 1975. translated by John T. Wills.

[23] Eric Brill. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21:543–565, 1995.

[24] Carl Brockelmann. Grundriss der vergleichenden Grammatik der semitischen Sprachen. Georg Olms Verlagsbuchhandlung, Hildesheim, 1966. (in German).

[25] Tim Buckwalter. Buckwalter arabic morphological analyzer, version 2.0, 2004.

[26] David Carmel and Yoelle S. Maarek. Morphological disambiguation for Hebrew search systems. In Proceedings of NGITS-99, pages 312–326, 1999.

[27] Eugene Charniak, Curtis Hendrickson, Neil Jacobson, and Mike Perkowitz. Equations for part-of-speech tagging. In The Eleventh National Conference on Artificial Intelligence, pages 784–789, 1993.

[28] Stanley F. Chen. Building Probabilistic Models for Natural Language. PhD thesis, Harvard University, Cambridge, MA, 1996.

[29] Yaacov Choueka. Rav-Milim - A Comprehensive Dictionary of Modern Hebrew, literally: Multi-Words. C.E.T, Miskal and Steimatzky, Tel- Aviv, Israel, 1997.

[30] Yaacov Choueka, Uzi Freidkin, Hayim A. Hakohen, and Yael Zachi-Yannay. Rav Milim: A Comprehensive Dictionary of Modern Hebrew. Steimatzky, Tel-Aviv, Israel, 1997. (in Hebrew).

[31] Mathias Creutz and Krista Lagus. Unsupervised models for morpheme segmentation and morphology learning. ACM Trans. Speech Lang. Process., 4(1):3, 2007.

[32] Walter Daelemans, Jakub Zavrel, Peter Berck, and Steven Gillis. Mbt: A memory-based part of speech tagger generator. In WVLC, volume 4, pages 14–27, 1996.

[33] Ferdinand de Haan. Typological approaches to modality in approaches to modality. In William Frawley, editor, Approaches to Modality, pages 27–69. Mouton de Gruyter, Berlin, 2005.

[34] Evangelos Dermatas and George Kokkinakis. Automatic stochas- tic tagging of natural language texts. Computational Linguistics, 21(2):137–163, 1995.

[35] Mona Diab, Kadri Hacioglu, and Daniel Jurafsky. Automatic tagging of Arabic text: From raw text to base phrase chunks. In Proceedings of HLT-NAACL-04, 2004.

[36] Edit Doron. The passive participle. Hebrew Linguistics, 47:39–62, 2000. (in Hebrew).

[37] Oswald Ducrot and Tzvetan Todorov. Dictionnaire encyclop´edique des sciences du langage. Editions´ de Seuil, Paris, 1972.

[38] Oswald Ducrot and Tzvetan Todorov. Encyclopedic dictionary of the science of language. John Hopkins University, Baltimore, MD, 1979.

[39] Kevin Duh and Katrin Kirchhoff. Pos tagging of dialectal Arabic: A minimally supervised approach. In Proceedings of ACL-05 Workshop on Computational Approaches to Semitic Languages, 2005.

[40] Michael Elhadad, Yael Netzer, David Gabay, and Meni Adler. Hebrew morphological tagging guidelines. Technical report, Ben-Gurion University, Dept. of Computer Science, 2005.

[41] David Elworthy. Does Baum-Welch re-estimation help taggers? In Proceedings of ANLP-94, 1994.

[42] Avraham Even-Shoshan. Even Shoshan’s Dictionary - Renewed and Updated for the 2000s. Am Oved, Kineret, Zmora-Bitan, Dvir and Yediot Aharonot, 2003. (in Hebrew).

[43] Knowledge Center for Processing Hebrew. Hebrew mor- phological analyzer. http://yeda.cs.technion.ac.il: 8088/XMLMorphologicalAnalyzer.

[44] W. N. Francis. A tagged corpus - problems and prospects. In S. Green- baum, G. Leech, and J. Svartvic, editors, Studies in English Linguistics for Randolph Quirk, pages 192–209. Longman, London and New York, 1979.

[45] Roger Garside, Geoffrey Leech, and Geoffrey Sampson. The computa- tional analysis of English. A corpus-based approach. Longman, Lon- don, 1987.

[46] Friedrich H. W. Gesenius. Hebrew Grammar. The Clarendon Press, Oxford, 1976. Edited and enlarged by E. Kautzsch, English edition by A. E. Cowley.

[47] H. A. Gleason. An Introduction to Descriptive Linguistics. Holt, Rine- hart and Winston, 1961.

[48] Lewis Glinert. The Grammar of Modern Hebrew. Cambridge Univer- sity Press, NY, USA and Melbourne Australia, 1989.

[49] Yoav Goldberg and Michael Elhadad. SVM model tampering and anchored learning: A case study in Hebrew NP chunking. In Proceedings of ACL 2007, Prague, Czech Republic, 2007.

[50] Yoav Goldberg, Michael Elhadad, and Meni Adler. Noun phrase chunking in Hebrew: Influence of lexical and morphological features. In Proceedings of COLING-ACL-06, Sydney, Australia, 2006.

[51] Moshe Goshen-Gotshtein, Ze’ev Livne, and Shlomo Shpan. The Prac- tical Hebrew Grammar. Schoken, Jerusalem, 1966. in Hebrew.

[52] Moshe Goshen-Gottstein. Semitic morphological structures: The basic morphological structure of Biblical Hebrew. In H. B. Rosen, editor, Studies in Egyptology and Linguistics in Honour of H. J. Polotsky, pages 104–116. Israel Exploration Society, Jerusalem, Israel, 1964.

[53] Barbara B. Greene and Gerald M. Rubin. Automatic grammatical tagging of English. Technical report, Brown University, Providence, RI, 1971.

[54] Yehuda Gur. The Hebrew Language Dictionary. Dvir, Tel-Aviv, Israel, 1946. (in Hebrew).

[55] Nizar Habash. Arabic morphological representations for machine translation. In Antal van den Bosch and Abdelhadi Soudi, editors, Arabic Computational Morphology: Knowledge-based and Empirical Methods. Springer, 2007.

[56] Nizar Habash and Owen Rambow. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of ACL-05, 2005.

[57] Nizar Habash and Owen Rambow. MAGEAD: A morphological analyzer and generator for the Arabic dialects. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 681–688, Sydney, Australia, July 2006. Association for Computational Linguistics.

[58] D. Z. Hakkani-Tur. Statistical Modeling of Agglutinative Languages. PhD thesis, Bilkent University, 2000.

[59] M. A. K. Halliday. An introduction to functional grammar. Edward Arnold, London and Baltimore, second edition, 1985.

[60] Nadav Har’el and Dan Kenigsberg. Hebrew spell checker tool.

[61] Nadav Har’el and Dan Kenigsberg. HSpell - the free Hebrew spell checker and morphological analyzer. Israeli Seminar on Computational Linguistics, December 2004, 2004.

[62] Charles F. Hockett. Two models of grammatical description. Word, 10:210–234, 1954.

[63] Charles F. Hockett. A Course in Modern Linguistics. The MacMillan Company, New York, 1958.

[64] R. Jackendoff. Morphological and semantic regularities in the lexicon. Language, 51:639–671, 1975.

[65] Daniel Jurafsky and James H. Martin. Speech and language processing. Prentice-Hall, 2000.

[66] Yaakov Knaani. The Hebrew Language Lexicon. Masada, Jerusalem, Israel, 1960. (in Hebrew).

[67] Ludwig Koehler and Walter Baumgartner. The Hebrew and Aramaic Lexicon of the Old Testament. Brill, Leiden - New York - Koln, 1994. translated and edited by M.E.J. Richardson.

[68] Alexander Kohut. Aruch Completum auctore Nathane filio Jechielis. Hebraischer Verlag - Menorah, Wien-Berlin, 1926. (in Hebrew).

[69] Ziona Kopelovich. Modality in Modern Hebrew. PhD thesis, University of Michigan, 1982.

[70] Taku Kudo and Yuji Matsumato. Use of support vector learning for chunk identification. In Proceedings of CoNLL-00 and LLL-00, Lisbon, Portugal, 2000.

[71] J. Kupiec. Robust part-of-speech tagging using hidden Markov model. Computer Speech and Language, 6:225–242, 1992.

[72] Gennady Lembersky. Named entities recognition; compounds: ap- proaches and recognitions methods. Master’s thesis, Ben-Gurion Uni- versity of the Negev, Beer-Sheva, Israel, 2001. (in Hebrew).

[73] Moshe Levinger. Morphological disambiguation in Hebrew. Master’s thesis, Technion, Haifa, Israel, 1992. (in Hebrew).

[74] Moshe Levinger, Uzi Ornan, and Alon Itai. Learning morpholexical probabilities from an untagged corpus with an application to Hebrew. Computational Linguistics, 21:383–404, 1995.

[75] Yitschak Livny and Moshe Kochva. Hebrew Grammar. ‘ever, Jerusalem, 1965. in Hebrew.

[76] Ralph B. Long. The Sentence and its Parts. University of Chicago Press, Chicago and London, 1961.

[77] Christopher D. Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.

[78] Saib Mansour, Khalil Sima’an, and Yoad Winter. Smoothing a lexicon- based pos tagger for Arabic and Hebrew. In ACL07 Workshop on Com- putational Approaches to Semitic Languages, Prague, Czech Republic, 2007.

[79] Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19:313–330, 1993.

[80] W. Mayerthaler. Morphological Naturalness. Karoma, Ann Arbor, MI, 1988. (translated by J. Seidler).

[81] Bernard Merialdo. Tagging English text with probabilistic model. Computational Linguistics, 20:155–171, 1994.

[82] Andrei Mikheev. Automatic rule induction for unknown-word guess- ing. Computational Linguistics, 23(3):405–423, 1997.

[83] P. Miller. Postlexical cliticization vs. affixation: Coordination criteria. In C. Canakis, G. Chan, and J. Denton, editors, Papers from the 28th Regional Meeting of the Chicago Linguistic Society, pages 382–396. The Chicago Linguistic Society, Chicago, 1992.

[84] K. P. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, University of California, Berkeley, 2002.

[85] Simcha Nahir. The Principles of the Sentence Theory. The Hebrew Realistic School, Haifa, 1963. (in Hebrew).

[86] Tetsuji Nakagawa. Chinese and Japanese word segmentation using word-level and character-level information. In Proceedings of the 20th international conference on Computational Linguistics, Geneva, 2004.

[87] Yael Netzer. Design and evaluation of a functional input specification language for the generation of bilingual nominal expressions. Master’s thesis, Ben-Gurion University, 1997.

[88] Yael Netzer, Meni Adler, David Gabay, and Michael Elhadad. Can you tag the modal? You should! In Proceedings of COLING-ACL-07, Prague, Czech Republic, 2007.

[89] Raphael Nir. Introduction to Linguistics. The Open University of Israel, Tel-Aviv, Israel, 1989.

[90] Raphael Nir. Word-Formation in Modern Hebrew. The Open Univer- sity of Israel, Tel-Aviv, Israel, 1993.

[91] Noam Ordan. Generation rules documentation. Technical report, Knowledge Center for Processing Hebrew, 2007. (in Hebrew).

[92] Uzi Ornan. The parts of speech. L˘eˇson´enu, XXV:35–56, 1960. (in Hebrew).

[93] Uzi Ornan. Hebrew in Latin script. L˘eˇson´enu, LXIV:137–151, 2002. (in Hebrew).

[94] Uzzi Ornan. Ways of innovating words. In Evolution and Renewal: Trends in the Development of the Hebrew Language, pages 77–101, Jerusalem, 1996. The Israel Academy of Sciences and Humanities. (in Hebrew).

[95] Uzzi Ornan. The Final Word - Mechanism for Hebrew Word Genera- tion. Haifa University Press, Haifa, Israel, 2003. (in Hebrew).

[96] A. Pirkola. Morphological typology of languages for IR. Journal of Documentation, 57:330–348, 2001.

[97] Andrew Radford. Syntactic Theory and the Structure of English - a Minimalist Approach. Cambridge University Press, Cambridge, 1997.

[98] Lance A. Ramshaw and Mitchell P. Marcus. Text chunking using transformation-based learning. In Proceedings of the 3rd ACL Workshop on Very Large Corpora, Cambridge, 1995.

[99] Adwait Ratnaparkhi. A maximum entropy model for part-of-speech tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, volume 1, pages 133–142, 1996.

[100] D. Ravid. Internal structure constraints on new-word formation devices in Modern Hebrew. Folia Linguistica, XXIV/3-4:289–347, 1990.

[101] Graeme D. Ritchie, Graham J. Russell, Alan W. Black, and Stephen G. Pulman. Computational Morphology - Practical Mechanisms for the English Lexicon. The MIT Press, Cambridge MA, London, England, 1992.

[102] Haim B. Rosen. Contemporary Hebrew. Mouton, The Hague, Paris, 1977.

[103] Roy Bar-Haim, Khalil Sima’an, and Yoad Winter. Part-of-speech tagging of Modern Hebrew text. Natural Language Engineering, 2007. (forthcoming).

[104] Beatrice Santorini. Part-of-speech tagging guidelines for the Penn Treebank Project. 3rd revision;. Technical report, Department of Com- puter and Information Science, University of Pennsylvania, 1995.

[105] Helmut Schmid. Probabilistic part-of-speech tagging using decision trees. In International conference on new methods in language pro- cessing, pages 44–49, 1994.

[106] Ora (Rodrigue) Schwarzald. Studies in Hebrew Morphology. Volumes 1–4. The Open University of Israel, Tel-Aviv, Israel, 2002. (in Hebrew).

[107] Anna Maria Di Sciullo and Edwin Williams. On the Definition of Word. MIT Press, Cambridge, MA, 1987.

[108] Erel Segal. Hebrew morphological analyzer for Hebrew undotted texts. Master’s thesis, Technion, Haifa, Israel, 2000. (in Hebrew).

[109] Moshe Tsvi Segal. The Grammar of the Talmudic Language. Dvir, Tel-Aviv, 1936. (in Hebrew).

[110] Fei Sha and Fernando Pereira. Shallow parsing with conditional random fields. Technical report, University of Pennsylvania, 2003.

[111] Danny Shacham and Shuly Wintner. Morphological disambiguation of Hebrew: A case study in classifier combination. In Proceedings of EMNLP-07, Prague, Czech Republic, 2007.

[112] Ur Shlonsky. Clause Structure and Word Order in Hebrew and Arabic. Oxford University Press, New York, Oxford, 1997.

[113] Khalil Sima’an, Alon Itai, Yoad Winter, Alon Altman, and Noa Nativ. Building a tree-bank of Modern Hebrew text. Journal Traitement Automatique des Langues (t.a.l.), 2001. Special issue on NLP and corpus linguistics.

[114] James Sledd. A Short Introduction to English Grammar. University of Texas, Scott, Foresman and Company, 1959.

[115] Andrew Spencer. Morphological Theory. Basil Blackwell, London, 1991.

[116] Andrew Spencer. Morphology. In Mark Aronoff and Janie Rees- Miller, editors, The Handbook of Linguistics, pages 213–237. Blackwell, Malden, Mass, 2000.

[117] Talmud Babli, Re’em, Vilna, 1883.

[118] Scott M. Thede and Mary P. Harper. A second-order hidden Markov model for part-of-speech tagging. In Proceedings of ACL-99, 1999.

[119] Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In HLT-NAACL, 2003.

[120] R. Weischedel, R. Schwartz, J. Palmucci, M. Meteer, and L. Ramshaw. Coping with ambiguity and unknown words through probabilistic models. Computational Linguistics, 19:359–382, 1993.

[121] Shuly Wintner. Definiteness in the Hebrew noun phrase. Journal of Linguistics, 36:319–363, 2000.

[122] M. Yoeli. Hebrew Syntax. Yesodot, Tel-Aviv, 1963. in Hebrew.

[123] Shlomo Yona. A discussion on the formative letters combination in modern Hebrew. Technical report, Haifa University, 2004.

[124] Shlomo Yona. A finite-state based morphological analyzer for Hebrew. Master’s thesis, Haifa University, 2004.

[125] Shlomo Yona and Shuly Wintner. A finite-state morphological grammar of Hebrew. Natural Language Engineering, 2007. (forthcoming).

[126] James J. S. Yoon and Elabbas Benmamoun. Basic concepts and choices in the theory of morphology, 2002.

[127] Yitzhak Zadka. The Practical Hebrew Grammar. Qiryat Seper, Jerusalem, 1995. (in Hebrew).

[128] Yitzhak Zadka. The single object ”rider” verb in current Hebrew: Classification of modal, adverbial and aspectual verbs. Te‘udah, 9:247– 271, 1995. (in Hebrew).

[129] Tong Zhang, Fred Damerau, and David Johnson. Text chunking based on a generalization of winnow. Journal of Machine Learning Research, 2:615–637, 2002.

[130] Arnold Zwicky. Some choices in the theory of morphology. In Robert Levine, editor, Formal grammar: Theory and Implementation, pages 327–371. Oxford University Press, New York, 1992.
