Can You Tag the Modal? You Should.

Yael Netzer and Meni Adler and David Gabay and Michael Elhadad
Ben Gurion University of the Negev
Department of Computer Science
POB 653 Be’er Sheva, 84105, Israel
(yaeln|adlerm|gabayd|elhadad)@cs.bgu.ac.il

Abstract

Computational linguistics methods are typically first developed and tested in English. When applied to other languages, assumptions from English data are often applied to the target language. One of the most common such assumptions is that a “standard” part-of-speech (POS) tagset can be used across languages with only slight variations. We discuss in this paper a specific issue related to the definition of a POS tagset for Modern Hebrew, as an example to clarify the method through which such variations can be defined. It is widely assumed that Hebrew has no syntactic category of modals. There is, however, an identified class of words which are modal-like in their semantics, and can be characterized through distinct syntactic and morphologic criteria. We have found wide disagreement among traditional dictionaries on the POS tag attributed to such words. We describe three main approaches when deciding how to tag such words in Hebrew. We illustrate the impact of selecting each of these approaches on agreement among human taggers, and on the accuracy of automatic POS taggers induced for each method. We finally recommend the use of a “modal” tag in Hebrew and provide detailed guidelines for this tag. Our overall conclusion is that tagset definition is a complex task which deserves appropriate methodology.

1 Introduction

In this paper we address one linguistic issue that was raised while tagging a Hebrew corpus for part of speech (POS) and morphological information. Our corpus is comprised of short news stories. It includes roughly 1,000,000 tokens, in articles of typical length between 200 and 1000 tokens. The articles are written in a relatively simple style, with a high token/word ratio. Of the full corpus, a sample of articles comprising altogether 100,000 tokens was assembled at random and manually tagged for part of speech. We employed four students as taggers. An initial set of guidelines was first composed, relying on the categories found in several dictionaries and on the Penn treebank POS guidelines (Santorini, 1995). Tagging was done using an automatic tool¹. We relied on existing computational lexicons (Segal, 2000; Yona, 2004) to generate candidate tags for each word. As many words from the corpus were either missing or tagged in a non-uniform manner in the lexicons, we recommended looking up missing words in traditional dictionaries. Disagreement was also found among copyrighted dictionaries, both for open and closed set categories. Given the lack of a reliable lexicon, the taggers were not given a list of options to choose from, but were free to tag with whatever tag they found suitable. The process, although slower and bound to produce unintentional mistakes, was used for building a lexicon, and to refine the guidelines and on occasion modify the POS tagset. When constructing and then amending the guidelines we sought the best trade-off between accuracy and meaningfulness of the categorization, and simplicity of the guidelines, which is important for consistent tagging.

¹ http://wordfreak.sourceforge.net

Initially, each text was tagged by four different people, and the guidelines were revised according to questions or disagreements that were raised. As the guidelines became more stable and the disagreement rate decreased, each text was tagged by three people only, and eventually by two taggers and a referee who reviewed disagreements between the two. The disagreement rate between any two taggers was initially as high as 20%, and dropped to 3% after a few rounds of tagging and revising the guidelines.

Major sources of disagreement that were identified include:

Prepositional phrases vs. prepositions: In Hebrew, formative letters – ב,כ,ל,מ b,k,l,m² – can be attached to a noun to create a short prepositional phrase. In some cases, such phrases function as a preposition and the original meaning of the noun is not clearly felt. Some taggers would tag the word as a prepositional prefix + noun, while others tagged it as a preposition, e.g., בעקבות b’iqbot (following), which can be tagged as ב־עקבות b-iqbot (in the footsteps of).

Adverbial phrases vs. adverbs: The problem is similar to the one above, e.g., בדיוק bdiyuq (exactly) can be tagged as b-diyuq (with accuracy).

Participles vs. adjectives: As both categories can modify nouns, it is hard to distinguish between them, e.g., מבט מאיים mabat m’ayem (a threatening stare): the category of מאיים m’ayem is unclear.

² Transcription according to (Ornan, 2002)

Another problem, on which the remainder of the article focuses, was a set of words that express modality, and commonly appear before verbs in the infinitive. Such words were tagged as adjectives or adverbs, and the taggers were systematically uncertain about them.

Besides the disagreement among taggers, there was also significant disagreement among the Modern Hebrew dictionaries we examined, as well as computational analyzers and annotated corpora. Table 1 lists the various selected POS tags for these words, as determined by: (1) Rav Milim (Choueka et al., 1997), (2) Sapir (Avneyon et al., 2002), (3) Even-Shoshan (Even-Shoshan, 2003), (4) Knaani (Knaani, 1960), (5) HMA (Carmel and Maarek, 1999), (6) Segal (Segal, 2000), (7) Yona (Yona, 2004), (8) Hebrew Treebank (Sima’an et al., 2001).

Table 1: Parts of speech of selected words

Tag key: J adjective, R adverb, V verb, A auxiliary verb, N noun, T particle, P preposition, U unknown. Tags are listed in the order of sources (1) to (8); a row with more than eight tags reflects a source that lists more than one tag for the word.

יש yeš (should): R N N R N A R V
    יש לשים לב לניסוח
    yeš laśim leb lanisuḥ
    Attention should be paid to the wording

אין ’ein (shouldn’t): N R U U P P R V R
    אין לשים לב לניסוח
    ’ein laśim leb lanisuḥ
    Attention should not be paid to the wording

חייב ḥayab (must): J J J J J J J V
    הציבור חייב להבין את העניין
    hacibur ḥayab lhabin ’et ha‘inyan
    The public should be made aware of this issue

מותר mutar (allowed): R N J R J A V J
    מותר לה לצאת לטיול
    mutar lah lacet lṭiyul
    She is allowed to go on a trip

אסור ’asur (forbidden): J R R R R J A J V
    אסור לה לצאת לטיול ביום ראשון
    ’asur lah lacet lṭiyul byom rišon
    She is not allowed to go on a trip on Sunday

אפשר ’epšar (may): U R R R T A R V
    אפשר לרמוז רמזים
    ’epšar lirmoz rmazim
    Giving hints is allowed

אמור ’amur (supposed): J A J J J A J V
    נשים אמורות ללבוש רעלות
    našim ’amurot lilboš r‘alot
    Women are supposed to wear veils

צריך carik (should): J J R J J A J V
    במו״מ צריך לעמוד על שלך
    bmw”m carik la‘amod ‘al šelka
    In negotiation you should keep strong

ניתן nitan (can): U V V V V V V V
    ניתן לפתור בעיה מכאיבה זו
    nitan liptor b‘ayah mak’ibah zo
    This troublesome problem can be solved

עלול ‘alul (may): J J J J J A J V N
    הכלב עלול לנשוך
    hakeleb ‘alul linšok
    The dog may bite

כדאי kda’y (worthwhile): R R R R J A R J
    כדאי לשאול האם הדלת עשויה היטב
    kda’y liš’ol ha’im hadelet ‘aśuyah heiṭeb
    It is worth asking whether the door is well built

מוטב mutab (better): R R R R T T V V
    מוטב להיות בשקט ולהנות
    mutab lihyot bešeqeṭ wulhnot
    Better to keep quiet and enjoy

מסוגל msugal (able): J J R J J A J V V
    הוא היה מסוגל לראותו בבית הלבן
    hu’ hayah msugal lir’oto babait halaban
    could envision him sitting in the White House

יכול yakol (can): V V V J V A V V
    אנשים יכולים לתרום תרומות
    ’anašim ykolim litrom trumot
    People can make contributions

אכפת ’ikpat (care/mind): V V U U T T R V R R
    אכפת לך ללכת?
    ’ikpat lka laleket?
    Do you mind going

ראוי ra’uy (should): V R R R R J R J J
    ראוי לשלם על שרות זה
    ra’uy lšalem ‘al šerut zeh
    This service deserves to be paid for

As can be seen, eight different POS tags were suggested by these dictionaries: adJective (29.6%), adveRb (25.9%), Verb (22.2%), Auxiliary verb (8.2%), Noun (4.4%), parTicle (3.7%), Preposition (1.5%), and Unknown (4.5%). The average number of options per word is about 3.3, which is about 60% agreement. For none of the words was there comprehensive agreement, and the POS of only seven words (43.75%) can be determined by voting (i.e., there is one major option).

In the remainder of the paper, we investigate the existence of a modal category in Modern Hebrew, by analyzing the characteristics of these words from a morphological, syntactic, semantic and practical point of view. The decision whether to introduce a modal tag in a Hebrew tagset has practical consequences: we counted that over 3% of the tokens in our 1M token corpus can potentially be tagged as modals. Beyond this practical impact, the decision process illustrates the relevant method through which a tagset can be derived and fine-tuned.

2 Modality in Hebrew

Semantically, Modus is considered to be the attitude on the part of the speaking subject with regard to its content (Ducrot and Todorov, 1972), as opposed to the Dictum, which is the linguistic realization of a predicate. While a predicate is most commonly represented with a verb, modality can be uttered in various manners: adjectives and adverbs (definitely, probable), using thought/belief verbs, mood, intonation, or with modal verbs. The latter are recognized as a grammatical category in many languages (modals), e.g., can, should and must in English.

From the semantic perspective, modality is coarsely divided into epistemic modality (the amount of confidence the speaker holds with reference to the truth of the proposition) and deontic modality (the degree of force exerted on the subject of the sentence to perform the action) (de Haan, 2005).

Modal expressions do not constitute a syntactic class in Modern Hebrew (Kopelovich, 1982). In her work, Kopelovich reviews classic descriptive publications on the syntax of Hebrew and claims that these works (Ornan, Rubinstein, Azar, Rosen, Aronson-Berman and Maschler)³ do not provide a satisfying description or explanation of the matter. In this section we review three major approaches to modality in Hebrew: the first is semantic (Kopelovich), the second is semantic-syntactic (Zadka) and the third is purely morphologico-syntactic (Rosen).

³ For reference see (Kopelovich, 1982), see below for Rosen’s analysis

Kopelovich provides three binary dimensions that describe the modal system in Hebrew: Personal - Impersonal, Modality - Modulation and Objective - Subjective plane. The Personal-Impersonal system is connected to the absence or presence of a surface subject in the clause. A personal modal has a grammatical subject:

(1) דוד צריך להסיע את אמו
    dawid carik lhasi‘ ’et ’imo
    David should to-drive ACC mother-POSS
    David should drive his mother

An impersonal modal has no grammatical subject, and modality predicates the entire clause.

(2) צריך להסיע את אמו לעבודה
    carik lhasi‘ ’et ’imo la‘abodah
    should to-drive ACC mother-POSS to-the-work
    His mother should be driven to work

Kopelovich makes no distinction between the various syntactic categories that the words may belong to, and interchangeably uses examples of words like מותר, יש, אפשר mutar, yeš, ’epšar [adverb, existential, participle respectively].

The Modality-Modulation plane, according to the functional school of Halliday (Halliday, 1985), refers to the interpersonal and ideational functions of language: Modality expresses the speaker’s own mind (epistemic modality - possibility, probability and certainty): עלול לרדת גשם מחר ‘alul laredet gešem maḥar (it may rain tomorrow). Modulation participates in the content clause expressing external conditions in the world (deontic modality - permission, obligation, ability and inclination): אתה יכול להתחיל עכשיו ’ata yakol lhatḥil ‘akšaw (you can start now). Modality does not carry tense and cannot be negated, while modulation can be modified by tense and can be negative.

The Objective-Subjective plane is what Kopelovich calls the perception of the world. Objectivity is achieved using different tenses of to-be in conjunction with the modal (including tense of modal if it is a verb), and their order subjective vs. objective:

(3) דוד היה צריך לנסוע לתל אביב
    dawid haya carik lisw‘ ltel ’abib
    David was have to-drive to-Tel Aviv
    David had to drive to Tel Aviv

(4) כדי להעביר את ההחלטה, צריך היה לכנס את כל העובדים
    kdei lha‘abir ’et hahaḥlata, carik haya lkanes ’et kol ha‘obdim
    In-order to-pass ACC the-decision, should to-assemble ACC all the-employees
    In order to obtain a favorable vote on this decision, all of the employees had to be assembled.

Zadka (1995) defines the class of single-argument bridge verbs⁴, i.e., any verb or pro-verb that can have an infinitive as a subject or object, and that does not accept a subordinate clause as subject or object:

⁴ “Ride Verb” in Zadka’s terminology, פועל רוכב חד־מושאי

(5) אסור לעשן [subject]
    ’asur l‘ašen
    Forbidden to-smoke
    It is forbidden to smoke

(6) הוא רצה/התחיל לשחק [object]
    hu’ racah/htḥil lšaḥeq
    He wanted/started to-play
    He wanted/started to play

(7) יוסף התחיל/עמד/מסוגל לקרוא את הדו״ח במלואו
    Yosep hitḥil/’amad/msugal liqro’ ’et hado”ḥ bimlo’o.
    Yosef began/is-about/is-capable to-read ACC the report entirely.
    Yosef began/is-about/(is-capable) to read (of reading) the report entirely

(8) *יוסף התחיל/עמד/מסוגל שיקרא את הדו״ח במלואו
    *Yosep hitḥil/’amad/msugal šiqra’ ’et hado”ḥ bimlo’o.
    *Yosef started/was-about/is-capable that-he-read ACC the report entirely.

Zadka classifies these verbs into seven semantic categories: (1) Will (2) Manner (3) Aspect (4) Ability (5) Possibility/Certainty (6) Necessity/Obligation and (7) Negation. Categories 1, 4, 5, 6 and 7 are considered by Zadka to include pure modal verbs, e.g., alethic and deontic verbs.

In his paper, Zadka defines classification criteria that refer to syntactic-semantic properties such as: can the infinitive verb be realized as a relative clause, are the subject of the verb and its complement the same, can the infinitive be converted to a gerund, animacy of subject; deep semantic properties – argument structure and selectional restrictions, the ability to drop a common subject of the verb and its complement, factuality level (factual, non-factual, counter-factual); and morphological properties.

Will, Manner and Aspectual verbs as Zadka defines them are not considered modals by Kopelovich since they can be inflected by tense (with the exceptions of אמור ’amur (supposed), עתיד ‘atid (should)).

Ability verbs are יכול yakol (can), מסוגל msugal (can, capable) [participle]. They have both an animate actor as a subject and an infinitive as a complement, with the same deep subject. These verbs are counter-factual.

Certainty verbs include מוכרח mukraḥ (must), צריך carik (should), נאלץ ne’elac (be forced to), יכול yakol (can), הכרחי hekreḥi (necessary), עשוי ‘aśuy (may), עלול ‘alul (might), צפוי capuy (expected). They represent the alethic and epistemic necessity or possibility of the process realized in the clause. None of them can be inflected morphologically. The modal predicates the whole situation in the proposition, and may be subjective (epistemic) or objective (alethic). The subject of these verbs corefers with the subject of the modal:

(9) אני מוכרח לקנות מכונית
    ’ani mukraḥ liqnot mkonit
    I must to-buy car
    I must buy a car

Necessity/Obligation includes adjectives – e.g., חייב ḥayab (must), רשאי raša’y (allowed), gerunds – מוכרח mukraḥ (must), אסור ’asur (forbidden), מותר mutar (allowed) and the verb יכול yakol (can)⁵. Necessity verbs/proverbs present deontic modality, and all clauses share, in Zadka’s view, a causing participant that is not always realized in the surface.

⁵ as well as nouns and prepositions - among them יש and אין yeš and ’ein - according to Zadka

From the morphological point of view, one may characterize impersonals by a non-inflectional behavior, e.g., יש yeš, אין ’ein, מותר mutar, אסור ’asur, אפשר ’epšar, אכפת ’ikpat. None of these words inflect in number and gender with their argument. But this criterion leaves out all of the gender-quantity inflected words, e.g., ראוי ra’uy, מסוגל msugal, עלול ‘alul, אמור ’amur, צריך carik, יכול yakol, which are all classified as modals by Zadka. On the other hand, including all the gender-quantity inflected words with infinitive or relative clause complements as modals will include certain adjectives, e.g., מוסמך musmak (certified), nouns, e.g., זכות zkut (credit), and participles, e.g., נמנע nimna‘ (avoid), as well. It appears that Zadka’s classification relies primarily on semantic criteria.

Rosen (1977, pp. 113-115) defines a syntactic category he calls impersonals. Words in this category occur only as the predicative constituent of a sentence with an infinitive or a subordinate clause argument. Past and future tense is marked with the auxiliary verb היה hayah (to-be). In addition, impersonals cannot function as predicative adjectives: כדאי kda’i (worthwhile), מוטב mutab (better), אכפת ’ikpat (care/mind). Personal reference can be added to the clause (governed by the infinitive) with the ל l dative preposition:

(10) כדאי לי לשתות
    kda’y li lištot
    worthwhile to-me to-drink
    It is worthwhile for me to drink

2.1 Criteria to Identify Modal-like Words in Hebrew

We have reviewed three major approaches to categorizing modals in Hebrew:

Semantic - represented mostly in Kopelovich’s work, modality is categorized by three dimensions of semantic attributes. Since her claim is that there is no syntactic category of modality at all, this approach ‘over-generates’ modals and includes words that from any other syntactic or morphologic view fall into other parts of speech.

Syntactic-semantic - Zadka classifies seven sets of verbs and pro-verbs following syntactic and semantic criteria. His claim is that modality actually is marked by syntactic characteristics, which can be identified by structural criteria. However, his evaluation mixes semantics with syntactic attributes.

Morphological-syntactic - Rosen’s definition of Impersonals is strictly syntactic/morphological and does not try to characterize words with modality. Consequently, words that are usually considered modals are not included in his definition - אסור ’asur (forbidden), מותר mutar (allowed), יכול yakol (can).

3 Proposed Modal Guidelines

The variety of criteria proposed by linguists reflects the disagreements we identified in lexicographic work about modal-like words in Hebrew. For a computational application, all words in a corpus must be tagged. Given the complex nature of modality in Hebrew, should we introduce a modal tag in our tagset, or instead, rely on other existing tags? We have decided to introduce a modal tag in our Hebrew tagset. Although there is no distinct syntactic category for modals in Hebrew, we propose the following criteria: (i) They have an infinitive complement or a clausal complement introduced by the binder ש š. (ii) They are NOT adjectives. (iii) They have irregular inflections in the past tense, i.e., רציתי לדעת raciti lada‘at (I wanted to know) is not a modal usage.

The tests to distinguish modal from non-modal usages are:

• יש and אין, which can also be existential, are used as modals if they can be replaced with צריך.
• Adjectives are gradable and can be modified by מאוד m’od (very) or יותר yoter (more).
• Adjectives can become describers of the nominalized verb: קל להרוס ⇒ ההריסה קלה מאוד qal laharos ⇒ haharisah qala m’od (easy to destroy ⇒ the destruction is easy).
• In all other cases where a verb is serving to convey modality, it is still tagged as a verb, e.g., מובן שיוסי הוא המנצח muban šyosi hu’ hamnaceḥ (it is clear that Yossi is the winner).

We first review how these guidelines help us address some of the most difficult tagging decisions we had to face while building the corpus; we then indicate quantitatively the impact of the modal tag on the practical problem of tagging.

3.1 “What do I care” מה אכפת לי

One of the words tagged as a modal in our corpus - the word אכפת ’ikpat – is not considered thus far to be a modal. However, at least in some of its instances it fits our definition of modal, and it can also be interpreted as modality according to its sense. The only definition that is consistent with our observation is Rosen’s impersonals.

Looking back at its origins, we checked the Historical Lexicon of the Hebrew Language⁶: the word אכפת was used in the medieval period in the Talmud and the Mishna, where it only appears in the following construction:

(11) מה אכפת לך
    mah ’ikpat lk
    what care to-you
    what do you care

⁶ http://hebrew-treasures.huji.ac.il/ an enterprise conducted by the Israeli Academy of the Hebrew Language.

Similarly, in the Ben Yehuda Project – an Israeli version of the Gutenberg project⁷ which includes texts from the Middle Ages up to the beginning of the 20th century – we have found 28 instances of the word, with the very same usage as in older times.

⁷ http://www.benyehuda.org, http://www.gutenberg.org

While trying to figure out its part of speech, we do not identify אכפת as a NOUN, as it cannot have a definite marker ה⁸, and it is not an adjective⁹.

⁸ Although we found on the internet clauses such as נסתמו לי נקבוביות האכפת nistmu li naqbubiyot ha’ikpat (My caring pores got blocked).
⁹ Only its derivatives אכפתי, אכפתיות ’ikpati, ikpatiyut (caring, care) allow adjectival usage.

Traditional Hebrew dictionaries consider אכפת to be an intransitive verb (Kohut, 1926; Even-Shoshan, 2003; Avneyon et al., 2002) or an adverb. Some dictionaries from the middle of the 20th century (Gur, 1946; Knaani, 1960), as well as recent ones (Choueka et al., 1997), did not give it a part of speech at all. In our corpus we found 130 occurrences of the word אכפת, of which 55 have an infinitive/relative clause complement, 35 have a null complement, and 40 have a מ m PP complement: אכפת לו מהמדינה ’ikpat lo mehamdina (he cares for the country). The latter has no modal interpretation. We claim that in this case it should be tagged as a participle (בינוני). The test to tell apart modal and participle is:

(12) אכפת לו לשטוף כלים ⇒ *הוא אכפתי כלפי שטיפת כלים
    ’ikpat lo lištop kelim ⇒ *hu’ ’ikpati klapei šṭipat kelim
    mind him to-wash dishes ⇒ *he concerned for washing dishes
    He minds washing dishes ⇒ *He is concerned about washing dishes

(13) אכפת לו מהעניים ⇒ הוא אכפתי כלפי העניים
    ’ikpat lo meha‘aniyim ⇒ hu’ ikpati klapei ha‘aniyim
    care him of-the-poor-people ⇒ he caring for the-poor-people
    He cares for the poor people ⇒ He is caring for the poor people

All other tests for modality hold in this case: (1) Infinitive/relative clause complement, (2) Not an adjective, (3) Irregular inflection (no inflection at all). To conclude this section, our proposed definition of modals allows us to tag this word in a systematic and complete manner and to avoid the confusion that characterizes this word.

3.2 “It’s really hard” קשה לי

Some of the words tagged as modals are commonly referred to as adjectives, such as אסור, מותר ’asur, mutar (forbidden, allowed), though not everyone agrees: some sources tag these words as adverbs or participles (see Table 1). However, questions are raised of how to tell apart modals as such from adjectives that show very similar properties: קשה לי ללכת qaše li laleket (it is hard for me to walk). Ambar (1995) analyzes the usage of adjectives in modal contexts, especially of ability and possibility. In sentences such as קשה לנו להסתגל לרעש qaše lanu lhistagel lara‘aš (it is hard for us to get used to the noise) the adjective is used in a modal and not an adverbial meaning, in the sense that the meaning of the adverbial בקושי bqwši (with difficulty) and the modal יכול yakol (can) are unified into a single word קשה. Similarly, the possibility sense of קשה is unified with the modal אפשר ’epšar. In any usage of the adjective as the modal, it is not possible to rephrase a clause in a way that the adjective modifies the noun, i.e., the range is the action itself and not its subject.

(14) קשה לבצע את ההסכם
    qaše lbace‘ ’et haheskem
    hard to-perform PREP the-agreement
    It is hard to perform the agreement

(15) *ההסכם קשה
    haheskem qaše
    the-agreement hard
    The agreement is hard

However, following Ambar, there are cases where the usage of קשה ל qaše le is not modal, but an emotional adjective:

(16) קשה/נעים לשוחח איתו
    qaše/na‘im lšoḥeḥ ’ito
    hard/pleasant to-chat with-him
    It is hard/pleasant to chat with him

Berman (1980) classifies subjectless constructions in Modern Hebrew, and distinguishes what she calls dative-marked experientials, where (mostly) an adjective serves as a predicate followed by a dative-marked nominal:

(17) קשה לרינה בחיים
    qaše le-rinah baḥayim
    hard for-Rina in-the-life
    It is hard for Rina in life

Adjectives that allow this construction are circumstantial and do not describe an inner state: רינה עצובה rina ‘acuba (Rina is sad) vs. עצוב לרינה ‘acub lrina (it is sad for Rina). Another recognized construction is the modal expressions that include sentences with dative marking on the individuals to whom the modality is imputed: אסור לנו לדבר ככה ’asur lanu ldaber kakah (we are not allowed to talk like this). Berman suggests that the similarity is due to the perception of the experiencer as recipient in both cases. This suggestion implies that Berman does not categorize the modals (’asur, mutar) as adjectives. Another possible criterion to allow these words to be tagged as modals (following Zadka) is the fact that for Necessity/Obligation modals there exists an “outside force” which is the agent of the modal situation. Therefore, if אסור לנו לדבר ככה ’asur lanu ldaber kakah (we are not allowed to talk like this), this is because someone forbids us from talking, while if קשה לרינה בחיים qaše lrinah baḥayim (It is hard for Rina in life) then no “outside force” is obliged to be the agent which makes her life hard. To conclude, we suggest tagging both ’asur and mutar as modals, and we recommend allowing modal tagging for other possible adjectives in this syntactic structure.

4 Conclusion

We recommend the introduction of a modal POS tag in Hebrew, despite the fact that the set of criteria to identify modal usage is a complex combination of syntactic and morphological constraints. This class covers as many as 3% of the tokens observed in our corpus.

Our main motivation in introducing this tag in our tagset is that the alternative (no modal tag) creates confusion and disagreement: we have shown that both traditional dictionaries and previous computational resources had a high level of disagreement over the class of words we tag as modals. We have confirmed that our guidelines can be applied consistently by human taggers, with agreement level similar to the rest of the tokens (over 99% pairwise). We have checked that our guidelines stand the test of the most difficult disagreement types identified by taggers, such as “care to” and “difficult for”.

Finally, the immediate context of modals includes a high proportion of infinitive words. Infinitive words in Hebrew are particularly ambiguous morphologically, because they begin with the letter ל l, which is a formative letter, and often include the analysis le + participle, e.g., לשמור can be interpreted, depending on context, as lišmwr (to guard), le-šamur (to a guarded), or la-šamur (to the guarded). Other ambiguities might occur too, e.g., לשיר can be interpreted as lašir (to sing), le-šir (to a song), or as la-šir (to the song). We have measured that on average, infinitive verbs in our expanded corpus can be analyzed in 4.9 distinct manners, whereas the overall average for all word tokens is 2.65. The identification of modals can serve as an anchor which helps disambiguate neighboring infinitive words.

References

Ora Ambar. 1995. From modality to an emotional situation. Te‘udah, 9:235–245. (in Hebrew).

Eitan Avneyon, Raphael Nir, and Idit Yosef. 2002. Milon sapir: The Encyclopedic Sapphire Dictionary. Hed Artsi, Tel Aviv. (in Hebrew).

Ruth Berman. 1980. The case of (s)vo language: Subjectless constructions in Modern Hebrew. Language, 56:759–776.

David Carmel and Yoelle S. Maarek. 1999. Morphological disambiguation for Hebrew search systems. In Proceeding of NGITS-99, pages 312–326.

Yaacov Choueka, Uzi Freidkin, Hayim A. Hakohen, and Yael Zachi-Yannay. 1997. Rav Milim: A Comprehensive Dictionary of Modern Hebrew. Stimatski, Tel Aviv. (in Hebrew).

Ferdinand de Haan. 2005. Typological Approaches to Modality in Approaches to Modality, pages 27–69. Mouton de Gruyter, Berlin.

Oswald Ducrot and Tzvetan Todorov. 1972. Dictionnaire encyclopédique des sciences du langage. Éditions de Seuil, Paris.

Avraham Even-Shoshan. 2003. Even Shoshan’s Dictionary - Renewed and Updated for the 2000s. Am Oved, Kineret, Zmora-Bitan, Dvir and Yediot Aharonot. (in Hebrew).

Yehuda Gur. 1946. The Hebrew Language Dictionary. Dvir, Tel Aviv. (in Hebrew).

M. A. K. Halliday. 1985. An introduction to functional grammar. Edward Arnold, USA, second edition.

Yaakov Knaani. 1960. The Hebrew Language Lexicon. Masada, Jerusalem. (in Hebrew).

Alexander Kohut. 1926. Aruch Completum auctore Nathane filio Jechielis. Hebraischer Verlag - Menorah, Wien-Berlin. (in Hebrew).

Ziona Kopelovich. 1982. Modality in Modern Hebrew. Ph.D. thesis, University of Michigan.

Uzi Ornan. 2002. Hebrew in Latin script. Lĕšonénu, LXIV:137–151. (in Hebrew).

Haiim B. Rosen. 1977. Contemporary Hebrew. Mouton, The Hague, Paris.

Beatrice Santorini. 1995. Part-of-speech tagging guidelines for the Penn Treebank Project. 3rd revision. Technical report, Department of Computer and Information Science, University of Pennsylvania.

Erel Segal. 2000. Hebrew morphological analyzer for Hebrew undotted texts. Master’s thesis, Technion, Haifa, Israel. (in Hebrew).

Khalil Sima’an, Alon Itai, Alon Altman, Yoad Winter, and Noa Nativ. 2001. Building a tree-bank of modern Hebrew text. Journal Traitement Automatique des Langues (t.a.l.). Special Issue on NLP and Corpus Linguistics.

Shlomo Yona. 2004. A finite-state based morphological analyzer for Hebrew. Master’s thesis, Haifa University.

Yitzhak Zadka. 1995. The single object “rider” verb in current Hebrew: Classification of modal, adverbial and aspectual verbs. Te‘udah, 9:247–271. (in Hebrew).
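The voting analysis of Table 1 reported in the Introduction (number of distinct tags per word, and whether one tag can be chosen by voting) can be illustrated with a short sketch. The four rows below are copied from Table 1; the function names and the strict-majority reading of "one major option" are assumptions made for illustration, not the procedure the authors necessarily used.

```python
from collections import Counter

# Tags from sources (1)-(8) for four of the sixteen words in Table 1.
# Only rows with exactly eight tags are used here.
TABLE_ROWS = {
    "yeš": "R N N R N A R V",
    "ḥayab": "J J J J J J J V",
    "mutar": "R N J R J A V J",
    "nitan": "U V V V V V V V",
}


def vote(tags):
    """Return the tag held by a strict majority of sources, or None.

    Assumption: "one major option" is read here as more than half of
    the sources agreeing on the same tag.
    """
    top_tag, top_count = Counter(tags).most_common(1)[0]
    if top_count * 2 > len(tags):
        return top_tag
    return None


def summarize(rows):
    """Map each word to (number of distinct tags, majority tag or None)."""
    summary = {}
    for word, tag_string in rows.items():
        tags = tag_string.split()
        summary[word] = (len(set(tags)), vote(tags))
    return summary


if __name__ == "__main__":
    for word, (n_distinct, winner) in summarize(TABLE_ROWS).items():
        print(word, n_distinct, winner)
```

Under this reading, words such as ḥayab and nitan have a clear winner, while yeš and mutar do not. A looser definition of "major option" (for example, a plurality) would change which words count as decidable by voting.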