Can You Tag the Modal? You Should
Yael Netzer, Meni Adler, David Gabay and Michael Elhadad
Ben Gurion University of the Negev
Department of Computer Science
POB 653 Be’er Sheva, 84105, Israel
(yaeln|adlerm|gabayd|elhadad)@cs.bgu.ac.il

Abstract

Computational linguistics methods are typically first developed and tested in English. When applied to other languages, assumptions from English data are often applied to the target language. One of the most common such assumptions is that a “standard” part-of-speech (POS) tagset can be used across languages with only slight variations. We discuss in this paper a specific issue related to the definition of a POS tagset for Modern Hebrew, as an example to clarify the method through which such variations can be defined. It is widely assumed that Hebrew has no syntactic category of modals. There is, however, an identified class of words which are modal-like in their semantics, and can be characterized through distinct syntactic and morphologic criteria. We have found wide disagreement among traditional dictionaries on the POS tag attributed to such words. We describe three main approaches when deciding how to tag such words in Hebrew. We illustrate the impact of selecting each of these approaches on agreement among human taggers, and on the accuracy of automatic POS taggers induced for each method. We finally recommend the use of a “modal” tag in Hebrew and provide detailed guidelines for this tag. Our overall conclusion is that tagset definition is a complex task which deserves appropriate methodology.

1 Introduction

In this paper we address one linguistic issue that was raised while tagging a Hebrew corpus for part of speech (POS) and morphological information. Our corpus is comprised of short news stories. It includes roughly 1,000,000 tokens, in articles of typical length between 200 and 1000 tokens. The articles are written in a relatively simple style, with a high token/word ratio. Of the full corpus, a sample of articles comprising altogether 100,000 tokens was assembled at random and manually tagged for part of speech. We employed four students as taggers. An initial set of guidelines was first composed, relying on the categories found in several dictionaries and on the Penn treebank POS guidelines (Santorini, 1995). Tagging was done using an automatic tool¹. We relied on existing computational lexicons (Segal, 2000; Yona, 2004) to generate candidate tags for each word. As many words from the corpus were either missing or tagged in a non-uniform manner in the lexicons, we recommended looking up missing words in traditional dictionaries. Disagreement was also found among copyrighted dictionaries, both for open and closed set categories. Given the lack of a reliable lexicon, the taggers were not given a list of options to choose from, but were free to tag with whatever tag they found suitable. The process, although slower and bound to produce unintentional mistakes, was used for building a lexicon, and to refine the guidelines and on occasion modify the POS tagset. When constructing and then amending the guidelines we sought the best trade-off between accuracy and meaningfulness of the categorization, and simplicity of the guidelines, which is important for consistent tagging.

¹ http://wordfreak.sourceforge.net

Initially, each text was tagged by four different people, and the guidelines were revised according to questions or disagreements that were raised. As the guidelines became more stable and the disagreement rate decreased, each text was tagged by three people only, and eventually by two taggers and a referee who reviewed disagreements between the two. The disagreement rate between any two taggers was initially as high as 20%, and dropped to 3% after a few rounds of tagging and revising the guidelines.

The major sources of disagreement identified include:

Prepositional phrases vs. prepositions: In Hebrew, formative letters (ב,כ,ל,מ; b,c,l,m²) can be attached to a noun to create a short prepositional phrase. In some cases, such phrases function as a preposition and the original meaning of the noun is not clearly felt. Some taggers would tag the word as a prepositional prefix + noun, while others tagged it as a preposition, e.g., בעקבות b’iqbot (following), which can also be tagged as ב־עקבות b-iqbot (in the footsteps of).

Adverbial phrases vs. adverbs: The problem is similar to the one above, e.g., בדיוק bdiyuq (exactly) can be tagged as b-diyuq (with accuracy).

Participles vs. adjectives: As both categories can modify nouns, it is hard to distinguish between them, e.g., מבט מאיים mabaṭ m’ayem (a threatening stare), where the category of מאיים m’ayem is unclear.

² Transcription according to (Ornan, 2002)

Another problem, on which the remainder of the article focuses, was a set of words that express modality and commonly appear before verbs in the infinitive. Such words were tagged as adjectives or adverbs, and the taggers were systematically uncertain about them.

Besides the disagreement among taggers, there was also significant disagreement among the Modern Hebrew dictionaries we examined, as well as among computational analyzers and annotated corpora. Table 1 lists the various selected POS tags for these words, as determined by: (1) Rav Milim (Choueka et al., 1997), (2) Sapir (Avneyon et al., 2002), (3) Even-Shoshan (Even-Shoshan, 2003), (4) Knaani (Knaani, 1960), (5) HMA (Carmel and Maarek, 1999), (6) Segal (Segal, 2000), (7) Yona (Yona, 2004), (8) Hebrew Treebank (Sima’an et al., 2001).

As can be seen, eight different POS tags were suggested by these dictionaries: adJective (29.6%), adveRb (25.9%), Verb (22.2%), Auxiliary verb (8.2%), Noun (4.4%), parTicle (3.7%), Preposition (1.5%), and Unknown (4.5%). The average number of options per word is about 3.3, which corresponds to about 60% agreement. For none of the words was there full agreement, and the POS of only seven words (43.75%) can be determined by voting (i.e., there is one major option).

In the remainder of the paper, we investigate the existence of a modal category in Modern Hebrew by analyzing the characteristics of these words from a morphological, syntactic, semantic and practical point of view. The decision whether to introduce a modal tag in a Hebrew tagset has practical consequences: we counted that over 3% of the tokens in our 1M token corpus can potentially be tagged as modals. Beyond this practical impact, the decision process illustrates the relevant method through which a tagset can be derived and fine-tuned.

2 Modality in Hebrew

Semantically, Modus is considered to be the attitude on the part of the speaking subject with regard to its content (Ducrot and Todorov, 1972), as opposed to the Dictum, which is the linguistic realization of a predicate. While a predicate is most commonly represented with a verb, modality can be uttered in various manners: adjectives and adverbs (definitely, probable), thought/belief verbs, mood, intonation, or modal verbs. The latter are recognized as a grammatical category in many languages (modals), e.g., can, should and must in English.

From the semantic perspective, modality is coarsely divided into epistemic modality (the amount of confidence the speaker holds with reference to the truth of the proposition) and deontic modality (the degree of force exerted on the subject of the sentence to perform the action) (de Haan, 2005).

Modal expressions do not constitute a syntactic class in Modern Hebrew (Kopelovich, 1982). In her work, Kopelovich reviews classic descriptive

Word | Example | 1 2 3 4 5 6 7 8
יש yeš (should) | יש לשים לב לניסוח, yeš laśim leb lanisuḥ, "Attention should be paid to the wording" | R N N R N A R V
אין ’ein (shouldn’t) | אין לשים לב לניסוח, ’ein laśim leb lanisuḥ, "Attention should not be paid to the wording" | N R U U P P R V R
חייב ḥayab (must) | הציבור חייב להבין את העניין, hacibur ḥayab lhabin ’et ha‘inyan, "The public should be made aware of this issue" | J J J J J J J V
מותר mutar (allowed) | מותר לה לצאת לטיול, mutar lah lacet lṭiyul, "She is allowed to go on a trip" | R N J R J A V J
אסור ’asur (forbidden) | אסור לה לצאת לטיול ביום ראשון, ’asur lah lacet lṭiyul byom rišon, "She is not allowed to go on a trip on Sunday" | J R R R R J A J V
אפשר ’epšr (may) | אפשר לרמוז רמזים, ’epšr lirmoz rmazim, "Giving hints is allowed" | U R R R T A R V
אמור ’amur (supposed) | נשים אמורות ללבוש רעלות, našim ’amurot lilboš r‘alot, "Women are supposed to wear veils" | J A J J J A J V
צריך carik (should) | במו״מ צריך לעמוד על שלך, bmw”m carik la‘amod ‘al šelka, "In negotiation you should keep strong" | J J R J J A J V
ניתן nitan (can) | ניתן לפתור בעיה מכאיבה זו, nitan liptor b‘ayah mak’ibah zo, "This troublesome problem can be solved" | U V V V V V V V
עלול ‘alul (may) | הכלב עלול לנשוך, hakeleb ‘alul linšok, "The dog may bite" | J J J J J A J V N
כדאי kda’y (worthwhile) | כדאי לשאול האם הדלת עשויה היטב, kda’y liš’ol ha’im hadelet ‘aśuyah heiṭeb, "It is worth asking whether the door is well built" | R R R R J A R J
מוטב mutab | מוטב להיות בשקט ולהנות, mutab lihyot bešeqet .