
Determining Case in Arabic: Learning Complex Linguistic Behavior Requires Complex Linguistic Features

Nizar Habash†, Ryan Gabbard‡, Owen Rambow†, Seth Kulick‡ and Mitch Marcus‡

†Center for Computational Learning Systems, Columbia University, New York, NY, USA
{habash,rambow}@cs.columbia.edu
‡Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, USA
{gabbard,skulick,mitch}@cis.upenn.edu

Abstract

This paper discusses automatic determination of case in Arabic. This task is a major source of errors in full diacritization of Arabic. We use a gold-standard syntactic tree, and obtain an error rate of about 4.2%, with a machine learning based system outperforming a system using hand-written rules. A careful error analysis suggests that when we account for annotation errors in the gold standard, the error rate drops to 0.8%, with the hand-written rules outperforming the machine learning-based system.

1 Introduction

In Modern Standard Arabic (MSA), all nouns and adjectives have one of three cases: nominative (NOM), accusative (ACC), or genitive (GEN). What sets case in MSA apart from case in other languages is most saliently the fact that it is usually not marked in the orthography, as it is written using diacritics which are normally omitted. In fact, in a recent paper on diacritization, Habash and Rambow (2007) report that word error rate drops 9.4% absolute (to 5.5%) if the word-final diacritics (which include case) need not be predicted. Similar drops have been observed by other researchers (Nelken and Shieber, 2005; Zitouni et al., 2006). Thus, we can deduce that tagging-based approaches to case identification are limited in their usefulness, and if we need full diacritization for subsequent processing in a natural language processing (NLP) application (say, language modeling for automatic speech recognition (Vergyri and Kirchhoff, 2004)), we need to perform more complex syntactic processing to restore case diacritics. Options include using the output of a parser in determining case.

An additional motivation for investigating case in Arabic comes from treebanking. Native speakers of Arabic are in fact native speakers of one of the Arabic dialects, all of which have lost case (Holes, 2004). They learn MSA in school, and have no native-speaker intuition about case. Thus, determining case in MSA is a hard problem for everyone, including treebank annotators. A tool to catch case-related errors in treebanking would be useful.

In this paper, we investigate the problem of determining the case of nouns and adjectives in syntactic trees. We use gold standard trees from the Arabic Treebank (ATB). We see our work using gold standard trees as a first step towards developing a system for restoring case to the output of a parser. The complexity of the task justifies an initial investigation based on gold standard trees. And of course, the use of gold standard trees is justified for our other objective, helping quality control for treebanking.

The study presented in this paper shows the importance of what has been called "feature engineering" and the issue of representation for machine learning. Our initial machine learning experiments use features that can be read off the ATB phrase structure trees in a straightforward manner. The literature on case in MSA (prescriptive and descriptive sources) reveals that case assignment in Arabic does not always follow standard assumptions about predicate-argument structure, which is what the ATB annotation is based on. Therefore, we transform the ATB so that the new representation is based entirely on case assignment, not predicate-argument structure. The features for machine learning that can now be read off from the new representation yield much better results. Our results show that we can determine case with an error rate of 4.2%. However, our results would have been impossible without a deeper understanding of the linguistic phenomenon of case and a transformation of the representation oriented towards this phenomenon.

Using either underlying representation, machine learning performs better than hand-written rules. However, a closer look at the errors made by the machine learning-derived classifier and the hand-written rules reveals that most errors are in fact treebank errors (between 69% and 86% of all errors for the machine learning-derived classifier and the hand-written rules, respectively). Furthermore, the machine learning classifier agrees more often with treebank errors than the hand-written rules do. This fact highlights the problem of machine learning (garbage in, garbage out), but holds out the prospect for improvement in the machine learning based classifier as the treebank is checked for errors and re-released.

In the next section, we describe all relevant linguistic facts of case in Arabic. Section 3 details the resources used in this research. Section 4 describes the preprocessing done to extract the relevant linguistic features from the ATB. Sections 5 and 6 detail the two systems we compare. Sections 7 and 8 present results and an error analysis of the two systems. And we conclude with a discussion of our findings in Section 9.

2 Linguistic Facts

All Arabic nominals (common nouns, proper nouns, adjectives and adverbs) are inflected for case, which has three values in Arabic: nominative (NOM), accusative (ACC) or genitive (GEN). We know this from case agreement facts, even though the morphology and/or orthography do not necessarily always make the case realization overt. We discuss morphological and syntactic aspects of case in MSA in turn.

2.1 Morphological Realization of Case

The realization of nominal case in Arabic is complicated by its orthography, which uses optional diacritics to indicate short vowel case morphemes, and by its morphology, which does not always distinguish between all cases. Additionally, case realization in Arabic interacts heavily with the realization of definiteness, leading to different realizations depending on whether the nominal is indefinite, i.e., receiving nunation, definite through the determiner Al+, or definite through being the governor of an idafa possessive construction. Most details of this interaction are outside the scope of this paper, but we discuss it as much as it helps clarify issues of case.

Buckley (2004) describes eight different classes of nominal case expression, which we briefly review. We first discuss the realization of case in morphologically singular nouns (including broken, i.e., irregular, plurals). Triptotes are the basic class, which expresses the three cases in the singular using the three short vowels of Arabic:(1) NOM is +u, ACC is +a, and GEN is +i. The corresponding nunated forms for these three diacritics are +ũ for NOM, +ã for ACC, and +ĩ for GEN. Nominals not ending with Ta Marbuta (h̄) or Alif (A') receive an extra Alif in the accusative indefinite case (e.g., kitAbAã ‘book’ versus kitAbah̄ã ‘writing’).

(1) All Arabic transliterations are provided in the Habash-Soudi-Buckwalter transliteration scheme (Habash et al., 2007). This scheme extends Buckwalter's transliteration scheme (Buckwalter, 2002) to increase its readability while maintaining the 1-to-1 correspondence with Arabic orthography as represented in standard encodings of Arabic, i.e., Unicode, CP-1256, etc. The following are the only differences from Buckwalter's scheme (which is indicated in parentheses): Ā (|), Â (>), ŵ (&), Ǎ (<), ŷ (}), h̄ (p), θ (v), ð (*), š ($), Ď (Z), ς (E), γ (g), ý (Y), ã (F), ũ (N), ĩ (K).
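To make the triptote paradigm above concrete, here is a minimal sketch (ours, not part of the paper's systems; the function name and data layout are illustrative) of the endings just described:

```python
# Illustrative sketch (not from the paper): singular triptote endings,
# with the nunated vowels written as the Habash-Soudi-Buckwalter
# characters for u-, a- and i-nunation.
PLAIN = {"NOM": "u", "ACC": "a", "GEN": "i"}
NUNATED = {"NOM": "ũ", "ACC": "ã", "GEN": "ĩ"}

def triptote_ending(case, indefinite, stem):
    """Return the word-final ending of a singular triptote nominal."""
    if not indefinite:
        return PLAIN[case]
    ending = NUNATED[case]
    # The accusative indefinite adds an extra Alif unless the stem already
    # ends in Ta Marbuta (h̄) or Alif (A'): kitAb -> kitAbAã,
    # but kitAbah̄ -> kitAbah̄ã.
    if case == "ACC" and not stem.endswith(("h̄", "A'")):
        ending = "A" + ending
    return ending

assert triptote_ending("ACC", True, "kitAb") == "Aã"   # kitAbAã 'book'
assert triptote_ending("ACC", True, "kitAbah̄") == "ã"  # kitAbah̄ã 'writing'
```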

Diptotes are like triptotes except that when they are indefinite, they do not express nunation, and they use the +a suffix for both ACC and GEN. The class of diptotes is lexically specific. It includes nominals with specific meanings or morphological patterns (colors, elatives, specific broken plurals, some proper names with a Ta Marbuta ending, or location names devoid of the definite article). Examples include bayruwt ‘Beirut’ and Âazraq ‘blue’.

The next three classes are less common. The invariables show no case in the singular (e.g. nominals ending in long vowels: suwryA ‘Syria’ or ðikraý ‘memoir’). The indeclinables always use the +a suffix to express case in the singular and allow for nunation (maςnaýã ‘meaning’). The defective nominals, which are derived from roots with a final radical glide (y or w), look like triptotes except that they collapse NOM and GEN into the GEN form, which also involves losing their final glide: qADĩ (NOM,GEN) versus qADiyAã (ACC) ‘a judge’.

For the dual and sound plural, the situation is simpler, as there are no lexical exceptions. The duals and masculine sound plurals express number, case and gender jointly in single morphemes that are identifiable even if undiacritized: kAtib+uwna ‘writers’ (masc.pl., NOM), kAtib+Ani ‘writers’ (masc.du., NOM), kAtib+atAni ‘writers’ (fem.du., NOM). The ACC and GEN forms are identical, e.g., kAtib+iyna ‘writers’ (masc.pl., ACC,GEN). Finally, the dual and masculine sound plural do not express nunation. On the other hand, the feminine sound plural marks nunation explicitly, and all of its case morphemes are written only as diacritics, e.g., kAtib+At+u ‘writers’ (fem.pl., NOM).

2.2 Syntax of Case

Traditional Arabic grammar makes a distinction between verbal clauses and nominal clauses. Verbal clauses are verb-initial sentences, and we (counter to the Arabic grammatical tradition) include copula-initial clauses in this group. The copula is kAn ‘to be’ or one of her sisters. Nominal clauses begin with a topic (which is always a nominal), and continue with a complement which is either a verbal clause, a nominal predicate, or a prepositional predicate. If the complement of a topic is a verbal clause, an inflectional subject morpheme or a resumptive object clitic pronoun replaces the argument which has become the topic.

The Arabic case system falls within the class of nominative-accusative languages (as opposed to ergative-absolutive languages). Some of the common behavior of case in Arabic with other languages includes:(2)

• NOM is assigned to subjects of verbal clauses, as well as other nominals in headings, titles and quotes.

• ACC is assigned to (direct and indirect) objects of verbal clauses, verbal nouns, or active participles; to subjects of small clauses governed by other verbs (i.e., "exceptional case marking" or "raising to object" contexts; we remain agnostic on the proper analysis); adverbs; and certain interjections, such as šukrAã ‘Thank you’.

• GEN is assigned to objects of prepositions and to possessors in the idafa (possessive) construction.

• There is a distinction between case-by-assignment and case-by-agreement. In case-by-assignment, a specific case is assigned to a nominal by its case assigner; whereas in case-by-agreement, the modifying or conjoined nominal copies the case of its governor.

Arabic case differs from case in other languages in the following conditions, which relate to nominal clauses and numbers (two of these conditions are sketched in code below).

• The topic (independently of its grammatical function) is ACC if it follows the subordinating conjunction Ǎin∼a (or any of her "sisters": liÂan∼a, kaÂan∼a, lakin∼a, etc.). Otherwise, the topic is NOM.

• Nominal predicates are ACC if they are governed by the overt copula. They are also ACC if they are objects of verbs that take small clause complements (such as ‘to consider’), unless the predicate is introduced by a subordinating conjunction. In all other cases, they are NOM.

• In constructions involving a nominal and a number (Eišruwna kAtibAã ‘twenty writers’), the head of the phrase for case assignment is the number, which receives whichever case the context assigns. The case of the nominal depends on the number. If the number is between 11 and 99, the nominal is ACC by tamyiz (lit. "specification"). Otherwise, the nominal is GEN by idafa.

(2) Buckley (2004) describes in detail the conditions for each of the three cases in Arabic. He considers NOM to be the default case. He specifies seven conditions for NOM, 25 for ACC and two for GEN. Our summary covers the same ground as his description except that we omit the vocative use of nominals.
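The sketch promised above (ours; all names are hypothetical) transcribes the topic condition and the number condition directly into code:

```python
# Illustrative sketch (ours) of two MSA-specific case conditions.

def topic_case(follows_inna: bool) -> str:
    # The topic is ACC after Ǎin∼a or one of her sisters, NOM otherwise.
    return "ACC" if follows_inna else "NOM"

def counted_nominal_case(number: int) -> str:
    # The number itself takes whatever case the context assigns; the
    # counted nominal is ACC by tamyiz for 11-99, GEN by idafa otherwise.
    return "ACC" if 11 <= number <= 99 else "GEN"

assert counted_nominal_case(20) == "ACC"   # Eišruwna kAtibAã 'twenty writers'
assert counted_nominal_case(200) == "GEN"
```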

3 The Data

We use the third section of the current version of the Arabic Treebank released by the Linguistic Data Consortium (LDC) (Maamouri et al., 2004). We use the division into training and devtest corpora proposed by Zitouni et al. (2006), further dividing their devtest set into two equal parts to give us a development and a test set. The training set has approximately 367,000 words, and the development and test sets each have about 33,000 words. In our training data, of 133,250 case-marked nominals, 66.4% are GEN, 18.5% ACC, and 15.1% NOM.

The ATB annotation in principle indicates for each nominal its case and the corresponding realization (including diacritics). The only systematic exception is that invariables are not marked at all with their unrealized case, and are marked as having NOCASE. We exclude all nominals marked NOCASE from our evaluations, as we believe that these nominals actually do have case, it is just not marked in the treebank, and we do not wish to predict the morphological realization, only the underlying case. In reporting results, we use accuracy on the number of nominals whose case is given in the treebank.

While the ATB does not contain explicit information about headedness in its phrase structure, we can say that the syntactic annotations in the ATB are roughly based on predicate-argument structure. For example, for the structure shown in Figure 1, the "natural" interpretation is that the head is AHtrAqu ‘burning’, with a modifier mnzlAã ‘house’, which in turn is modified by a QP whose head is (presumably) the number 20, which is modified by Akθri ‘more’ and mn ‘than’. This dependency structure is shown on the left in Figure 2.

[Figure 1: The representation of numbers in the Arabic Treebank, for a subject NP meaning ‘the burning of more than 20 houses’. The NP is headed by the noun AHtrAqu ‘burning’ (NOM), with mnzlAã ‘house’ (ACC) and a QP containing Akθri ‘more’ (GEN), the preposition mn ‘than’, and the numeral 20 (GEN).]

Another annotation detail relevant to this paper is that the ATB marks the topic of a nominal clause as "SBJ" (i.e., as a subject) except when the predicate is a verbal clause; then it is marked as TPC. We consider these two cases to be the same case and relabel all such cases as TPC.

4 Determining the Case Assigner

Case assignment is a relationship between two words: one word (the case governor or assigner) assigns case to the other word (the case assignee). Because case assignment is a relationship between words, we switch to a dependency-based version of the treebank. There are many possible ways to transform a phrase structure representation into a dependency representation; we explore two such conversions in the context of this paper. Note that if we had used the Prague Arabic Dependency Treebank (Smrž and Hajič, 2006) instead of the ATB, we would not have had to convert to dependency, but we still would have had to analyze whether the dependencies are the ones we need for modeling case assignment, possibly having to restructure the dependencies.

For determining the dependency relations that determine case assignment, we start out by using a standard head percolation algorithm with the following parameters: Verbs head all the arguments in VPs; prepositions head the PP arguments; and the first nominal in an NP or ADJP heads those structures. Non-verbal predicates (NPs, ADJPs or PPs) head their subjects (topics). The subordinating conjunction Ǎin∼a is governed by what follows it. The overt copula kAn governs both topic and predicate. Conjunctions are headed by what they follow and head what they precede (with the exception of the common sentence initial conjunction w+ ‘and’, which is headed by the sentence it introduces). We will call the result of this algorithm the Basic Case Assigner Identification Algorithm, or Basic Representation for short.

After initial experiments with both hand-written rules and machine learning, we extend the Basic Representation in order to account for the special case assigning properties of numbers in Arabic, by adding additional head percolation parameters and restructuring rules to handle the structure of NPs in the ATB. This is because the current ATB representation is not useful in some cases for representing case assignment. Consider the structure in Figure 1. Here, the head of the NP is the noun AHtrAqu ‘burning’, which has NOM because the NP is a subject (the verb is not shown). The QP's first member, Akθri ‘more’, is GEN because it is in an idafa construction with the noun AHtrAqu. Akθri is modified by the preposition mn ‘than’, which assigns GEN to the number 20 (which is written in Arabic numerals and thus does not show any case at all). The noun mnzlAã ‘house’ is in a tamyiz relation with the number 20, which governs it, and thus it is ACC. It is clear that the phrase structure chosen for the ATB does not represent these case-assignment relations in a direct manner.

To create the appropriate head relations for case determination, we flatten all QPs and use a set of simple deterministic rules to create the more appropriate structure which expresses the chain of case assignments. In our development set, 5.8% of words get a new head using this new head assignment. We call this new representation the Revised Representation. Figure 2 shows the dependency representation corresponding to the phrase structure in Figure 1.

[Figure 2: Two possible dependency trees for the phrase structure tree in Figure 1, meaning ‘burning of more than 20 houses’; the tree on the left, our Basic Representation, represents a standard predicate-argument-modification style tree (AHtrAqu NOM > mnzlAã ACC > 20 GEN > {Akθri GEN, mn}), while the tree on the right represents the chain of case assignment and is our Revised Representation (AHtrAqu NOM > Akθri GEN > mn > 20 GEN > mnzlAã ACC).]

We make use of all dash-tags provided by the ATB as arc labels, and we extend the label set to explicitly mark objects of prepositions (POBJ), possessors in the idafa construction (IDAFA), conjuncts (CONJ) and conjunctions (CC), and the accusative specifier, tamyiz (TMZ). All other modifications receive the label MOD.

5 Hand Written Rules

Our first system is based on hand-written rules (henceforth, we refer to this system as the rule-based system). We add two features to nominals in the tree: (1) we identify if a word governs a subordinating conjunction Ǎin∼a or any of its sisters; and (2) we also identify if a topic of a nominal sentence has an Ǎin∼a sibling.

The following are the simple hand-written rules we use (a code sketch of the cascade follows the list):

• RULE 1: The default case assigned is ACC for all words.

• RULE 2: Assign NOM to nominals heading the tree and those labeled HLN (headline) or TTL (title).

• RULE 3: Assign GEN to nominals with the labels POBJ or IDAFA.

• RULE 4: Assign NOM to nominals with the label PRD if NOT headed by a verbal (verb or deverbal noun) or if it has an Ǎin∼a child.

• RULE 5: Assign NOM to nominal topics that do not have an Ǎin∼a sibling.

• RULE 6: All case-unassigned children of nominal parents (and conjunctions), whose label is MOD, CONJ or CC, copy the case of their parent. Conjunctions carry the case temporarily to pass on agreement. Verbs do not pass on agreement.

The first rule is applied to all nodes. The second to fifth rules are case-by-assignment rules applied in an if-else fashion (no overwriting is done). The last rule is a case-by-agreement rule. All non-nominals receive the case NA.
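Because the rules form a strict cascade, they transcribe almost directly into code. The following sketch is ours, not the authors' implementation; the node attributes are hypothetical stand-ins for the arc labels and Ǎin∼a features described above, and Rule 6's restriction to nominal parents and conjunctions is elided for brevity:

```python
# Illustrative sketch (ours) of the six-rule cascade over a dependency
# tree; `label`, `is_nominal`, `head_is_verbal`, `governs_inna`, and
# `has_inna_sibling` are hypothetical node attributes.

def rule_based_case(node, root):
    if not node.is_nominal:
        return "NA"
    if node is root or node.label in ("HLN", "TTL"):           # Rule 2
        return "NOM"
    if node.label in ("POBJ", "IDAFA"):                        # Rule 3
        return "GEN"
    if node.label == "PRD" and (not node.head_is_verbal
                                or node.governs_inna):         # Rule 4
        return "NOM"
    if node.label == "TPC" and not node.has_inna_sibling:      # Rule 5
        return "NOM"
    if node.label in ("MOD", "CONJ", "CC"):                    # Rule 6:
        return rule_based_case(node.parent, root)              # copy parent
    return "ACC"                                               # Rule 1
```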

6 Machine Learning Experiments: The Statistical System

Our second system uses statistical machine learning. This system consists of a core model and an agreement model, both of which are linear classifiers trained using the maximum entropy technique. We implement this system using the MALLET toolbox (McCallum, 2002). The core model is used to classify all words whose label in the dependency representation is not MOD (case-by-assignment), whereas the agreement model is used to classify all words whose label is MOD (case-by-agreement). We handle conjunctions in the statistical system differently from the rule-based system: we resolve conjunctions so that conjoined words are labeled exactly the same. For example, in John and Mary went to the store, both John and Mary would have the subject label, even though Mary has a conjunction label in the raw dependency tree. Both models are trained only on those words which are marked for case in the treebank.

6.1 The Core Model

The core model uses the following features of a word:

• the word's POS tag;

• the conjunction of the word's POS tag and its arc label;

• the word's last length-one and length-two suffixes (to model written case morphemes);

• the conjunction of the word's arc label, its POS tag, and its parent's POS tag;

• if the word is the object of a preposition, the preposition it is the object of;

• whether the word is a PRD child of a verb (with the identity of that verb conjoined if so);

• if the word has a sister which is a subordinating conjunction, and if so, that conjunction conjoined with its arc label;

• whether the word is in an embedded clause, conjoined with its arc label under the verb of the embedded clause;

• if the word is a PRD child of a verb, the verb;

• the word's left sister's POS tag conjoined with this word's arc label and its sister's arc label;

• whether the word's sister depends on the word or something else;

• and the left sister's terminal symbol.

Arabic words which do not overtly show case still have their case determined for purposes of resolving agreement; the classifier is applied to these cases at run-time anyway.

6.2 The Agreement Model

The agreement model uses the following features of a word:

• the word itself;

• the word's last length-one and length-two suffixes;

• and the conjunction of the word's POS tag and the case of what it agrees with.

Since words may get their case by agreement with other words which themselves get their case by agreement, the agreement model is applied repeatedly until case has been determined for all words.
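This repeated application amounts to a fixed-point computation over the tree. The following decoding sketch is ours (the classifier interfaces are hypothetical); it assumes every agreement chain eventually reaches a word labeled by the core model:

```python
# Illustrative decoding sketch (ours): the core model labels the
# case-by-assignment words, then the agreement model is reapplied until
# every MOD word has copied a case from its (possibly MOD) governor.

def decode(words, core_model, agreement_model):
    case = {w: core_model.predict(w) for w in words if w.label != "MOD"}
    pending = [w for w in words if w.label == "MOD"]
    while pending:
        # Each pass resolves the MOD words whose governor is now decided;
        # chains of agreeing words resolve over successive passes.
        unresolved = []
        for w in pending:
            if w.governor in case:
                case[w] = agreement_model.predict(w, case[w.governor])
            else:
                unresolved.append(w)
        pending = unresolved
    return case
```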

7 Results

The performance of our two systems on the test data set is shown in Table 1. There are three points to note: first, even in the basic representation, the statistical system reduces error over the rule-based system by 7.7%. Second, the revised representation helps tremendously, resulting in a 13.8% reduction in error for the rule-based system and 30% for the statistical system. Finally, the statistical system gains much more than the rule-based system from the improved representation, increasing the gap between them to a 25% reduction in error.

System       Basic   Revised
Rule-based   93.5    94.4
Statistical  94.0    95.8

Table 1: Accuracies of various approaches on the test set in both basic and revised dependency representations.
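For concreteness, these relative reductions can be recomputed from the accuracies in Table 1, reading the error rate as 100 minus accuracy (a derivation we add here for clarity):

```latex
% Error rates from Table 1: rule-based 6.5 (basic), 5.6 (revised);
% statistical 6.0 (basic), 4.2 (revised).
\begin{align*}
\tfrac{6.5-6.0}{6.5} &\approx 7.7\%  && \text{(basic: statistical vs. rule-based)}\\
\tfrac{6.5-5.6}{6.5} &\approx 13.8\% && \text{(rule-based: revised vs. basic)}\\
\tfrac{6.0-4.2}{6.0} &= 30\%         && \text{(statistical: revised vs. basic)}\\
\tfrac{5.6-4.2}{5.6} &= 25\%         && \text{(revised: statistical vs. rule-based)}
\end{align*}
```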

8 Error Analysis

We took a sample of 105 sentences (around 10%) from our development data prepared in the revised representation. Our rule-based system accuracy for the sample is about 94.1% and our statistical system accuracy is 96.2%. Table 2 classifies the different types of errors found. The first and second rows list all the errors made by the statistical and rule-based systems, respectively. The third row lists errors made by the statistical system only. The fourth row lists errors made by the rule-based system only. And the fifth row lists errors made by both. The second column indicates the count of all errors. The rest of the columns specify the error types: system errors, gold POS errors or gold tree errors. The gold POS and tree errors are treebank errors that misguide our systems. They represent 69% of all statistical system errors and 86% of all rule-based system errors. Gold POS errors represent around 35-40% of all gold errors. They most commonly include the wrong POS tag or the wrong case. One example of such errors is the mis-annotation of the ACC case as GEN for a diptote nominal (the two are indistinguishable out of context). Gold tree errors are primarily errors in the dash-tags used (or missing) in the treebank, or attachment errors that are inconsistent with the gold POS tag.

The rule-based system errors involve various constructions that were not addressed in our study, e.g. flat adjectival phrases or non-S constructions at the highest level in a tree (e.g. FRAG or NP). The majority of the statistical system errors involve agreement decisions and incorrect choice of case despite the presence of the dash-tags. The ratio of system errors for the statistical system is 31% (twice that of the rule-based system's 14%). Thus, it seems that the statistical system manages to learn some of the erroneous noise in the treebank.

ERRORS                     COUNT   SYSTEM   GOLD POS   GOLD TREE
All Statistical            45      14       11         20
All Rule-based             70      10       24         36
Statistical only           13      11       0          2
Rule-based only            38      7        13         18
Statistical ∩ Rule-based   32      3        11         18

Table 2: Results of Error Analysis

9 Discussion

9.1 Accomplishments

We have developed a system that determines case for nominals in MSA. This task is a major source of errors in full diacritization of Arabic. We use a gold-standard syntactic tree, and obtain an error rate of about 4.2%, with a machine learning based system outperforming a system using hand-written rules. A careful error analysis suggests that when we account for annotation errors in the gold standard, the error rate drops to 0.8%, with the hand-written rules outperforming the machine learning-based system.

9.2 Lessons Learned

We can draw several general conclusions from our experiments.

• The features relevant for the prediction of complex linguistic phenomena cannot necessarily be easily read off from the given representation of the data. Sometimes, due to data sparseness and/or limitations in the machine learning paradigm used, we need to extract features from the available representation in a manner that profoundly changes the representation (as is done in bilexical parsing (Collins, 1997)). Such transformations require a deep understanding of the linguistic phenomena on the part of the researchers.

• Researchers developing hand-written rules may follow an empirical methodology in natural language processing if they use data sets to develop and test the rules — the only true methodological difference between machine learning and this kind of hand-writing of rules is the type of learning (human or machine). For certain phenomena, machine learning may result in only a small or no improvement in performance over hand-written rules.

• Error analysis remains a crucial part of any empirical work in natural language processing. Not only does it contribute insight into how the system can be improved, it also reveals problems with the underlying data. Sometimes the problems are just part of the noise in the data, but sometimes the problems can be fixed. Annotations on data are not themselves naturally occurring data and thus may be subject to critique. Note that an error analysis requires a good understanding of the linguistic phenomena and of the data.

9.3 Outlook

Our work was motivated in two ways: to help treebanking, and to develop tools for automatic case determination from unannotated text. For the first goal, our error analysis has shown that 86% of the errors found by our hand-written rules are in fact treebank errors. Furthermore, we suspect that the hand-written rules have very few false positives (i.e., cases in which the treebank has been annotated in error but our rules predict exactly that error). Thus we believe that our tool can serve an important function in improving the treebank annotation.

For our second motivation, the next step will be to adapt our feature extraction to work on the output of parsers, which typically exclude dash-tags. We note that for many contexts, we do not currently rely on dash-tags but rather identify the relevant structures on our own (such as idafa, tamyiz, and so on). We suspect that the machine learning-based approach will outperform the hand-written rules, as it can learn typical errors the parser makes. As the treebank will soon be revised and hand-checked, we will postpone this work until the new release of the treebank, which will allow us to train better parsers as the data will be more consistent.

Acknowledgements

The research presented here was supported by the Defense Advanced Research Projects Agency (DARPA) under Contract Nos. HR0011-06-C-0023, HR0011-06-C-0022 and HR0011-06-1-0003. Any opinions, findings and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of DARPA.

References

Ron Buckley. 2004. Modern Literary Arabic: A Reference Grammar. Librairie du Liban.

Tim Buckwalter. 2002. Buckwalter Arabic morphological analyzer version 1.0.

Michael Collins. 1997. Three Generative, Lexicalised Models for Statistical Parsing. In Proceedings of the 35th Annual Meeting of the ACL (jointly with the 8th Conference of the EACL), pages 16–23, Madrid, Spain.

Nizar Habash and Owen Rambow. 2007. Arabic Diacritization through Full Morphological Tagging. In Proceedings of the 8th Meeting of the North American Chapter of the Association for Computational Linguistics/Human Language Technologies Conference (HLT-NAACL07).

Nizar Habash, Abdelhadi Soudi, and Tim Buckwalter. 2007. On Arabic Transliteration. In A. van den Bosch and A. Soudi, editors, Arabic Computational Morphology: Knowledge-based and Empirical Methods. Springer.

Clive Holes. 2004. Modern Arabic: Structures, Functions, and Varieties. Georgetown University Press. Revised Edition.

Mohamed Maamouri, Ann Bies, Tim Buckwalter, and Wigdan Mekki. 2004. The Penn Arabic Treebank: Building a Large-Scale Annotated Arabic Corpus. In NEMLAR Conference on Arabic Language Resources and Tools, pages 102–109, Cairo, Egypt.

Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.

Rani Nelken and Stuart Shieber. 2005. Arabic Diacritization Using Weighted Finite-State Transducers. In Proceedings of the Workshop on Computational Approaches to Semitic Languages at the 43rd Meeting of the Association for Computational Linguistics (ACL'05), pages 79–86, Ann Arbor, Michigan.

Otakar Smrž and Jan Hajič. 2006. The Other Arabic Treebank: Prague Dependencies and Functions. In Ali Farghaly, editor, Arabic Computational Linguistics: Current Implementations. CSLI Publications.

Dimitra Vergyri and Katrin Kirchhoff. 2004. Automatic Diacritization of Arabic for Acoustic Modeling in Speech Recognition. In Ali Farghaly and Karine Megerdoomian, editors, COLING 2004 Workshop on Computational Approaches to Arabic Script-based Languages, pages 66–73, Geneva, Switzerland.

Imed Zitouni, Jeffrey S. Sorensen, and Ruhi Sarikaya. 2006. Maximum Entropy Based Restoration of Arabic Diacritics. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 577–584, Sydney, Australia.
