Semantic Role Labeling Systems for Arabic Using Kernel Methods

Semantic Role Labeling Systems for Arabic using Kernel Methods Mona Diab Alessandro Moschitti Daniele Pighin CCLS, Columbia University DISI, University of Trento FBK-irst; DISI, University of Trento New York, NY 10115, USA Trento, I-38100, Italy Trento, I-38100, Italy [email protected] [email protected] [email protected] Abstract and the subject of the second, it is the ‘theme’ in There is a widely held belief in the natural lan- both sentences. Same idea applies to passive con- guage and computational linguistics commu- structions, for example. nities that Semantic Role Labeling (SRL) is There is a widely held belief in the NLP and com- a significant step toward improving important putational linguistics communities that identifying applications, e.g. question answering and in- and defining roles of predicate arguments in a sen- formation extraction. In this paper, we present an SRL system for Modern Standard Arabic tence has a lot of potential for and is a significant that exploits many aspects of the rich mor- step toward improving important applications such phological features of the language. The ex- as document retrieval, machine translation, question periments on the pilot Arabic Propbank data answering and information extraction (Moschitti et show that our system based on Support Vector al., 2007). Machines and Kernel Methods yields a global To date, most of the reported SRL systems are for SRL F1 score of 82.17%, which improves the English, and most of the data resources exist for En- current state-of-the-art in Arabic SRL. glish. We do see some headway for other languages 1 Introduction such as German and Chinese (Erk and Pado, 2006; Sun and Jurafsky, 2004). The systems for the other Shallow approaches to semantic processing are mak- languages follow the successful models devised for ing large strides in the direction of efficiently and English, e.g. (Gildea and Jurafsky, 2002; Gildea and effectively deriving tacit semantic information from Palmer, 2002; Chen and Rambow, 2003; Thompson text. Semantic Role Labeling (SRL) is one such ap- et al., 2003; Pradhan et al., 2003; Moschitti, 2004; proach. With the advent of faster and more power- Xue and Palmer, 2004; Haghighi et al., 2005). In the ful computers, more effective machine learning al- same spirit and facilitated by the release of the Se- gorithms, and importantly, large data resources an- mEval 2007 Task 18 data1, based on the Pilot Arabic notated with relevant levels of semantic information, Propbank, a preliminary SRL system exists for Ara- such as the FrameNet (Baker et al., 1998) and Prob- bic2 (Diab and Moschitti, 2007; Diab et al., 2007a). Bank (Kingsbury and Palmer, 2003), we are seeing However, it did not exploit some special character- a surge in efficient approaches to SRL (Carreras and istics of the Arabic language on the SRL task. Màrquez, 2005). In this paper, we present an SRL system for MSA SRL is the process by which predicates and their that exploits many aspects of the rich morphological arguments are identified and their roles are defined features of the language. It is based on a supervised in a sentence. For example, in the English sen- model that uses support vector machines (SVM) tence, ‘John likes apples.’, the predicate is ‘likes’ technology (Vapnik, 1998) for argument boundary whereas ‘John’ and ‘apples’, bear the semantic role detection and argument classification. It is trained labels agent (ARG0) and theme (ARG1). The cru- and tested using the pilot Arabic Propbank data re- cial fact about semantic roles is that regardless of leased as part of the SemEval 2007 data. Given the the overt syntactic structure variation, the underly- lack of a reliable Arabic deep syntactic parser, we ing predicates remain the same. Hence, for the sentence ‘John opened the door’ and ‘the door opened’, 1http://nlp.cs.swarthmore.edu/semeval/ though ‘the door’ is the object of the first sentence 2We use Arabic to refer to Modern Standard Arabic (MSA). 798 Proceedings of ACL-08: HLT, pages 798–806, Columbus, Ohio, USA, June 2008. c 2008 Association for Computational Linguistics use gold standard trees from the Arabic Tree Bank the VSO constructions, the verb agrees with the syn- (ATB) (Maamouri et al., 2004). tactic subject in Gender only, while in the SVO con- This paper is laid out as follows: Section 2 structions, the verb agrees with the subject in both presents facts about the Arabic language especially Number and Gender. Even though, in the ATB, an in relevant contrast to English; Section 3 presents equal distribution of both VSO and SVO is observed the approach and system adopted for this work; Sec- (each appearing 30% of the time), it is known that tion 4 presents the experimental setup, results and in general Arabic is predominantly in VSO order. discussion. Finally, Section 5 draws our conclu- Moreover, the pro-drop cases could effectively be sions. perceived as VSO orders for the purposes of SRL. 2 Arabic Language and Impact on SRL Syntactic Case is very important in the cases of VSO and pro-drop constructions as they indicate the syn- Arabic is a very different language from English in tactic roles of the object arguments with accusative several respects relevant to the SRL task. Arabic is a Case. Unless the morphology of syntactic Case is semitic language. It is known for its templatic mor- explicitly present, such free word order could run phology where words are made up of roots and af- the SRL system into significant confusion for many fixes. Clitics agglutinate to words. Clitics include of the predicates where both arguments are semanti- prepositions, conjunctions, and pronouns. cally of the same type. In contrast to English, Arabic exhibits rich mor- Arabic exhibits more complex noun phrases than phology. Similar to English, Arabic verbs explic- English mainly to express possession. These con- itly encode tense, voice, Number, and Person fea- structions are known as idafa constructions. Mod- tures. Additionally, Arabic encodes verbs with Gen- ern standard Arabic does not have a special parti- der, Mood (subjunctive, indicative and jussive) in- cle expressing possession. In these complex struc- formation. For nominals (nouns, adjectives, proper tures a surface indefinite noun (missing an explicit names), Arabic encodes syntactic Case (accusative, definite article) may be followed by a definite noun genitive and nominative), Number, Gender and Def- marked with genitive Case, rendering the first noun initeness features. In general, many of the morpho- syntactically definite. For example, I JË@ ÉgP rjl logical features of the language are expressed via . 3 Albyt ‘man the-house’ meaning ‘man of the house’, short vowels also known as diacritics . ÉgP becomes definite. An adjective modifying the Unlike English, syntactically Arabic is a pro-drop . noun Ég. P will have to agree with it in Number, language, where the subject of a verb may be im- Gender, Definiteness, and Case. However, with- plicitly encoded in the verb morphology. Hence, we Q out explicit morphological encoding of these agree- observe sentences such as ÈA®K .Ë@ É¿@ Akl AlbrtqAl ments, the scope of the arguments would be con- ‘ate-[he] the-oranges’, where the verb Akl encodes fusing to an SRL system. In a sentence such as the third Person Masculine Singular subject in the ÉK ñ¢Ë@ I J.Ë@ Ég. P rjlu Albyti AlTwylu meaning ‘the verbal morphology. It is worth noting that in the tall man of the house’: ‘man’ is definite, masculine, ATB 35% of all sentences are pro-dropped for sub- singular, nominative, corresponding to Definiteness, ject (Maamouri et al., 2006). Unless the syntactic Gender, Number and Case, respectively; ‘the-house’ parse is very accurate in identifying the pro-dropped is definite, masculine, singular, genitive; ‘the-tall’ is case, identifying the syntactic subject and the under- definite, masculine, singular, nominative. We note lying semantic arguments are a challenge for such that ‘man’ and ‘tall’ agree in Number, Gender, Case pro-drop cases. and Definiteness. Syntactic Case is marked using Arabic syntax exhibits relative free word order. short vowels u, and i at the end of the word. Hence, Arabic allows for both subject-verb-object (SVO) rjlu and AlTwylu agree in their Case ending5 With- 4 and verb-subject-object (VSO) argument orders. In out the explicit marking of the Case information, 3Diacritics encode the vocalic structure, namely the short vowels, as well as the gemmination marker for consonantal dou- 5The presence of the Albyti is crucial as it renders rjlu defi- bling, among other markers. nite therefore allowing the agreement with AlTwylu to be com- 4MSA less often allows for OSV, or OVS. plete. 799 S VP NPARGM−TMP VBDpredicate NPARG0 NPARG1 NP @YK. NP NP NP PP NN JJ started YgB@ úæAÖ Ï@ NN NP NNP NNP NN JJ IN NP úm'ðP Þ Sunday past KP NN JJ ðP . èPAK P éJ Ö P È NNP president Zhu Rongji visit official to Z@PPñË@ ú æJ Ë@ YJêË@ ministers Chinese India Figure 1: Annotated Arabic Tree corresponding to ‘Chinese Prime minister Zhu Rongjy started an official visit to India last Sunday.’ namely in the word endings, it could be equally valid Feature Name Description Predicate Lemmatization of the predicate word that ‘the-tall’ modifies ‘the-house’ since they agree Path Syntactic path linking the predicate and in Number, Gender and Definiteness as explicitly an argument, e.g. NN↑NP↑VP↓VBX Partial path Path feature limited to the branching of marked by the Definiteness article Al. Hence, these the argument idafa constructions could be tricky for SRL in the No-direction path Like Path without traversal directions absence of explicit morphological features. This is Phrase type Syntactic type of the argument node Position Relative position of the argument with compounded by the general absence of short vowels, respect to the predicate expressed by diacritics (i.e.

Load more