Adapting Predicate Frames for Urdu PropBanking

Riyaz Ahmad Bhat♣, Naman Jain♣, Dipti Misra Sharma♣, Ashwini Vaidya♠, Martha Palmer♠, James Babani♠ and Tafseer Ahmed♦
♣ LTRC, IIIT-H, Hyderabad, India
♠ University of Colorado, Boulder, CO 80309 USA
♦ DHA Suffa University, Karachi, Pakistan
{riyaz.bhat, naman.jain}@research.iiit.ac.in, [email protected], {vaidyaa, mpalmer, james.babani}@colorado.edu, [email protected]

Abstract

Hindi and Urdu are two standardized registers of what has been called the Hindustani language, which belongs to the Indo-Aryan language family. Although both varieties share a common grammar, they differ significantly in their vocabulary, to an extent where they become mutually incomprehensible (Masica, 1993). Hindi draws its vocabulary from Sanskrit, while Urdu draws its vocabulary from Persian, Arabic and even Turkish. In this paper, we present our efforts to adapt frames of nominal and verbal predicates that Urdu shares with either Hindi or Arabic for Urdu PropBanking. We discuss the feasibility of porting such frames from either of the sources (Arabic or Hindi) and also present a simple and reasonably accurate method to automatically identify the origin of Urdu words, which is a necessary step in the process of porting such frames.

1 Introduction

Hindi and Urdu, spoken primarily in northern India and Pakistan, are socially and even officially considered two different language varieties. However, such a division between the two is not established linguistically. They are two standardized registers of what has been called the Hindustani language, which belongs to the Indo-Aryan language family. Masica (1993) explains that, while they are different languages officially, they are not even different dialects or sub-dialects in a linguistic sense; rather, they are different literary styles based on the same linguistically defined sub-dialect. He further explains that at the colloquial level, Hindi and Urdu are nearly identical, both in terms of core vocabulary and grammar. However, at formal and literary levels, vocabulary differences begin to loom much larger (Hindi drawing its higher lexicon from Sanskrit and Urdu from Persian and Arabic), to the point where the two styles/languages become mutually unintelligible. In written form, not only the vocabulary but also the way Urdu and Hindi are written makes one believe that they are two separate languages: they are written in separate orthographies, Hindi in Devanagari and Urdu in a modified Persio-Arabic script. Given such (apparent) divergences between the two varieties, two parallel treebanks are being built under the Hindi-Urdu treebanking project (Bhatt et al., 2009; Xia et al., 2009). Both treebanks follow a multi-layered and multi-representational framework which features Dependency, PropBank and Phrase Structure annotations. Of the two, the Hindi treebank is ahead of the Urdu treebank across all annotation layers. In the case of PropBanking, the Hindi treebank has made considerable progress while Urdu PropBanking has just started.

The creation of predicate frames is the first step in PropBanking, which is followed by the actual annotation of verb instances in corpora. In this paper, we look at the possibility of porting related frames from the Arabic and Hindi PropBanks for Urdu PropBanking. Given that Urdu shares its vocabulary with Arabic, Hindi and Persian, we look at verbal and nominal predicates that Urdu shares with these languages and try to port and adapt their frames from the respective PropBanks instead of creating them afresh. This implies that identification of the source of Urdu predicates becomes a necessary step in this process: in order to port the relevant frames, we need to first identify the source of Urdu predicates and then extract their frames from the related PropBanks. Stated briefly, we present the following as the contributions of this paper:

• Automatic identification of the origin or source of Urdu vocabulary.


• Porting and adapting nominal and verbal predicate frames from the PropBanks of related languages.

The rest of the paper is organised as follows: in the next section we discuss the Hindi-Urdu treebanking project, with a focus on PropBanking. In Section 3, we discuss our efforts to automatically identify the source of Urdu vocabulary, and in Section 4 we discuss the process of adapting and porting Arabic and Hindi frames for Urdu PropBanking. Finally, we conclude with some future directions in Section 5.

2 A multi-layered, multi-representational treebank

Compared to other existing treebanks, the Hindi/Urdu treebanks (HTB/UTB) are unusual in that they are multi-layered. They contain three layers of annotation: dependency structure (DS) for the annotation of modified-modifier relations, PropBank-style annotation (PropBank) for predicate-argument structure, and an independently motivated phrase structure (PS). Each layer has its own framework, annotation scheme, and detailed annotation guidelines. Due to lack of space, and given its relevance to our work, we only look at PropBanking here, with reference to the Hindi PropBank.

2.1 PropBank Annotation

The first PropBank, the English PropBank (Kingsbury and Palmer, 2002), originated as a one-million word subset of the Wall Street Journal (WSJ) portion of Penn Treebank II (an English phrase structure treebank). The verbs in the PropBank are annotated with predicate-argument structures and provide semantic role labels for each syntactic argument of a verb. Although these labels were deliberately chosen to be generic and theory-neutral (e.g., ARG0, ARG1), they are intended to consistently annotate the same semantic role across syntactic variations. For example, in both the sentences "John broke the window" and "The window broke", 'the window' is annotated as ARG1 and as bearing the role of 'Patient'. This reflects the fact that this argument bears the same semantic role in both cases, even though it is realized as the structural subject in one sentence and as the object in the other. This is the primary difference between PropBank's approach to semantic role labels and the Paninian approach to karaka labels, which it otherwise resembles closely. PropBank's ARG0 and ARG1 can be thought of as similar to Dowty's prototypical 'Agent' and 'Patient' (Dowty, 1991). PropBank provides, for each sense of each annotated verb, its "roleset", i.e., the possible arguments of the predicate, their labels and all possible syntactic realizations. The primary goal of PropBank is to supply consistent, simple, general-purpose labeling of semantic roles for a large quantity of coherent text that can provide training data for supervised algorithms, in the same way that the Penn Treebank supported the training of statistical syntactic parsers.

2.1.1 Hindi PropBank

The Hindi PropBank project has differed significantly from other PropBank projects in that the semantic role labels are annotated on dependency trees rather than on phrase structure trees. However, it is similar in that semantic roles are defined on a verb-by-verb basis and the description at the verb-specific level is fine-grained; e.g., a verb like 'hit' will have a 'hitter' and a 'hittee'. These verb-specific roles are then grouped into broader categories using numbered arguments (ARG). Each verb can also have a set of modifiers not specific to the verb (ARGM). In Table 1, PropBank-style semantic roles are listed for the simple verb de 'to give'; the numbered arguments correspond to the giver, the thing given and the recipient. Frame file definitions are created manually and include role information as well as a unique roleset ID (e.g., de.01 in Table 1), which is assigned to every sense of a verb. In addition, for Hindi the frame file also includes the transitive and causative forms of the verb (if any). Thus, the frame file for de 'give' will also include dilvaa 'cause to give'.

de.01: to give
  Arg0  the giver
  Arg1  thing given
  Arg2  recipient

Table 1: A Frame File

The annotation process for the PropBank takes place in two stages: the creation of frame files for individual verb types, and the annotation of predicate-argument structures for each verb instance. The annotation of each predicate in the corpus is carried out based on its frame file definitions.
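To make the structure of such a frame file concrete, the roleset in Table 1 can be pictured as a small record keyed by its roleset ID. The following Python sketch is purely illustrative: the actual Hindi/Urdu frame files are XML documents created with Cornerstone, and the class and field names below are our own, not part of the released resources.

from dataclasses import dataclass
from typing import Dict

@dataclass
class Roleset:
    """One sense of a predicate, e.g. de.01 'to give'."""
    roleset_id: str        # unique roleset ID, e.g. "de.01"
    gloss: str             # short description of the sense
    roles: Dict[str, str]  # numbered argument -> role description

# In-memory rendering of Table 1; a real frame file would also list
# the transitive/causative forms of the verb (e.g. dilvaa 'cause to give').
de_01 = Roleset(
    roleset_id="de.01",
    gloss="to give",
    roles={"Arg0": "the giver", "Arg1": "thing given", "Arg2": "recipient"},
)

if __name__ == "__main__":
    for arg, description in de_01.roles.items():
        print(de_01.roleset_id, arg, description, sep="\t")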

The PropBank makes use of two annotation tools, viz. Jubilee (Choi et al., 2010b) and Cornerstone (Choi et al., 2010a), for PropBank instance annotation and PropBank frame file creation respectively. For the annotation of the Hindi and Urdu PropBanks, the Jubilee annotation tool had to be modified to display dependency trees and to provide additional labels for the annotation of empty arguments.

3 Identifying the Source of Urdu Vocabulary

Predicting the source of a word is similar to language identification, where the task is to identify the language a given document is written in. However, language identification at the word level is more challenging than a typical document-level language identification problem: the number of features available at the document level is much higher than at the word level. The available features for word-level identification are word morphology, syllable structure and the phonemic (letter) inventory of the language(s).

In the case of Urdu, the problem is even more complex, as borrowed words do not necessarily carry the inflections of their source language and do not retain their identity as such (they undergo phonetic changes as well). For example, khabar 'news', which is an Arabic word, declines as per the morphological paradigm of feminine nominals in Hindi and Urdu, as shown in Table (2).

         Singular   Plural
Direct   khabar     khabarain
Oblique  khabar     khabaron

Table 2: Morphological Paradigm of khabar

However, despite such challenges, if we look at the character histogram in Figure (1), we can still identify the source of a sufficiently large portion of the Urdu vocabulary just by using letter-based heuristics. For example, neither Arabic nor Persian has aspirated consonants such as /bʱ/, /pʰ/ (aspirated bilabial plosives), /tʃʰ/, /dʒʱ/ (aspirated affricates), /ɖʱ/ (aspirated retroflex plosive) or /gʱ/, /kʰ/ (aspirated velar plosives), while Hindi does. Similarly, the following sounds occur only in Arabic and Persian: /ʒ/ (postalveolar fricative), /θ/, /ð/ (dental fricatives), /ħ/ (pharyngeal fricative), /χ/ (uvular fricative), etc.1 Using these heuristics, we could identify 2,682 types as Indic and 3,968 as either Persian or Arabic, out of 12,223 unique types in the Urdu treebank (Bhat and Sharma, 2012).

1 http://www.langsci.ucl.ac.uk/ipa/IPA_chart_%28C%292005.pdf

This explains the efficiency of n-gram based approaches to both document-level and word-level language identification, as reported in the recent literature on the problem (Dunning, 1994; Elfardy and Diab, 2012; King and Abney, 2013; Nguyen and Dogruoz, 2014; Lui et al., 2014). In order to predict the source of an Urdu word, we frame two classification tasks: (1) binary classification into Indic and Persio-Arabic and (2) tri-class classification into Arabic, Indic and Persian. Both problems are modeled using smoothed n-gram based language models.

3.1 N-gram Language Models

Given a word w to classify into one of k classes c_1, c_2, ..., c_k, we choose the class with the maximum conditional probability:

  c* = arg max_{c_i} p(c_i | w) = arg max_{c_i} p(w | c_i) · p(c_i)        (1)

The prior distribution p(c) of a class is estimated from the respective training sets shown in Table (3). Each training set is used to train a separate letter-based language model to estimate the probability of a word w. The language model p(w) is implemented as an n-gram model using the IRSTLM toolkit (Federico et al., 2008) with Kneser-Ney smoothing, and is defined as:

  p(w) = ∏_{i=1}^{n} p(l_i | l_{i−k}^{i−1})        (2)

where l_i is a letter and k is a parameter indicating the amount of context used (e.g., k = 4 means a 5-gram model).
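The classification rule in Equations (1) and (2) is easy to prototype. The sketch below is a simplified stand-in for the setup just described: it replaces IRSTLM's Kneser-Ney models with add-one smoothed letter n-grams, and the tiny romanized word lists are placeholders for the etymological data of Section 3.2.

import math
from collections import Counter

BOUNDARY = "#"

def letter_ngrams(word, n):
    """Letter n-grams of a word, padded with boundary symbols."""
    padded = BOUNDARY * (n - 1) + word + BOUNDARY
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

class LetterNgramLM:
    """Add-one smoothed letter n-gram model (a stand-in for Kneser-Ney)."""

    def __init__(self, words, n=4):
        self.n = n
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = {BOUNDARY}
        for word in words:
            self.vocab.update(word)
            for gram in letter_ngrams(word, n):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1

    def logprob(self, word):
        """log p(w) in the spirit of Equation (2), with add-one smoothing."""
        vsize = len(self.vocab)
        return sum(
            math.log((self.ngram_counts[g] + 1) /
                     (self.context_counts[g[:-1]] + vsize))
            for g in letter_ngrams(word, self.n)
        )

def classify(word, models, priors):
    """Equation (1): arg max over classes of p(w | c) * p(c), in log space."""
    return max(models, key=lambda c: models[c].logprob(word) + math.log(priors[c]))

if __name__ == "__main__":
    # Toy romanized training lists; the real training data comes from the OUD.
    train = {
        "Indic": ["ghar", "paanii", "bhaaii", "dekhnaa", "khaanaa"],
        "Persio-Arabic": ["khabar", "kitaab", "mohabbat", "zindagii", "dard"],
    }
    total = sum(len(words) for words in train.values())
    priors = {c: len(words) / total for c, words in train.items()}
    models = {c: LetterNgramLM(words, n=3) for c, words in train.items()}
    print(classify("kitaabain", models, priors))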

Figure 1: Relative Distribution of Arabic, Hindi, Persian and Urdu Alphabets (Consonants only)
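The relative frequencies summarized in Figure 1 amount to per-language character counts normalized by the total. A minimal sketch, assuming one plain-text word list per language (the file names are hypothetical):

from collections import Counter

def char_distribution(words):
    """Relative frequency of each character over a word list."""
    counts = Counter()
    for word in words:
        counts.update(word)
    total = sum(counts.values())
    return {ch: c / total for ch, c in counts.items()}

if __name__ == "__main__":
    # Hypothetical file names; each file is assumed to hold one word per line.
    for lang in ("arabic", "hindi", "persian", "urdu"):
        with open(lang + "_wordlist.txt", encoding="utf-8") as handle:
            words = [line.strip() for line in handle if line.strip()]
        dist = char_distribution(words)
        frequent = sorted(dist.items(), key=lambda item: item[1], reverse=True)[:10]
        print(lang, frequent)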

3.2 Etymological Data

In order to prepare training and testing data marked with etymological information for our classification experiments, we used the Online Urdu Dictionary2 (henceforth OUD). OUD has been prepared under the supervision of the e-government Directorate of Pakistan3. Apart from basic definitions and meanings, it provides etymological information for more than 120K Urdu words. Since the dictionary is freely available4 and, unlike manual annotation, requires no expertise for the extraction of word etymology, we could mark the etymological information on a reasonably sized word list in a limited time frame. The statistics are provided in Table (3). We use Indic as a cover term for all the words that come from Sanskrit, Prakrit, Hindi or local languages.

2 http://182.180.102.251:8081/oud/default.aspx
3 www.e-government.gov.pk
4 We are not aware of an offline version of OUD.

Language   Data Size   Average Token Length
Arabic     6,524       6.8
Indic      3,002       5.5
Persian    4,613       6.5

Table 3: Statistics of Etymological Data

3.3 Experiments

We carried out a number of experiments in order to explore the effect of data size and of the order of the n-gram models on classification performance. By varying the size of the training data, we wanted to identify the lower bound on the training size with respect to classification performance. We varied the training size per iteration by 1%, for n-grams of order 1-5, for both classification problems. For each n-gram order, 100 experiments were carried out, i.e., overall 800 experiments for binary and tri-class classification. The impact of training size on classification performance is shown in Figures (2) and (3) for binary and tri-class classification respectively. As expected, at every iteration the additional data points introduced into the training data increased the performance of the model. With a mere 3% of the training data, we reach a reasonable accuracy of 0.85 in terms of F-score for binary classification; for tri-class classification we reach the same accuracy with 6% of the data.
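The experimental protocol just described (training size increased in 1% steps, n-gram order varied from 1 to 5, classification performance recorded at each point) reduces to a nested loop. The sketch below shows only the protocol; classify_fn is a placeholder for the actual n-gram classifier (such as the one sketched after Section 3.1), and the trivial majority-class stand-in at the bottom is there only so the skeleton runs.

def macro_f1(gold, predicted):
    """Macro-averaged F1 over the label set."""
    labels = set(gold)
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * precision * recall / (precision + recall)
                      if precision + recall else 0.0)
    return sum(scores) / len(scores)

def learning_curves(train, test, classify_fn, orders=range(1, 6), steps=100):
    """For each n-gram order, grow the training data in 1% steps and record F1."""
    gold = [label for _, label in test]
    curves = {}
    for n in orders:
        points = []
        for step in range(1, steps + 1):
            fraction = step / steps
            subset = {c: words[:max(1, int(fraction * len(words)))]
                      for c, words in train.items()}
            predicted = [classify_fn(subset, word, n) for word, _ in test]
            points.append((fraction, macro_f1(gold, predicted)))
        curves[n] = points
    return curves

if __name__ == "__main__":
    # Placeholder classifier: always predicts the currently largest class.
    def classify_fn(subset, word, n):
        return max(subset, key=lambda c: len(subset[c]))

    train = {"Indic": ["ghar", "paanii", "bhaaii", "dekhnaa"],
             "Persio-Arabic": ["khabar", "kitaab", "zindagii", "mohabbat", "dard"]}
    test = [("roTii", "Indic"), ("dil", "Persio-Arabic")]
    curves = learning_curves(train, test, classify_fn, steps=10)
    print(curves[1][-1])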

Similarly, we tried different n-gram orders to quantify the effect of character context on classification performance. As with the increase in data size, increasing the n-gram order profoundly improved the results. In both classification tasks, unigram-based models converge faster than the higher-order n-gram based models. The obvious reason for this is the small, finite set of characters that a language operates with (∼37 in Arabic, ∼39 in Persian and ∼48 in Hindi): a small set of words (unique, in our case) is probably enough to capture at least a single instance of each character. As no new n-grams are introduced by subsequent additions of new tokens to the training data, the accuracy stabilizes. The accuracy with higher-order n-grams, however, kept increasing with the data size, though the gains were marginal beyond 5-grams. The abrupt increase after 8,000 training instances is probably due to the addition of previously unseen character sequences to the training data; in particular, the recall on Persio-Arabic increased by 2.2%.

Figure 2: Learning Curves for Binary Classification of Urdu Vocabulary

Figure 3: Learning Curves for Tri-class Classification of Urdu Vocabulary

3.4 Results

We performed 10-fold cross validation over all the instances of the etymological data for both the binary and tri-class classification tasks. We split the data into training and testing sets with a ratio of 80:20 using stratified sampling, which distributes the samples of each class across the training and testing sets in the same proportion as in the complete set. For all 10 folds, the n-gram order was again varied from 1-5. Tables (4) and (5) show the consolidated results for these tasks, together with a frequency-based baseline against which to evaluate the classification performance. In both tasks, we achieved the highest accuracy with language models trained on 5-gram letter sequences. The best results in terms of F-score are 0.96 and 0.93 for binary and tri-class classification respectively.
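The stratified 80:20 split used above can be realized in a few lines: the items of each class are shuffled and cut separately, so the class proportions in the training and testing sets mirror those of the complete set. A sketch with toy data; the real experiments draw the (word, label) pairs from the OUD-derived lists and repeat the procedure over ten folds.

import random
from collections import defaultdict

def stratified_split(items, test_ratio=0.2, seed=0):
    """Split (word, label) pairs so each label keeps the same proportion in both sets."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for word, label in items:
        by_label[label].append(word)
    train, test = [], []
    for label, words in by_label.items():
        shuffled = words[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * (1 - test_ratio))
        train.extend((w, label) for w in shuffled[:cut])
        test.extend((w, label) for w in shuffled[cut:])
    return train, test

if __name__ == "__main__":
    toy = [("ghar", "Indic"), ("paanii", "Indic"), ("bhaaii", "Indic"),
           ("dekhnaa", "Indic"), ("khaanaa", "Indic"),
           ("khabar", "Persio-Arabic"), ("kitaab", "Persio-Arabic"),
           ("zindagii", "Persio-Arabic"), ("mohabbat", "Persio-Arabic"),
           ("dard", "Persio-Arabic")]
    train_set, test_set = stratified_split(toy)
    print(len(train_set), "train /", len(test_set), "test")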

Type       Precision (P)   Recall (R)   F1-Score (F)
Baseline   0.40            0.50         0.40
1-gram     0.89            0.89         0.89
2-gram     0.95            0.95         0.95
3-gram     0.96            0.96         0.96
4-gram     0.96            0.96         0.96
5-gram     0.96            0.96         0.96

Table 4: Results of 10-fold Cross Validation on Binary Classification

Type       Precision (P)   Recall (R)   F1-Score (F)
Baseline   0.15            0.33         0.21
1-gram     0.83            0.83         0.83
2-gram     0.89            0.89         0.89
3-gram     0.91            0.91         0.91
4-gram     0.93            0.93         0.93
5-gram     0.93            0.93         0.93

Table 5: Results of 10-fold Cross Validation on Tri-Class Classification

Although we have achieved quite reasonable accuracies in both tasks, a closer look at the confusion matrices in Tables (6) and (7) shows that we can still improve the accuracies by balancing the size of the data across classes. In binary classification, our model is biased towards Persio-Arabic, as the data is highly imbalanced: it misclassifies 0.86% of Indic tokens as Persio-Arabic, since the prior probability of the latter is much higher than that of the former. In the case of tri-class classification, using higher-order n-gram models resolves the prominent confusion between Arabic and Persian; since Arabic and Persian share almost the same phonetic inventory, working with lower-order n-gram models does not seem ideal.

Class           Indic   Persio-Arabic
Indic           235     60
Persio-Arabic   15      1,057

Table 6: Confusion Matrix of Binary Classification

Class     Arabic   Indic   Persian
Arabic    605      5       26
Indic     11       268     18
Persian   22       9       415

Table 7: Confusion Matrix of Tri-class Classification

4 Adapting Frames from Arabic and Hindi PropBanks

As discussed in Section 2.1.1, the creation of predicate frames precedes the actual annotation of verb instances in a given corpus. In this section, we describe our approach to this first stage of Urdu PropBanking: adapting related predicate frames from the Arabic and Hindi PropBanks (Palmer et al., 2008; Vaidya et al., 2011). Since a PropBank is not available for Persian, we could only adapt those predicate frames which are shared with Arabic or Hindi.

Although Urdu shares or borrows most of its literary vocabulary from Arabic and Persian, it retains its simple verb inventory (as opposed to complex verbs) from its Indo-Aryan ancestry. Verbs from Arabic and Persian are borrowed less frequently, although there are examples such as khariid 'buy', farmaa 'say', etc.5 This overlap in the verb inventory between Hindi and Urdu might explain the fact that they share the same grammar.

5 Borrowed verbs often do not function as simple verbs; rather, they are used like nominals in complex predicate constructions, such as mehsoos in mehsoos karnaa 'to feel'.

The fact that Urdu shares its lexicon with these languages prompted us to explore the possibility of using their resources for Urdu PropBanking. We are in the process of adapting frames for those Urdu predicates that are shared with either Arabic or Hindi. Urdu frame file creation must be carried out for both simple verbs and complex predicates. Since Urdu differs very little from Hindi in its simple verb inventory, the frames for simple verbs can be ported easily, which simplifies the development process. However, this is not the case with nominal predicates: in Urdu, many nominal predicates are borrowed from Arabic or Persian, as shown in Table (8). Given that a PropBank for Persian is not available, the task of creating frames for nominal predicates in Urdu would have been fairly daunting without the Arabic PropBank as well.

           Simple Verbs      Nominal Predicates
Language   Total   Unique    Total    Unique
Arabic     12      1         6,780    765
Hindi      7,332   441       1,203    258
Persian    69      3         2,276    352
Total      7,413   445       10,259   1,375

Table 8: Urdu Treebank Predicate Statistics

4.1 Simple Verbs

The simple verb inventories of Urdu and Hindi are almost identical, so the main task was to locate and extract the relevant frames from the Hindi frame files. Fortunately, with the exception of farmaa 'say', all the other simple verbs which Urdu borrows from Persian or Arabic (cf. Table (8)) were also borrowed by Hindi. Therefore, the Hindi simple verb frame files sufficed for porting frames for Urdu simple verbs. No significant differences were found between the Urdu and Hindi rolesets, which describe either semantic variants of the same verb or its causative forms.

Further, in order to name the frame files with their corresponding Urdu lemmas, we used Konstanz's Urdu transliteration scheme (Malik et al., 2010) to convert a given lemma into its romanized form. The Hindi frame files use the WX transliteration scheme6, which is not appropriate for Urdu due to its lack of coverage of Persio-Arabic phonemes such as /dˤ/, the pharyngealized voiced alveolar stop. The frame files also contain example sentences for each predicate, in order to make the PropBank annotation task easier. While adapting the frame files from Hindi to Urdu, simply transliterating such examples for the Urdu predicates was not always an option, because sentences consisting of words of Sanskrit origin may not be understood by Urdu speakers. Hence, all the examples in the ported frames have been replaced with Urdu sentences by an Urdu expert.

6 http://en.wikipedia.org/wiki/WX_notation

In general, we find that Urdu verbs are quite similar to Hindi verbs, and this simplified our task of adapting the frames for simple verbs. The nouns, however, show more variation. Since a large proportion (up to 50%) of Urdu predicates are expressed using verb-noun complex predicates, nominal predicates play a crucial role in our annotation process and must be accounted for.

4.2 Complex Predicates

In the Urdu treebank, there are 17,672 predicates, of which more than half have been identified as noun-verb complex predicates (NVCs) at the dependency level. Typically, a noun-verb complex predicate such as chorii 'theft' karnaa 'to do' has two components: a noun chorii and a light verb karnaa, together giving the meaning 'steal'. The verbal component in NVCs has reduced predicating power (although it is inflected for person, number and gender agreement as well as tense, aspect and mood), and its nominal complement is considered the true predicate. In our annotation of NVCs, we follow a procedure common to all PropBanks, where we create frame files for the nominal, or 'true', predicate (Hwang et al., 2010). An example of a frame file for a noun such as chorii is given in Table (9).

Frame file for chorii-n(oun)
chorii.01: theft-n   light verb: kar 'do'; to steal
  Arg0  person who steals
  Arg1  thing stolen
chorii.02: theft-n   light verb: ho 'be/become'; to get stolen
  Arg1  thing stolen

Table 9: Frame file for the predicate noun chorii 'theft' with its two frequently occurring light verbs ho and kar. If other light verbs are found to occur, they are added as additional rolesets (chorii.03, chorii.04, and so on).

The creation of frame files for the set of true predicates that occur in NVCs is important from the point of view of linguistic annotation. Given the large number of NVCs, a semi-automatic method has been proposed for creating the Hindi nominal frame files, which saves the manual effort required to create frames for nearly 3,015 unique Hindi noun and light verb combinations (Vaidya et al., 2013).

For Urdu, the process of nominal frame file creation is preceded by the identification of the etymological origin of each nominal. If a nominal has an Indic or Arabic origin, the relevant frames are adapted for Urdu from the Hindi or Arabic PropBank. If, on the other hand, the Urdu nominal originates from Persian, frame creation will have to be done in the future, either manually or using other available Persian language resources.

As shown in Table (8), there are around 258 nominal predicates that are common to Hindi and Urdu, so we directly ported their frames from the Hindi PropBank with minor changes, as was done for the simple verb frames. Out of the 765 nominal predicates shared with Arabic, 308 nominal predicate frames have been ported to Urdu; 98 of these were already present in the Arabic PropBank and were ported as such. For the remaining 667 unique predicates, frames are being created manually by Arabic PropBanking experts and will be ported to Urdu once they become available.

Porting Arabic frames to Urdu is not trivial. We observed that while Urdu borrows vocabulary from Arabic, it does not borrow all the senses of some words. In such cases, the rolesets that are irrelevant to Urdu have to be discarded manually. Moreover, the example sentences for all the frames ported from the Arabic PropBank have to be sourced from the web or created manually by an Urdu expert, as was the case with the Hindi simple verbs.
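The routing step described in this section (look up a nominal's origin, then adapt its frames from the Hindi or Arabic PropBank, or defer it until Persian resources or manually created frames become available) can be summarized as a small dispatch function. A schematic sketch; the function and label names are ours, not part of the actual pipeline:

def frame_source(etymology):
    """Decide where the frames for a nominal predicate come from."""
    if etymology == "Indic":
        return "adapt from Hindi PropBank"
    if etymology == "Arabic":
        return "adapt from Arabic PropBank"
    # Persian-origin nominals: no Persian PropBank exists yet.
    return "create manually or from Persian resources (future work)"

if __name__ == "__main__":
    # Example nominals with etymology labels as the classifier of Section 3 might assign them.
    for noun, etymology in [("chorii", "Indic"), ("mehsoos", "Arabic"), ("khariid", "Persian")]:
        print(noun, "->", frame_source(etymology))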

5 Conclusion

In this paper we have exploited the overlap between the lexicons of Urdu, Arabic and Hindi for the creation of predicate frames for Urdu PropBanking. We presented a simple and accurate classifier for identifying the source, or origin, of Urdu vocabulary, which is a necessary step in the overall process of extracting predicate frames from the related PropBanks. For the simple verbs that occur in the Urdu treebank, we have extracted all the frames from the Hindi PropBank and adapted them for Urdu PropBanking. Similarly, for complex predicates, frames for the Urdu treebank's nominal predicates are extracted from the Hindi as well as the Arabic PropBank. Since a PropBank is not available for Persian, the creation of frames for predicates shared with Persian remains future work. We plan to create these frames either manually or semi-automatically, using the available Persian dependency treebanks (Rasooli et al., 2011; Rasooli et al., 2013).

Acknowledgments

We would like to thank Himani Chaudhry for her valuable comments, which helped to improve the quality of this paper. The work reported in this paper is supported by the NSF grant (Award Number: CNS 0751202; CFDA Number: 47.070). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

Riyaz Ahmad Bhat and Dipti Misra Sharma. 2012. A dependency treebank of Urdu and its evaluation. In Proceedings of the Sixth Linguistic Annotation Workshop, pages 157–165. Association for Computational Linguistics.

Rajesh Bhatt, Bhuvana Narasimhan, Martha Palmer, Owen Rambow, Dipti Misra Sharma, and Fei Xia. 2009. A multi-representational and multi-layered treebank for Hindi/Urdu. In Proceedings of the Third Linguistic Annotation Workshop, pages 186–189. Association for Computational Linguistics.

Jinho D. Choi, Claire Bonial, and Martha Palmer. 2010a. PropBank frameset annotation guidelines using a dedicated editor, Cornerstone. In LREC.

Jinho D. Choi, Claire Bonial, and Martha Palmer. 2010b. PropBank instance annotation guidelines using a dedicated editor, Jubilee. In LREC.

David Dowty. 1991. Thematic proto-roles and argument selection. Language, 67(3):547–619.

Ted Dunning. 1994. Statistical identification of language. Computing Research Laboratory, New Mexico State University.

Heba Elfardy and Mona T. Diab. 2012. Token level identification of linguistic code switching. In COLING (Posters), pages 287–296.

Marcello Federico, Nicola Bertoldi, and Mauro Cettolo. 2008. IRSTLM: an open source toolkit for handling large scale language models. In Interspeech, pages 1618–1621.

Jena D. Hwang, Archna Bhatia, Clare Bonial, Aous Mansouri, Ashwini Vaidya, Nianwen Xue, and Martha Palmer. 2010. PropBank annotation of multilingual light verb constructions. In Proceedings of the Fourth Linguistic Annotation Workshop, pages 82–90. Association for Computational Linguistics.

Ben King and Steven P. Abney. 2013. Labeling the languages of words in mixed-language documents using weakly supervised methods. In HLT-NAACL, pages 1110–1119.

Paul Kingsbury and Martha Palmer. 2002. From TreeBank to PropBank. In LREC.

Marco Lui, Jey Han Lau, and Timothy Baldwin. 2014. Automatic detection and language identification of multilingual documents. Transactions of the Association for Computational Linguistics, 2:27–40.

Muhammad Kamran Malik, Tafseer Ahmed, Sebastian Sulger, Tina Bögel, Atif Gulzar, Ghulam Raza, Sarmad Hussain, and Miriam Butt. 2010. Transliterating Urdu for a broad-coverage Urdu/Hindi LFG grammar. In LREC.

Colin P. Masica. 1993. The Indo-Aryan Languages. Cambridge University Press.

Dong Nguyen and A. Seza Dogruoz. 2014. Word level language identification in online multilingual communication. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.

Martha Palmer, Olga Babko-Malaya, Ann Bies, Mona T. Diab, Mohamed Maamouri, Aous Mansouri, and Wajdi Zaghouani. 2008. A pilot Arabic PropBank. In LREC.

Mohammad Sadegh Rasooli, Amirsaeid Moloodi, Manouchehr Kouhestani, and Behrouz Minaei-Bidgoli. 2011. A syntactic valency lexicon for Persian verbs: The first steps towards Persian dependency treebank. In 5th Language & Technology Conference (LTC): Human Language Technologies as a Challenge for Computer Science and Linguistics, pages 227–231.

Mohammad Sadegh Rasooli, Manouchehr Kouhestani, and Amirsaeid Moloodi. 2013. Development of a Persian syntactic dependency treebank. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 306–314.

Ashwini Vaidya, Jinho D. Choi, Martha Palmer, and Bhuvana Narasimhan. 2011. Analysis of the Hindi proposition bank using dependency structure. In Proceedings of the 5th Linguistic Annotation Workshop, pages 21–29. Association for Computational Linguistics.

Ashwini Vaidya, Martha Palmer, and Bhuvana Narasimhan. 2013. Semantic roles for nominal predicates: Building a lexical resource. NAACL HLT 2013, 13:126.

Fei Xia, Owen Rambow, Rajesh Bhatt, Martha Palmer, and Dipti Misra Sharma. 2009. Towards a multi-representational treebank. In The 7th International Workshop on Treebanks and Linguistic Theories, Groningen, Netherlands.
