Towards an Annotated Corpus of Discourse Relations in Hindi
The 6th Workshop on Asian Language Resources, 2008

Rashmi Prasad*, Samar Husain†, Dipti Mishra Sharma† and Aravind Joshi*

* University of Pennsylvania, Philadelphia, PA, USA, {rjprasad,joshi}@seas.upenn.edu
† Language Technologies Research Centre, IIIT, Hyderabad, India, [email protected], [email protected]

Abstract

We describe our initial efforts towards developing a large-scale corpus of Hindi texts annotated with discourse relations. Adopting the lexically grounded approach of the Penn Discourse Treebank (PDTB), we present a preliminary analysis of discourse connectives in a small corpus. We describe how discourse connectives are represented in the sentence-level dependency annotation in Hindi, and discuss how the discourse annotation can enrich this level for research and applications. The ultimate goal of our work is to build a Hindi Discourse Relation Bank along the lines of the PDTB. Our work will also contribute to the cross-linguistic understanding of discourse connectives.

1 Introduction

An increasing interest in human language technologies such as textual summarization, question answering, and natural language generation has recently led to the development of several discourse annotation projects aimed at creating large-scale resources for natural language processing. One of these projects is the Penn Discourse Treebank (PDTB Group, 2006),¹ whose goal is to annotate the discourse relations holding between eventualities described in a text, for example causal and contrastive relations. The PDTB is unique in using a lexically grounded approach for annotation: discourse relations are anchored in lexical items (called “explicit discourse connectives”) whenever they are explicitly realized in the text. For example, in (1), the causal relation between ‘the federal government suspending US savings bonds sales’ and ‘Congress not lifting the ceiling on government debt’ is expressed with the explicit connective ‘because’.² The two arguments of each connective are also annotated, and the annotations of both connectives and their arguments are recorded in terms of their text span offsets.³

(1) The federal government suspended sales of U.S. savings bonds because Congress hasn’t lifted the ceiling on government debt.

One of the questions that arises is how the PDTB-style annotation can be carried over to languages other than English. It may prove to be a challenge cross-linguistically, as the guidelines and methodology appropriate for English may not apply as well or directly to other languages, especially when they differ greatly in syntax and morphology. To date, cross-linguistic investigations of connectives in this direction have been carried out for Chinese (Xue, 2005) and Turkish (Deniz and Webber, 2008). This paper explores discourse relation annotation in Hindi, a language with rich morphology and free word order. We describe our study of “explicit connectives” in a small corpus of Hindi texts, discussing them from two perspectives. First, we consider the type and distribution of Hindi connectives, proposing to annotate a wider range of connectives than the PDTB. Second, we consider how the connectives are represented in the Hindi sentence-level dependency annotation, in particular discussing how the discourse annotation can enrich the sentence-level structures. We also briefly discuss issues involved in aligning the discourse and sentence-level annotations.

Section 2 provides a brief description of Hindi word order and morphology. In Section 3, we present our study of the explicit connectives identified in our texts, discussing them in light of the PDTB. Section 4 describes how connectives are represented in the sentence-level dependency annotation in Hindi. Finally, Section 5 concludes with a summary and future work.

¹ http://www.seas.upenn.edu/~pdtb
² The PDTB also annotates implicit discourse relations, but only locally, between adjacent sentences. Annotation here consists of providing connectives (called “implicit discourse connectives”) to express the inferred relation. Implicit connectives are beyond the scope of this paper, but will be taken up in future work.
³ The PDTB also records the senses of the connectives, and each connective and its arguments are also marked for their attribution. Sense annotation and attribution annotation are not discussed in this paper. We will, of course, pursue these aspects in our future work concerning the building of a Hindi Discourse Relation Bank.

2 Brief Overview of Hindi Syntax and Morphology

Hindi is a free word order language with SOV as the default order. This can be seen in (2), where (2a) shows the constituents in the default order, and the remaining examples show some of the word order variants of (2a).

(2) a. malaya nao samaIr kao iktaba dI.
       malay ERG sameer DAT book gave
       “Malay gave the book to Sameer” (S-IO-DO-V)⁴
    b. malaya nao iktaba samaIr kao dI. (S-DO-IO-V)
    c. samaIr kao malaya nao iktaba dI. (IO-S-DO-V)
    d. samaIr kao iktaba malaya nao dI. (IO-DO-S-V)
    e. iktaba malaya nao samaIr kao dI. (DO-S-IO-V)
    f. iktaba samaIr kao malaya nao dI. (DO-IO-S-V)

Hindi also has a rich case marking system, although case marking is not obligatory. For example, in (2), while the subject and indirect object are explicitly marked for the ergative (ERG) and dative (DAT) cases, the direct object is unmarked for the accusative.

⁴ S=Subject; IO=Indirect Object; DO=Direct Object; V=Verb; ERG=Ergative; DAT=Dative

3 Discourse Connectives in Hindi

Given the lexically grounded approach adopted for discourse annotation, the first question that arises is how to identify discourse connectives in Hindi. Unlike the case of the English connectives in the PDTB, there are no resources that alone or together provide an exhaustive list of connectives in the language. We did try to create a list from our own knowledge of the language and grammar, and also by translating the list of English connectives in the PDTB. However, when we started looking at real data, this list proved to be incomplete. For example, we discovered that the form of the complementizer ‘ik’ also functions as a temporal subordinator, as in (3).

(3) [ vah baalaTI ko gaMdo panaI sao ApnaI caaOklaoT
    [he bucket of dirty water from his chocolates
    inakalanao hI vaalaa qaa] ik {]sakI mammaI nao
    taking-out just doing was] that {his mother ERG
    ]sao raok idyaa }
    him stop did}
    “He was just going to take out the chocolates from the dirty water in the bucket when his mother stopped him.”

The method of collecting connectives will therefore necessarily involve “discovery during annotation”. However, we wanted to get some initial ideas about what kinds of connectives were likely to occur in real text, and to this end, we looked at 9 short stories with approximately 8000 words. Our goal here is to develop an initial set of guidelines for annotation, which will be done on the same corpus on which the sentence-level dependency annotation is being carried out (see Section 4). Table 1 provides the full set of connectives we found in our texts, grouped by syntactic type. The first four columns give the syntactic grouping, the Hindi connective expressions, the English gloss, and the English equivalent expressions, respectively. The last column gives the number of occurrences we found of each expression.

In the rest of this section, we describe the function and distribution of discourse connectives in Hindi based on our texts. In the discussion, we have noted our points of departure from the PDTB where applicable, both with respect to the types of relations being annotated and with respect to terminology. For argument naming, we use the PDTB convention: the clause with which the connective is syntactically associated is called Arg2 and the other clause is called Arg1. Two special conventions are followed for paired connectives, which we describe below. In all Hindi examples in this paper, Arg1 is enclosed in square brackets and Arg2 is in braces.

Connective Type       Hindi                       Gloss                          English           Num
Sub. Conj.            @yaaoMik                    why-that                       because             2
                      (@yaaoM)ik..[salaIe         (why)-that..this-for           because             3
                      (Agar|yadI)..tba|tao        (if)..then                     if..(then)         15
                      (jaba).. tba|tao            (when)..then                   when               50
                      jaba tk.. tba tk (ko ilae)  when till..then till (of for)  until               2
                      jaOsao hI..(tao)            as just..(then)                as soon as          5
                      [tnaa|eosaa..kI             so|such..that                  so that            12
                      taik                        so-that                        so that             1
                      ik                          that                           when                5
Sentential Relatives  ijasasao                    which-with                     because of which    5
                      jaao                        which                          because of which    1
                      ijasako karNa               which-of reason                because of which    1
Subordinator          pr                          upon                           upon                9
                      (-kr|-ko|krko)              (do)                           after|while       111
                      samaya                      time                           while               1
                      hue                         happening                      while              28
                      ko baad                     of later                       after               3
                      sao                         with                           due to              1
                      ko phlao                    of before                      before              1
                      ko ilae                     of for                         in order to         4
                      maoM                        in                             while               1
                      ko karNa                    of reason                      because of          3
Coord. Conj.          laoikna|pr|prntu            but                            but                51
                      AaOr|tqaa                   and                            and               117
                      yaa                         or                             or                  2
                      yaaoM tao..pr               such TOP..but                  but                 2
                      naa kovala..balaik          not only..but                  not only..but       1
Adverbial             tba                         then                           then                2
                      baad maoM                   later in                       later               5
                      ifr                         then                           then                4
                      [saIilae                    this-for                       that is why         7
                      nahIM tao                   not then                       otherwise           5
                      tBaI tao                    then-only TOP                  that is why         1
                      saao                        so                             so                 10
                      vahI|yahI nahIM             that|this-only not             not only that       1
TOTAL                                                                                              472

Table 1: A Partial List of Discourse Connectives in Hindi. Parentheses are used for optional elements; “|” is used for alternating elements; TOP = topic marker.
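The argument-marking convention used in the examples (Arg1 in square brackets, Arg2 in braces), combined with the PDTB practice of recording annotations as text span offsets, can be illustrated with a minimal Python sketch. The class and function names, the choice of character offsets, and the bracketing of example (1) are illustrative assumptions; none of this is part of the PDTB tooling or of the annotation scheme itself.

```python
from dataclasses import dataclass


@dataclass
class Span:
    start: int  # inclusive character offset in the unmarked text
    end: int    # exclusive character offset in the unmarked text
    text: str


def parse_marked(marked):
    """Strip [Arg1] and {Arg2} markers; return the plain text and argument spans.

    Hypothetical helper: assumes one non-nested pair of each marker and that
    the text itself contains no literal brackets or braces.
    """
    openers = {"[": "Arg1", "{": "Arg2"}
    closers = {"]": "Arg1", "}": "Arg2"}
    plain, open_at, bounds = [], {}, {}
    for ch in marked:
        if ch in openers:
            # Remember where this argument starts in the unmarked text.
            open_at[openers[ch]] = len(plain)
        elif ch in closers:
            name = closers[ch]
            bounds[name] = (open_at.pop(name), len(plain))
        else:
            plain.append(ch)
    text = "".join(plain)
    spans = {name: Span(s, e, text[s:e]) for name, (s, e) in bounds.items()}
    return text, spans


def find_connective(text, connective):
    """Locate the first occurrence of the connective (ValueError if absent)."""
    start = text.index(connective)
    return Span(start, start + len(connective), connective)


# Example (1), bracketed here by applying the stated convention (an assumption).
marked = ("[The federal government suspended sales of U.S. savings bonds] "
          "because {Congress hasn't lifted the ceiling on government debt}.")
text, spans = parse_marked(marked)
conn = find_connective(text, "because")

for name in ("Arg1", "Arg2"):
    span = spans[name]
    print(name, span.start, span.end, repr(span.text))
print("Conn", conn.start, conn.end, repr(conn.text))
```

This sketch would not work directly on the Hindi examples as printed here, because some of the Hindi word forms themselves contain "[" or "]" characters (for instance in example (3) and in Table 1); those strings would first need to be separated from the argument markers.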
3.1 Types of Discourse Connectives

3.1.1 Subordinating Conjunctions

Finite adverbial subordinate clauses are introduced by independent lexical items called “subordinating conjunctions”, such as @yaaoMik (‘because’), as in (4), and they typically occur as right or left attached to the main clause.

(4) ... DAT give would], why-that {he-EMPH all
    QartI kI sampda ka svaamaI hO}
    earth of wealth of lord is}
    “I would give all this wealth to the king, because he alone is the lord of this whole world’s wealth.”

As the first group in Table 1 shows,