Animacy Acquisition Using Morphological Case

Riyaz Ahmad Bhat
LTRC, IIIT-Hyderabad, India
[email protected]

Dipti Misra Sharma
LTRC, IIIT-Hyderabad, India
[email protected]

International Joint Conference on Natural Language Processing, pages 64–72, Nagoya, Japan, 14-18 October 2013.

Abstract

Animacy is an inherent property of the entities that nominals refer to in the physical world. This semantic property of a nominal has received much attention in both linguistics and computational linguistics. In this paper, we present a robust unsupervised technique to infer the animacy of nominals in languages with rich morphological case. The intuition behind our method is that the control/agency of a noun, as depicted by case marking, can approximate its animacy. Higher control over an action implies higher animacy. Our experiments on Hindi show promising results, with Fβ and Purity scores of 89 and 86 respectively.

1 Introduction

Animacy can be defined either as a biological property or as a grammatical category of nouns. In a strictly biological sense, living entities are animate, while all non-living entities are inanimate. In its linguistic sense, however, the term is synonymous with a referent's ability to act or instigate events volitionally (Kittilä et al., 2011). Although seemingly different, linguistic animacy can be implied from biological animacy. In linguistics, the manifestation of animacy and its relevance to linguistic phenomena have been studied quite extensively. Animacy has been shown, cross-linguistically, to control a number of linguistic phenomena. Case marking, argument realization, topicality and discourse salience are some of the phenomena highly correlated with the property of animacy (Aissen, 2003; Bresnan et al., 2007; De Swart et al., 2008; Branigan et al., 2008). In linguistic theory, however, animacy is not seen as a dichotomous variable, but rather as a range capturing finer distinctions of linguistic relevance. The animacy hierarchy proposed in Silverstein's influential article (Silverstein, 1986) ranks nominals on the following scale: 1st pers > 2nd pers > 3rd anim > 3rd inanim. Several such hierarchies of animacy have been proposed following Silverstein (1986). One basic scale, taken from Aissen (2003), makes a three-way distinction: humans > animates > inanimates. These hierarchies can be said to be based on the likelihood that the referent of a nominal acts as an agent in an event (Kittilä et al., 2011). Thus, the higher a nominal is on these hierarchies, the higher the degree of agency/control it has over an action. In morphologically rich languages, the degree of agency/control is expressed by case marking. Case markers capture the degree of control a nominal has in a given context (Hopper and Thompson, 1980; Butt, 2006). They rank nominals on the continuum of control as shown in (1)¹. Nominals marked with Ergative case have the highest control and those marked with Locative have the lowest.

Erg > Gen > Inst > Dat > Acc > Loc    (1)

¹ Ergative, Genitive, Instrumental, Dative, Accusative and Locative in the given order.

In this work, we demonstrate that the correlation between the aforementioned linguistic phenomena is highly systematic and can therefore be exploited to predict the animacy of nominals. To utilize this correlation for animacy prediction, we choose an unsupervised learning method, since a supervised learning technique is not always feasible: the resources required to train supervised algorithms are expensive to create and unlikely to exist for the majority of languages. We show that an unsupervised learning method can achieve results comparable to supervised learning in our setting (see Section 5). Further, based on our case study of Hindi, we propose that, given the morphological case corresponding to Scale (1), animacy can be predicted with high precision. Thus, given the morphological case, our approach should be portable to any language. In the context of Indian languages in particular, our approach should be easily extendable. In many Indo-Aryan languages², the grammatical cases listed on Scale (1) are, in fact, morphologically realized (Masica, 1993, p. 230; Butt and Ahmed, 2011).

² Indo-Aryan is a major language family in India.

In what follows, we first present the related work on animacy acquisition in Section 2. In Section 3, we describe our approach for acquiring animacy in Hindi using the case markers listed in (2). Section 3.1 describes the data used in our experiments, followed by a discussion of feature extraction and normalization. In Section 4, we discuss the extraction of data sets from Hindi Wordnet for the evaluation of our experimental results. In Section 5, we describe the results with a thorough error analysis, and we conclude the paper with some future directions in Section 6.

2 Related Work

In NLP, the role of animacy has only recently been recognized. It provides important information for, to mention a few tasks, anaphora resolution (Evans and Orasan, 2000), argument disambiguation (Dell'Orletta et al., 2005), syntactic parsing (Øvrelid and Nivre, 2007; Bharati et al., 2008) and verb classification (Merlo and Stevenson, 2001). Lexical resources like wordnet usually feature the animacy of nominals of a given language (Fellbaum, 2010; Narayan et al., 2002). However, using wordnet as a source for animacy is not straightforward and has its own challenges (Orasan and Evans, 2001; Orasan and Evans, 2007). Also, only a few privileged languages have such lexical resources available. Because resources that could provide animacy information are unavailable for most languages, there have been some notable efforts in the last few years to acquire animacy automatically. The important works in this direction are (Øvrelid, 2006) and (Øvrelid, 2009). These works focus on Swedish and Norwegian common nouns, using distributional patterns regarding their general syntactic and morphological properties. Other works in this direction are (Bowman and Chopra, 2012) for English and (Baker and Brew, 2010) for English and Japanese. All these works use supervised learning methods on a manually labeled data set. They use rich linguistic features (e.g., grammatical relations) extracted using syntactic parsers and anaphora resolution systems. The major drawback of these approaches is that they cannot be extended to resource-poor languages, because these languages cannot satisfy their prerequisites. Both the need for manually annotated training data and the features used restrict their portability to resource-poor languages. Our approach, on the other hand, is based on unsupervised learning from a raw corpus using a small set of case markers. Therefore, it can be extended to any language in which the grammatical cases listed on Scale (1) are morphologically realized.

3 Our Approach

As noted by Comrie (1989, p. 62), a nominal can have varying degrees of control in varying contexts irrespective of its animacy. The noun phrase "the man", for example, is always high in animacy, but it may vary in degree of control. It has high control in "the man deliberately hit me" and minimal control in "I hit the man". In morphologically rich languages, case markers capture the varying control a nominal has in different contexts. In Hindi, for example, a nominal in contexts of high control occurs with a case marker listed high on hierarchy (1) (e.g., ergative), while in contexts of low control it is marked with a case marker low on (1) (e.g., locative). Because of the varying degrees of control a nominal can have across contexts, approximating animacy from control in any single context would be misleading. Therefore, we generalize the animacy of a nominal from its overall distribution in the corpus. The question now is how to generalize animacy from the mixed behavior that a nominal displays in a corpus. The linguistic notion of markedness addresses this problem. An unmarked observation, in linguistics, is one that is more frequent, natural, and predictable than a marked observation (Croft, 2002). Although a given nominal can have varying degrees of control in different contexts irrespective of its animacy, its unmarked behavior should correlate well with its literal animacy, i.e., animates should more frequently be used in contexts of high control, while inanimates should be used in contexts of low control. A high degree of animacy necessarily implies a high degree of control. So the prototypical use of animates is in contexts of high control and that of inanimates is in contexts of low control. As the discussion suggests, animates should occur more frequently with the case markers towards the left of Scale (1), while inanimates should occur more frequently with the ones towards the right of the Scale. Thus, animates should have a distribution skewed toward the left of Scale (1), while inanimates should have a distribution skewed toward the right.

In this work, we have exploited the systematic correlations between the linguistic phenomena, as

[...] using a list of stop words. Since Hindi nouns decline for number, gender and case, we use a Hindi morph-analyzer, built in-house, to generate lemmas of inflected word forms so that their distributions can be accumulated under their corresponding lemmas. Further, the distributional counts of each nominal are scaled to unity so as to guard against the bias of word frequencies in our clustering experiments. Consider a distribution of two nominals A and B with case markers X and Y. Say A occurs 900 times with X and 100 times with Y, and B occurs 18 times with X and 2 times with Y. Although these nominals seem to have different distributions in terms of raw counts, both are similarly skewed and have the same relative frequency of occurrence with X and Y.
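The scaling step in the A/B example can be sketched in a few lines of Python. This is only an illustration under the assumption that "scaled to unity" means dividing each count by the nominal's total so that the values sum to one; the function name `scale_to_unity` and the dictionary layout are hypothetical and not taken from the paper.

```python
def scale_to_unity(counts):
    """Convert raw case-marker counts for one nominal into relative frequencies.

    Assumption: "scaled to unity" is read here as sum-to-one normalization.
    `counts` maps a case marker (e.g. "X") to its raw co-occurrence count.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("nominal has no case-marked occurrences")
    return {marker: n / total for marker, n in counts.items()}


# The two nominals from the example above: very different raw frequencies.
nominal_a = scale_to_unity({"X": 900, "Y": 100})
nominal_b = scale_to_unity({"X": 18, "Y": 2})

# After scaling, both nominals have the same profile over X and Y,
# so a clustering step would not separate them on frequency alone.
for marker in ("X", "Y"):
    print(marker, round(nominal_a[marker], 3), round(nominal_b[marker], 3))
```

With this normalization, a frequent and a rare nominal that share the same case-marking preferences receive the same feature vector, which appears to be the point of the example.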