<<

Acquisition Using Morphological Case

Riyaz Ahmad Bhat Dipti Misra Sharma LTRC, IIIT-Hyderabad, India LTRC, IIIT-Hyderabad, India [email protected] [email protected]

Abstract ever, animacy is not seen as a dichotomous vari- able, rather a range capturing finer distinctions of Animacy is an inherent property of en- linguistic relevance. Animacy hierarchy proposed tities that nominals refer to in the phys- in Silverstein’s influential on “animacy hi- ical world. This semantic property of a erarchy” (Silverstein, 1986) ranks nominals on a has received much attention in scale of the following gradience: 1st pers > 2nd pers both and computational linguis- > 3rd anim > 3rd inanim. Several such hierarchies tics. In this paper, present a robust of animacy have been proposed following (Silver- unsupervised technique to infer the ani- stein, 1986). basic scale taken from (Aissen, macy of nominals in with rich 2003) makes a three-way distinction as humans > morphological case. The intuition behind animates > inanimates. These hierarchies can be said our method is that the control/agency of to be based on the likelihood of a referent of a a depicted by case marking can ap- nominal to act as an in an event (Kittila¨ et proximate its animacy. A higher control al., 2011). Thus higher a nominal on these hier- over an action implies higher animacy. Our archies higher the degree of agency/control has experiments on show promising re- over an action. In morphologically rich languages, sults with Fβ and P urity scores of 89 and the degree of agency/control is expressed by case 86 respectively. marking. Case markers capture the degree of con- trol a nominal has in a given context (Hopper and 1 Introduction Thompson, 1980; Butt, 2006). rank nomi- 1 Animacy can either be defined as a biological nals on the continuum of control as shown in (1) . property or a of . In Nominals marked with have highest a strictly biological sense, living entities are ani- control and the ones marked with Locative have mate, while all non living entities are inanimate. lowest. However, in its linguistic sense, the term is syn- Erg > Gen > Inst > Dat > Acc > Loc (1) onymous with a referent’s ability to act or instigate events volitionally (Kittila¨ et al., 2011). Although In this work, we demonstrate that the correla- seemingly different, linguistic animacy can be im- tion between the aforementioned linguistic phe- plied from biological animacy. In linguistics, the nomena is highly systematic, therefore can be ex- manifestation of animacy and its relevance to lin- ploited to predict the animacy of nominals. In or- guistic phenomena have been studied quite exten- der to utilize the correlation between these phe- sively. Animacy has been shown, cross linguisti- nomena for animacy prediction, we choose to use cally, to control a number of linguistic phenomena. an unsupervised learning method. Since, using a Case marking, realization, topicality or supervised learning technique is not always fea- discourse salience are some phenomena highly sible. The resources required to train supervised correlated with the property of animacy (Aissen, algorithms are expensive to create and unlikely to 2003; Bresnan et al., 2007; De Swart et al., 2008; 1Ergative, Genitive, Instrumental, Dative, Accusative and Branigan et al., 2008). In linguistic theory, how- Locative in the given order.

64 International Joint Conference on Natural Processing, pages 64–72, Nagoya, Japan, 14-18 October 2013. exist for the majority of languages. We show that tributional patterns regarding their general syn- an unsupervised learning method can achieve re- tactic and morphological properties. Other works sults comparable to supervised learning in our set- in the direction are (Bowman and Chopra, 2012) ting (see Section 5). Further, based on our case for English and (Baker and Brew, 2010) for En- study of Hindi, we propose that given the mor- glish and Japanese. All these works use super- phological case corresponding to Scale (1), ani- vised learning methods on a manually labeled data macy can be predicted with high precision. Thus, set. These works use highly rich linguistic fea- given the morphological case our approach should tures (e.g., grammatical relations) extracted using be portable to any language. In the context of In- syntactic parsers and anaphora resolution systems. dian languages, in particular, our approach should The major drawback of these approaches is that be easily extendable. In many Indo-Aryan lan- they can not be extended to resource poor lan- guages2, the grammatical cases listed on Scale guages because these languages can not satisfy the (1) are, in fact, morphologically realized (Masica, prerequisites of these approaches. Not only the 1993, p. 230) (Butt and Ahmed, 2011). availability of manually annotated training data, In what follows, we first present the related but also the features used restrict their portabil- work on animacy acquisition in Section 2. In Sec- ity to resource poor languages. Our approach, on tion 3, we will describe our approach for acquiring the other hand, is based on unsupervised learning animacy in Hindi using case markers listed in (2). from raw corpus using a small set of case markers. Section 3.1 describes the data used in our exper- Therefore, it can be extended to any language with iments, followed by discussion on extrac- morphologically realized listed tion and normalization. In Section 4, we discuss on Scale (1). the extraction of data sets from Hindi Wordnet for the evaluation of results of our experiments. In 3 Our Approach Section 5, we describe the results with thorough error analysis and conclude the paper with some As noted by Comrie (1989, p. 62), a nominal can future directions in Section 6. have varying degrees of control in varying con- texts irrespective of its animacy. The noun 2 Related Work the man, for example, is always high in animacy, but it may vary in degree of control. It has high con- In NLP, the role of animacy has been re- trol in the man deliberately hit me and minimal con- cently realized. It provides important informa- trol in hit the man. In morphologically rich lan- tion, to mention a few, for anaphora resolution guages, case markers capture the varying control (Evans and Orasan, 2000), argument disambigua- a nominal has in different contexts. In Hindi, for tion (Dell’Orletta et al., 2005), syntactic pars- example, a nominal, in contexts of high control, ing (Øvrelid and Nivre, 2007), (Bharati et al., occurs with a case listed high on hierar- 2008) and classification (Merlo and Steven- chy (1) (e.g., ergative), while in contexts of low son, 2001). Lexical resources like wordnet usually control is marked with a case marker low on (1) feature animacy of nominals of a given language (e.g., locative). Because of the varying degrees (Fellbaum, 2010; Narayan et al., 2002). How- of control a nominal can have across contexts, ap- ever, using wordnet, as a source for animacy, is proximating animacy from control would be mis- not straightforward. It has its own challenges (Or- leading. Therefore, we generalize the animacy san and Evans, 2001; Orasan and Evans, 2007). of a nominal from its overall distributions in the Also, it’s only a few privileged languages that have corpora. Now the question is, how to general- such lexical resources available. Due to the un- ize the animacy from the mixed behavior that a availability of such resources that could provide nominal displays in a corpora? The linguistic no- animacy information, there have been some no- tion of addresses this problem. An table efforts in the last few years to automati- unmarked observation, in linguistics, means that cally acquire animacy. The important and worth it is more frequent, natural, and predictable than mentioning works in this direction are (Øvrelid, a marked observation (Croft, 2002). Although, a 2006) and (Øvrelid, 2009). The works on given nominal can have varying degrees of control Swedish and Norwegian common nouns using dis- in different contexts irrespective of its animacy, 2Indo-Aryan is a major language family in India. its unmarked behavior should correlate well with

65 its literal animacy, i.e., animates should more fre- using a list of stop . Since, Hindi nouns de- quently be used in contexts of high control while cline for number, gender and case, we use Hindi in-animates should be used in contexts of low con- morph-analyzer, built in-house, to generate lem- trol. A high degree of animacy necessarily implies mas of inflected forms so that their distribu- high degree of control. So the prototypical use of tions can be accumulated under their correspond- animates is in the contexts of high control and of ing lemmas. Further, the distributional of inanimates in the contexts of low control. As the each nominal are scaled to unity so as to guard discussion suggests, animates should occur more against the bias of word frequencies in our clus- frequently with the case markers towards the left tering experiments. Consider a distribution of two of the Scale (1), while inanimates should occur nominals A and B with case markers X and Y . more frequently with the ones towards the right Say A occurs 900 times with X and 100 times of the Scale. Thus, animates should have a left- with Y and B occurs 18 times with X and 2 times skewed distribution on Scale (1), while inanimates with Y . Although, these nominals seem to have should have a right-skewed distribution. different distributions, apart from being similarly In this work, we have exploited the systematic skewed, both of them have similar relative fre- correlations between the linguistic phenomena, as quency of occurrence with X and Y . We aim, discussed, to approximate animacy of Hindi nomi- therefore, to normalize the distributional counts of nals. Our methodology relies on the distributional a nominal with the case markers it occurs with. patterns of a nominal with case markers capturing The distributional counts are normalized to unity its degree of control. Distributions of each nomi- by the frequency of a given nominal in the cor- nal are extracted from a large corpus of Hindi and pora, as shown in (3). This ensures that only then they are clustered using fuzzy cmeans algo- the nominals of similar relative frequency distribu- rithm. Next, we discuss our choice of clustering, tions are clustered together. Beside, normalizing feature extraction and normalization. the distributions, we set a frequency threshold, for a nominal to be included for clustering to > 10, 3.1 Feature Extraction and Normalization which ensures its enough instances to unravel its In order to infer animacy of a nominal, we ex- unmarked or prototypical behavior. tracted its distributions with the case markers cor- xi responding to (1) except genitives3. Case markers x0 = (3) Pk of Hindi corresponding to (1) are listed in (2) (Mo- i=1 xi hanan, 1990, p. 72). x0 is the normalized dimensions in a feature vector of a nominal x. k is the number of coordinates and ne > kaa > se > ko > ko > {mem, par, tak, se, ko} (2) th xi is the i coordinate of x. Since ko and se are ambiguous, as shown in (2), 3.2 Soft Clustering we approximated them to the prototypical cases they are usually used for. ko is approximated to Animacy is an inherent and a non varying property dative while se is approximated to instrumental of entities that nominals refer to. However, due to case. The ambiguity in these case makers, how- lexical ambiguity animacy of a nominal can vary ever, has a profound impact on our results as dis- as the context varies. In Hindi, the ambiguity can cussed in Section 5. A mixed-domain corpora of be attributed to the following: 87 million words is used to ensure enough case • Personal Names: In Hindi, common nouns marked instances of a nominal. The extraction are frequently used as person names or as of distributional counts is simple and straightfor- a component of them. For example, noun ward in Hindi. Words immediately preceding case ‘baadal’ meaning ‘cloud(s)’ can also be used markers are considered as nouns since case mark- as a ‘person name’; similarly ‘vijay’ can ers almost always lie adjacent to the nominals they either mean ‘victory’ or can be a ‘person mark, however, occasionally they are separated by name’. emphatic particles like hi ‘only’. In such cases particles are removed to extract the distribution by • Metonymies: Metonymies or complex types 3Genitives are highly ambiguous in Hindi and hardly dis- (logical polysemy) like institute names, criminate animates from in-animates. country names etc, can refer to a building,

66 a geographical place or a group of individ- assign these nominals to a given number of clus- uals depending on the context of use. These ters such that each cluster contains all and only words are not ambiguous per se but show dif- those nominals that are members of the same class. ferent aspects of their in different Given the ground truth class labels, it is trivial contexts (logically polysemous). For exam- to determine how accurate the clustering results ple, India can either refer to a geographical are. This evaluation set is built using the Hindi place or its inhabitants. wordnet4 (Narayan et al., 2002), a lexical resource composed of synsets and semantic relations. An- These ambiguities imply that some nominals can imacy of a nominal is taken from concept ontolo- belong to both animate and inanimate classes. gies listed in the wordnet. We created two data In order to address this problem of mixed mem- sets using Hindi Wordnet: bership, we used soft clustering approach in this work. In with hard clustering meth- • SET-1: This set contains nominals that are ei- ods, in which a pattern belongs to a single cluster, ther animate or inanimate across senses listed soft clustering algorithms allow patterns to belong in the wordnet. For example, nominals like to all clusters with varying degrees of member- baalak ‘boy’ with all senses animate and ship. One of the most widely used soft clustering patthar ‘stone’ with all senses inanimate algorithms is the fuzzy c-means algorithm (hence- would fall under this set, whereas kuttaa forth FCM) (Bezdek et al., 1984). The FCM algo- ‘dog’ or ‘pawl’ with varying animacy across rithm attempts to partition a finite set of n objects senses would not qualify to be included in K = {k1, ..., kn} into a collection of c fuzzy clus- this set. The sense hierarchies corresponding ters with respect to some given criterion. Given to animate (dog) and inanimate (pawl) senses a finite set of data, the algorithm returns a list of c of noun kuttaa are represented in Figure 1. cluster centers C = {c1, ..., cc} and a partition ma- There are 6039 nominals in this set. It is used trix W = wi,j ∈ [0, 1], i = 1, ..., n, j = 1, ..., c, to evaluate the results and determine the ac- where each element wij tells the degree to which curacy of clustering. k c element i belongs to cluster j. Like the k-means • SET-2: In this set all the nominals listed in algorithm, the FCM aims to minimize an objective wordnet are extracted irrespective of their an- function, given as: imacy. There are around 7030 (SET-1+991) c n nominals in this set. It is used to evaluate the X X m Jm(U, β) = uikDik(xk, βi) (4) borderline cases with equal likelihood to fall i=1 k=1 in any cluster. where uik is the membership of the kth in the ith Noun Noun cluster; Animate Inanimate βi represents the ith cluster prototype; m ≥ 1 is the degree of fuzziness; Fauna Object c ≥ 2 is the number of cluster; n represents the number of data points; Mammal Artifact th Dik(xk, βi) is the Euclidean distance between k object and ith cluster center. Figure 1: Animate and Inanimate Senses of noun kuttaa

4 Evaluation It must be noted that only those nominals that satisfy the marked threshold of >10 are In this section, we discuss the extraction of evalu- considered, as discussed in Subsection 3.1. ation sets for the validation of the clustering re- sults. When a clustering solution has been ob- tained for a data set, it must also be presented in a 5 Experiments and Results manner which provides an overview of the content of each cluster. For that matter, we need an eval- In this section, we will discuss our clustering ex- uation set that can provide class labels for each periments followed by a thorough error analysis of nominal a priori. The clustering task is then to 4http://www.cfilt.iitb.ac.in/wordnet/webhwn/wn.php

67 Animate In-animate Precision Recall F-score Precision Recall F-score Baseline 66.99 44.45 53.44 97.82 39.78 56.56 SVM 78.90 77.70 78.24 95.15 95.43 95.29 Cmeans 57.8 89.18 70.0 97.3 85.65 91.15 Swedish 81.9 64.0 71.8 96.4 98.6 97.5 Table 1: Comparison of Results. the results achieved. In order to put our approach parameters c ‘number of clusters’ and m ‘degree 5 into perspective, we will first setup a baseline and of fuzziness’ set to 2. We used the Fβ and establish a supervised benchmark for the task. For purity to evaluate the accuracy of our clustering both the classification and clustering experiments results, which are two widely used external clus- (discussed shortly), we use SET-1. All the experi- tering evaluation metrics (Manning et al., 2008). ments are performed with the feature vectors rep- In order to evaluate the results, each nominal in resenting the behavior of the corresponding nomi- SET-1 is assigned to the cluster j for which its c nals towards the case system of Hindi. The results cluster membership wk (the degree of member- are listed in Table 1. ship of a nominal k to cluster j) is highest; i.e., c argmaxc∈C {w }. As shown in Table 2, the clus- 5.1 Baseline k tering solution by FCM has achieved Fβ and pu- As discussed in Section 3, animates should occur rity scores of 89 and 86. Further, cluster 1 roughly more frequently with the case markers to the left of corresponds to the Hindi wordnet inanimate class the Scale (2), while inanimates should occur more of nominals (86 recall) and cluster 2 corresponds frequently with the ones to the right of the Scale. to the Hindi wordnet animate class (89 recall). In Thus, we used the frequency of a nominal with the (Øvrelid, 2009) and (Bowman and Chopra, 2012) case markers on the edges of the Scale (2), i.e., ne animate nouns are reported as a difficult class to and mem, to set up the baseline. If a nominal oc- learn. The problem is attributed to the skewness curs more frequently with ne, it is considered as in the training data. Animate nouns occur less animate, whereas if it occurs more frequently with frequently than inanimate nouns. In our cluster- mem, it is considered as inanimate. As the Table ing experiments, however, animates have shown 1 shows, we could only achieve an average recall higher predictability than inanimates. We have of 42 by this approach. This implies that the inter- achieved a high recall on both animate as well as action of a nominal with the overall case system of inanimate nominals. Further, we infer animacy of a language, rather than an individual case marker, all types of nominals while (Øvrelid, 2009) and provides a better picture about its animacy. (Bowman and Chopra, 2012) have restricted the learning only for common noun lemmas. Further- 5.2 Supervised Classification more, our method also identifies ambiguous nom- For supervised classification, we used Support inals, as shown in Table 4. Although less feasible, Vector Machines (SVMs). To train and test the we also present the results produced by Øvrelid SVM classifier, we used the LIBSVM package (2009) (>10) in Table 1 for a rough comparison. (Chang and Lin, 2011). We performed a 5-fold cross validation with a random 80-20 split of SET- Cluster Animate In-animate Fβ Purity 1 for training and testing the classifier. The aver- 1 117 4246 91 97 age accuracies are reported in Table 1. Although, 2 965 711 70 58 the overall accuracy of supervised classification is Total 1082 4957 89 86 higher, it comes with a cost of manual annotation Table 2: Clustering Results on SET-1. of training data. As presented in Table 3, there are 828 instances 5.3 Clustering 5β is a coefficient of the relative strengths of precision and A clustering experiment is performed with FCM recall. We have set its value to 1, for all the results we have clustering algorithm on SET-1 and SET-2, with reported in this paper.

68 of wrong clustering. However, upon close inspec- their inhabitants, teams, governments. Word- tion the clustering of these instances seems theo- net only treats these terms as inanimates retically grounded, thus adding more weight to our (place). Australia, though treated as inani- results. We discuss these instances below: mate in Hindi Wordnet, is clustered with ani- mates in our experiments. 1. Personal Names: As discussed in Section 3.1, personal names are ambiguous and can 6. Machines: A few cases of machines are also be used as common nouns with generic ref- seen to be over generalized as animates. Ma- erence. Hindi Wordnet doesn’t enlist per- chines show an animate like control (directly sonal names (except for very popular names), or indirectly) over an action. though their common usages are listed. For 7. Nouns of Disability: As these expressions example the noun baadal ‘cloud’ is present refer to animates with some disability, they in wordnet while its use as personal name lack any control over an action and are is not listed. In the corpora used for the distributed like inanimates. An example extraction of distributions, around 325 such of this over generalization is noun ghaayal nouns are actually used as personal names. ‘wounded’. Although, these nouns are correctly clustered as animates, they are evaluated as instances 8. Others: These are actual instances of wrong of wrong clustering, because of the inani- clustering and as we noticed, these instances mate sense they have in the Hindi Wordnet. could probably be addressed by choosing an This addresses the problem of low precision optimal frequency threshold to capture the and low purity for animate nominals in our unmarked (prototypical) behavior of a nomi- experiments. Similarly, the names used for nal. We have not addressed the tuning of this gods, goddesses and spirits are also treated parameter in this work. However, we plan to as inanimates in Hindi Wordnet. However, take it up in future. corpus distributions project them as animates due to their high ability to instigate an action. Nominal Type Nominal An example case that was wrongly clustered Personal Name 325 is rab ‘God’. Lower Animate 104 Natural Force 67 2. Lower Animates: Although wordnet lists Psychological Nouns 74 these nominals as animates which in fact they Metonymies 86 are, they are linguistically seen as inanimates Machine 30 and thus are clustered as such. In our experi- Nouns of Disability 44 ments, titlii ‘butterfly’ is clustered with inan- Others 98 imates. Total 828 Table 3: Error Classification on SET-1 3. Natural Forces: These nominals have a high control over an action and their distributions In order to evaluate the ambiguous nominals are more like higher animates. bhuchaal that can have both animate and inanimate refer- ‘earthquake’ is an instance of this over gen- ences in different contexts, we use SET-2. The eralization. borderline cases i.e, the nominals whose cluster c 4. Psychological Nouns: Nouns like pare- membership score wk is ∼0.5 are evaluated shaanii ‘stress’ are conceptualized as a force against the ambiguous nominals listed in SET-2. affecting us psychologically. These nominals As shown in Table 4, from 991 ambiguous nomi- are thus distributed like nominals of high nals 535 are clustered with inanimates in Cluster control, which leads to an over generalization 1, while 439 cases are clustered with animates in of these nouns as animates. Cluster 2. The fact that these nominals posses both the animate and inanimate senses, clustering them 5. Metonymies: Nouns like country names, as in either of the class should not be considered discussed in Section 3.1, apart from refer- wrong. Although they have differing animacy ring to geographical places can also refer to as listed in Hindi Wordnet, probably they have

69 been used only in animate or inanimate sense in 1 1 the corpora used in our experiments. Table 4 also 0.8 0.8 shows that 187 nominals have a uniform distribu- 0.6 0.6 tion over the factors that discriminate animacy. 0.4 0.4 Frequency Frequency Among these 150 nouns are listed as inanimate 0.2 0.2 0 0 in Hindi Wordnet. Upon close inspection, these Erg Inst Dat Loc Erg Inst Dat Loc cases were found to be metonymies. As discussed Animate Prototype Inanimate Prototype earlier, Hindi Wordnet treats metonymies as Figure 2: Skewed Marked Distribution of Cluster Prototypes. inanimate, but in fact they are ambiguous. Thus our clustering of these nominals is justified. • Case Ambiguity or Case : For Cluster Animate In-animate Ambiguous an ideal performance, we expect a separate 1 107 4149 535 case marker for each individual case listed 2 955 658 439 on Scale 1. Unfortunately, case markers are c wk =∼ 0.5 20 150 17 usually ambiguous. A case marker can have Total 1082 4957 991 more than one case function in a language. Table 4: Clustering Results on SET-2 In our work on Hindi, we saw that case am- biguity does have an impact on the results. We could afford to exclude highly ambiguous In Section 3, we stated that the distributions of marker from our experiments nominals will be skewed on the control hierarchy. (Mohanan, 1990). However, how case am- The results have clearly indicated that such skew- biguity will impact the animacy prediction in ness does in fact exist in the data, as shown in other languages remains to be seen. Figure 2. The cluster prototypes, returned by the fuzzy clustering, show animates are left skewed • Nominal Ambiguity: As a matter of fact, an- while inanimates are right skewed on the hierar- imacy is an inherent and a non-varying prop- chy of control. However, in our clustering ex- erty of nominal referents. However, due to periments the order of dative/accusative and in- lexical ambiguity (particularly metonymy), strumental case markers on the control hierarchy animacy of a word form may vary across con- (Scale 1) has been swapped. The dative/accusative texts. We have addressed this problem by case is more biased towards animates while instru- capturing the mixed membership of such am- mental case shows the reverse tendency. The rea- biguous nominals. However, since animacy son for this is the ambiguity in these case mark- of a nominal is judged on the basis of its dis- ers. The se mark roles such as tribution, the animacy of an ambiguous nom- cause, instrument, source and material. Among inal will be biased towards the sense with which cause and instrument imply high control which it occurrs in a corpora. while source and material imply a low control over an action. Almost 82% of instances of instrumen- • Type of : Case marking may be tal case depict a non-causal role while only 18% realized in different ways depending on the show a causal relation as annotated in the Hindi morphological type of a language. In case of dependency treebank (Bhatt et al., 2009). Simi- inflectional and agglutinative languages, case larly, the dative/ ko is used for ex- markers, if present, are bound to a noun stem, periencer , direct and indirect objects (Mo- while in analytical languages they are free hanan, 1990, p. 72). Among these, only direct ob- usually lying adjacent to a nomi- jects realized by definite inanimates are ko marked nal they mark. Although, the way case mark- (Differential Object Marking), thus making it a ers are realized may not the animacy more probable case marker for animates. prediction directly, it may impact the extrac- Before concluding the paper, we will discuss tion of case marked distribution of nominals. some of the issues related to the portability of our Particularly, in case of agglutinative and in- approach to other languages with rich morpholog- flectional languages extracting the multiple ical case. We will briefly discuss these issues be- case marked word forms of a particular noun low: stem could be a challenging task.

70 6 Conclusion Rajesh Bhatt, Bhuvana Narasimhan, Martha Palmer, Owen Rambow, Dipti Misra Sharma, and Fei Xia. In this work we report a technique to exploit the 2009. A multi-representational and multi-layered systematic correspondences between different lin- treebank for hindi/. In Proceedings of the Third guistic phenomena to infer the important semantic Linguistic Annotation Workshop, pages 186–189. Association for Computational Linguistics. category of animacy. The case marked distribu- tions of nominals are clustered with fuzzy cmeans Pushpak Bhattacharyya. 2010. IndoWordNet. clustering into two clusters that approximate the Samuel R Bowman and Harshit Chopra. 2012. Au- binary dimensions of animacy. We achieved sat- tomatic animacy classification. In Proceedings of isfactory results on the binary distinction of nom- the 2012 Conference of the North American Chap- inals on animacy. A Fβ score of 89 and purity ter of the Association for Computational Linguistics: of 86 confirm efficiency of our approach. How- Human Language Technologies: Student Research Workshop, pages 7–10. Association for Computa- ever, the performance of our system can be further tional Linguistics. improved by incorporating features from a depen- dency parser and an anaphora resolution system, Holly P Branigan, Martin J Pickering, and Mikihiro as discussed in (Øvrelid, 2009). Tanaka. 2008. Contributions of animacy to gram- matical function assignment and during In view of the Indo-Wordnet project (Bhat- production. Lingua, 118(2):172–189. tacharyya, 2010) that aims to build wordnet for major Indian languages, our approach can be used Joan Bresnan, Anna Cueni, Tatiana Nikitina, R Harald Baayen, et al. 2007. Predicting the dative alterna- to predict animacy of nouns to leverage the cost tion. Cognitive foundations of interpretation, pages and time associated with manual creation of such 69–94. resources. Given the availability of large data on web for many Indian languages, our method can Miriam Butt and Tafseer Ahmed. 2011. The redevel- opment of Indo-Aryan case systems from a lexical predict this information with satisfactory results. semantic perspective. Morphology, pages 545–572. In the future, we also plan to explore the interac- tion between control and verb semantics, so as to Miriam Butt. 2006. The dative-ergative connection. Empirical issues in and semantics, pages 69– classify based on the amount of control re- 92. quired. This information can also be incorporated into the process of building Indo-wordnets. Chih-Chung Chang and Chih-Jen Lin. 2011. Lib- svm: a library for support vector machines. ACM Acknowledgments Transactions on Intelligent Systems and Technology (TIST), page 27. We would like to thank the anonymous review- Bernard Comrie. 1989. Language universals and lin- ers for their useful comments which helped to im- guistic typology: Syntax and morphology. Univer- prove this paper. We furthermore thank Sambhav sity of Chicago press. Jain for his help and useful feedback. William Croft. 2002. Typology and universals. Cam- bridge University Press.

References Peter De Swart, Monique Lamers, and Sander Judith Aissen. 2003. Differential object marking: Lestrade. 2008. Animacy, argument structure, and Iconicity vs. economy. Natural Language & Lin- argument encoding. Lingua, 118(2):131–140. guistic Theory, pages 435–483. Felice Dell’Orletta, Alessandro Lenci, Simonetta Mon- Kirk Baker and Chris Brew. 2010. Multilingual an- temagni, and Vito Pirrelli. 2005. Climbing the path imacy classification by sparse logistic regression. to grammar: A maximum entropy model of sub- Information Concerning OSDL OHIO STATE DIS- ject/object learning. In Proceedings of the Work- SERTATIONS IN LINGUISTICS, page 52. shop on Psychocomputational Models of Human Language Acquisition, pages 72–81. Association for James C Bezdek, Robert Ehrlich, and William Full. Computational Linguistics. 1984. FCM: The fuzzy c-means clustering algo- rithm. Computers & Geosciences, pages 191–203. Richard Evans and Constantin Orasan. 2000. Im- proving anaphora resolution by identifying ani- Akshar Bharati, Samar Husain, Bharat Ambati, Samb- mate entities in texts. In Proceedings of the Dis- hav Jain, Dipti Sharma, and Rajeev Sangal. 2008. course Anaphora and Reference Resolution Confer- Two semantic features make all the difference in ence (DAARC2000), pages 154–162. parsing accuracy. Proceedings of International Con- ference on Natural Language Processing (ICON08). Christiane Fellbaum. 2010. WordNet. Springer.

71 Paul J Hopper and Sandra A Thompson. 1980. Tran- sitivity in grammar and discourse. Language, pages 251–299. Seppo Kittila,¨ Katja Vasti,¨ and Jussi Ylikoski. 2011. Case, Animacy and Semantic Roles. John Ben- jamins Publishing. Christopher D Manning, Prabhakar Raghavan, and Hinrich Schutze.¨ 2008. Introduction to information retrieval. Cambridge University Press Cambridge.

Colin P Masica. 1993. The Indo-Aryan Languages. Cambridge University Press.

Paola Merlo and Suzanne Stevenson. 2001. Automatic verb classification based on statistical distributions of argument structure. Computational Linguistics, pages 373–408.

Tara Mohanan. 1990. Arguments in Hindi. Ph.D. the- sis, Stanford University. Dipak Narayan, Debasri Chakrabarty, Prabhakar Pande, and Pushpak Bhattacharyya. 2002. An ex- perience in building the indo wordnet-a wordnet for hindi. In First International Conference on Global WordNet, Mysore, India. Constantin Orasan and Richard Evans. 2007. Np ani- macy identification for anaphora resolution. J. Artif. Intell. Res.(JAIR), 29:79–103. Constantin Orsan and Richard Evans. 2001. Learning to identify animate references. In Proceedings of the 2001 workshop on Computational Natural Lan- guage Learning-Volume 7, page 16. Association for Computational Linguistics. Lilja Øvrelid and Joakim Nivre. 2007. When word order and part-of-speech tags are not enough – Swedish dependency parsing with rich linguistic features. In Proceedings of the International Con- ference on Recent Advances in Natural Language Processing (RANLP), pages 447–451.

Lilja Øvrelid. 2006. Towards robust animacy clas- sification using morphosyntactic distributional fea- tures. In Proceedings of the 2006 Conference of the European Chapter of the Association for Computa- tional Linguistics (EACL): Student Research Work- shop, pages 47–54. Lilja Øvrelid. 2009. Empirical evaluations of animacy annotation. In Proceedings of the 12th Conference of the European Chapter of the Association for Com- putational Linguistics (EACL). Michael Silverstein. 1986. Hierarchy of features and ergativity. Features and projections, pages 163–232.

72