Towards Robust Animacy Classification Using
Total Page:16
File Type:pdf, Size:1020Kb
Towards Robust Animacy Classification Using Morphosyntactic Distributional Features Lilja Øvrelid NLP-unit, Dept. of Swedish G¨oteborg University SE-40530 G¨oteborg, Sweden [email protected] Abstract ranging from coreference resolution to parsing and generation. This paper presents results from ex- In recent years a range of linguistic studies have periments in automatic classification of examined the influence of argument animacy in animacy for Norwegian nouns using grammatical phenomena such as differential ob- decision-tree classifiers. The method ject marking (Aissen, 2003), the passive construc- makes use of relative frequency measures tion (Dingare, 2001), the dative alternation (Bres- for linguistically motivated morphosyn- nan et al., 2005), etc. A variety of languages are tactic features extracted from an automati- sensitive to the dimension of animacy in the ex- cally annotated corpus of Norwegian. The pression and interpretation of core syntactic argu- classifiers are evaluated using leave-one- ments (Lee, 2002; Øvrelid, 2004). A key general- out training and testing and the initial re- isation or tendency observed there is that promi- sults are promising (approaching 90% ac- nent grammatical features tend to attract other curacy) for high frequency nouns, however prominent features;1 subjects, for instance, will deteriorate gradually as lower frequency tend to be animate and agentive, whereas objects nouns are classified. Experiments at- prototypically are inanimate and themes/patients. tempting to empirically locate a frequency Exceptions to this generalisation express a more threshold for the classification method in- marked structure, a property which has conse- dicate that a subset of the chosen mor- quences, for instance, for the distributional prop- phosyntactic features exhibit a notable re- erties of the structure in question. silience to data sparseness. Results will be Even though knowledge about the animacy of presented which show that the classifica- a noun clearly has some interesting implications, tion accuracy obtained for high frequency little work has been done within the field of lex- > nouns (with absolute frequencies 1000) ical acquisition in order to automatically acquire can be maintained for nouns with consid- such knowledge. Or˘asan and Evans (2001) make ∼ erably lower frequencies ( 50) by back- use of hyponym-relations taken from the Word Net ing off to a smaller set of features at clas- resource (Fellbaum, 1998) in order to classify ani- sification. mate referents. However, such a method is clearly restricted to languages for which large scale lexi- 1 Introduction cal resources, such as the Word Net, are available. Animacy is a an inherent property of the referents Merlo and Stevenson (2001) present a method for of nouns which has been claimed to figure as an verb classification which relies only on distribu- influencing factor in a range of different gram- tional statistics taken from corpora in order to train matical phenomena in various languages and it a decision tree classifier to distinguish between is correlated with central linguistic concepts such three groups of intransitive verbs. as agentivity and discourse salience. Knowledge 1The notion of prominence has been linked to several about the animacy of a noun is therefore rele- properties such as most likely as topic, agent, most available vant for several different kinds of NLP problems referent, etc. 47 This paper presents experiments in automatic tioned earlier, a prototypical transitive rela- classification of the animacy of unseen Norwe- tion involves an animate subject and an inan- gian common nouns, inspired by the method for imate object. In fact, a corpus study of an- verb classification presented in Merlo and Steven- imacy distribution in simple transitive sen- son (2001). The learning task is, for a given com- tences in Norwegian revealed that approxi- mon noun, to classify it as either belonging to the mately 70% of the subjects of these types class animate or inanimate. Based on correlations of sentences were animate, whereas as many between animacy and other linguistic dimensions, as 90% of the objects were inanimate (Øvre- a set of morphosyntactic features is presented and lid, 2004). Although this corpus study in- shown to differentiate common nouns along the volved all types of nominal arguments, in- binary dimension of animacy with promising re- cluding pronouns and proper nouns, it still sults. The method relies on aggregated relative fre- seems that the frequency with which a cer- quencies for common noun lemmas, hence might tain noun occurs as a subject or an object of be expected to seriously suffer from data sparse- a transitive verb might be an indicator of its ness. Experiments attempting to empirically lo- animacy. cate a frequency threshold for the classification method will therefore be presented. It turns out Demoted agent in passive Agentivity is another that a subset of the chosen morphosyntactic ap- related notion to that of animacy, animate be- proximators of animacy show a resilience to data ings are usually inherently sentient, capable sparseness which can be exploited in classifica- of acting volitionally and causing an event to tion. By backing off to this smaller set of features, take place - all properties of the prototypi- we show that we can maintain the same classifica- cal agent (Dowty, 1991). The passive con- tion accuracy also for lower frequency nouns. struction, or rather the property of being ex- The rest of the paper is structured as follows. pressed as the demoted agent in a passive Section 2 identifies and motivates the set of chosen construction, is a possible approximator of features for the classification task and describes agentivity. It is well known that transitive how these features are approximated through fea- constructions tend to passivize better (hence ture extraction from an automatically annotated more frequently) if the demoted subject bears corpus of Norwegian. In section 3, a group of ex- a prominent thematic role, preferably agent. periments testing the viability of the method and Anaphoric reference by personal pronoun chosen features is presented. Section 4 goes on to Anaphoric reference is a phenomenon where investigate the effect of sparse data on the clas- the animacy of a referent is clearly expressed. sification performance and present experiments The Norwegian personal pronouns distin- which address possible remedies for the sparse guish their antecedents along the animacy data problem. Section 5 sums up the main find- dimension - animate han/hun ‘he/she’ vs. ings of the previous sections and outlines a few inanimate den/det ‘it-MASC/NEUT’. suggestions for further research. Anaphoric reference by reflexive pronoun 2 Features of animacy Reflexive pronouns represent another form As mentioned above, animacy is highly correlated of anaphoric reference, and, may, in contrast with a number of other linguistic concepts, such to the personal pronouns locate their an- as transitivity, agentivity, topicality and discourse tecedent locally, i.e. within the same clause. salience. The expectation is that marked configu- In the prototypical reflexive construction rations along these dimensions, e.g. animate ob- the subject and the reflexive object are jects or inanimate agents, are less frequent in the coreferent and it describes an action directed data. However, these are complex notions to trans- at oneself. Although the reflexive pronoun in late into extractable features from a corpus. In Norwegian does not distinguish for animacy, the following we will present some morphological the agentive semantics of the construction and syntactic features which, in different ways, ap- might still favour an animate subject. proximate the multi-faceted property of animacy: Genitive -s There is no extensive case system for Transitive subject and (direct) object As men- common nouns in Norwegian and the only 48 distinction that is explicitly marked on the With regard to the property of anaphoric ref- noun is the genitive case by addition of -s. erence by personal pronouns, the extraction was The genitive construction typically describes bound to be a bit more difficult. The anaphoric possession, a relation which often involves an personal pronoun is never in the same clause as animate possessor. the antecedent, and often not even in the same sen- tence. Coreference resolution is a complex prob- 2.1 Feature extraction lem, and certainly not one that we shall attempt to In order to train a classifier to distinguish between solve in the present context. However, we might animate and inanimate nouns, training data con- attempt to come up with a metric that approxi- sisting of distributional statistics on the above fea- mates the coreference relation in a manner ade- tures were extracted from a corpus. For this end, quate for our purposes, that is, which captures the a 15 million word version of the Oslo Corpus, a different coreference relation for animate as op- corpus of Norwegian texts of approximately 18.5 posed to inanimate nouns. To this end, we make 2 million words, was employed. The corpus is mor- use of the common assumption that a personal pro- phosyntactically annotated and assigns an under- noun usually refers to a discourse salient element specified dependency-style analysis to each sen- which is fairly recent in the discourse. Now, if 3 tence. a sentence only contains one core argument (i.e. For each noun, relative frequencies for the dif- an intransitive subject) and it is followed