
Multi-class Animacy Classification with Semantic Features

Johannes Bjerva
Center for Language and Cognition Groningen
University of Groningen, The Netherlands
[email protected]

Abstract

Animacy is the semantic property of nouns denoting whether an entity can act, or is perceived as acting, of its own will. This property is marked grammatically in various languages, albeit rarely in English. Animacy has recently been highlighted as a relevant property for NLP applications such as parsing and anaphora resolution. In order for animacy to be used in conjunction with other semantic features for such applications, appropriate data is necessary. However, the few corpora which do contain animacy annotation rarely contain much other semantic information. The addition of such an annotation layer to a corpus already containing deep semantic annotation should therefore be of particular interest. The work presented in this paper contains three main contributions. Firstly, we improve upon the state of the art in multi-class animacy classification. Secondly, we use this classifier to contribute to the annotation of an openly available corpus containing deep semantic annotation. Finally, we provide source code, as well as trained models and scripts needed to reproduce the results presented in this paper, or to aid in annotation of other texts.[1]

1 Introduction

Animacy is the semantic property of nouns denoting whether, or to what extent, the referent of that noun is alive, human-like or even cognitively sophisticated. Several ways of characterising the animacy of such referents have been proposed in the literature, the most basic distinction being between animate and inanimate entities. In such a binary scheme, examples of animate nouns might include author and dog, while examples of inanimate nouns might include table and rock. More elaborate schemes tend to represent a hierarchy or continuum, typically ranging from HUMAN → NON-HUMAN → INANIMATE (cf. Comrie (1989)), with other categories in between.

In various languages, animacy affects linguistic phenomena such as case marking and argument realization. Furthermore, hierarchical restrictions are often imposed by animacy, e.g. with subjects tending to be higher in an animacy hierarchy than objects (Dahl and Fraurud, 1996). Even though animacy is rarely overtly marked in English, it still influences the choice of certain grammatical structures, such as the choice of relative pronoun (e.g. who vs. which).

The aims of this work are as follows: (i) to improve upon the state of the art in multi-class animacy classification by comparing and evaluating different classifiers and features for this task, (ii) to investigate whether a corpus of spoken language containing animacy annotation can be used as a basis to annotate animacy in a corpus of written language, and (iii) to use the resulting classifier as part of the toolchain used to annotate a corpus containing deep semantic annotation.

The remainder of this paper is organized as follows: In Section 2 we go through the relevance of animacy for Natural Language Processing (NLP) and describe some corpora which contain animacy annotation. Previous attempts and approaches to animacy classification are portrayed in Section 3. Section 4 contains an overview of the data used in this study, as well as details regarding the manual annotation of animacy carried out as part of this work. The methods employed and the results obtained are presented in Sections 5 and 6. The discussion is given in Section 7. Finally, Section 8 contains conclusions and some suggestions for future work in multi-class animacy classification.

[1] https://github.com/bjerva/animacy

2 Background

2.1 Relevance of animacy for NLP

Although seemingly overlooked in the past, animacy has recently been shown to be an important property for NLP. Øvrelid & Nivre (2007) found that the accuracy of a dependency parser for Swedish could be improved by incorporating a binary animacy distinction. Other work has highlighted animacy as relevant for anaphora and coreference resolution (Orăsan and Evans, 2007; Lee et al., 2013) and for argument disambiguation (Dell'Orletta et al., 2005).

Furthermore, in English, the choices for dative alternation (Bresnan et al., 2007), between genitive constructions (Stefanowitsch, 2003), and between active and passive voice (Rosenbach, 2008) are also affected by the animacy of their constituent nouns. With this in mind, Zaenen et al. (2004) suggest that animacy, for languages such as English, is not a matter of grammatical and ungrammatical sentences, but rather of sentences being more and less felicitous. This highlights annotation of animacy as potentially particularly useful for applications such as Natural Language Generation.

In spite of this, animacy appears to be rarely annotated in corpora, and thus also rather rarely used in tools and algorithms for NLP (although some recent efforts do exist, cf. Moore et al. (2013)). Furthermore, the few corpora that do include animacy in their annotation do not contain much other semantic annotation, making them less interesting for computational semanticists.

2.2 Annotation of animacy

Resources annotated with animacy are few and far between. One such resource is the MC160 dataset, which has recently been labelled for binary animacy (Moore et al., 2013). The distinction between animate and inanimate was based on whether or not an entity could "move under its own will". Although interesting, the size of this data set (approximately 8,000 annotated nouns) limits its usefulness, particularly with the methods used in this paper.

Talbanken05 is a corpus of Swedish spoken language which includes a type of animacy annotation (Nivre et al., 2006). However, this annotation is better described as a distinction between human and non-human than as one between animate and inanimate (Øvrelid, 2009). Although the work in this paper focusses on English, a potential application of this corpus is discussed at the end of this paper (see Section 8).

The NXT Switchboard corpus represents a larger and more interesting resource for our purposes (Calhoun et al., 2010). This spoken language corpus contains high quality manual annotation of animacy for nearly 200,000 noun phrases (Zaenen et al., 2004). Furthermore, the annotation is fairly fine-grained, as a total of ten animacy categories are used (see Table 1), with a few additional tags for mixed animacy and for cases in which annotators were uncertain. This scheme can be arranged hierarchically, so that the classes Concrete, Non-concrete, Place and Time are grouped as inanimate, while the remaining classes are grouped as animate. The availability of this data allows us to easily exploit the annotation for a supervised learning approach (see Section 5).

3 Related work

In this section we give an overview of previous work in animacy classification, some of which has inspired the approach presented in this paper.

3.1 Exploiting corpus frequencies

A binary animacy classifier which uses syntactic and morphological features has previously been developed for Norwegian and Swedish (Øvrelid, 2005; Øvrelid, 2006; Øvrelid, 2009). The features used are based on frequency counts from the dependency-parsed Talbanken05 corpus. These frequencies are counted per noun lemma, meaning that this classifier is not context sensitive. In other words, cases of e.g. polysemy, where head is inanimate in the sense of human head but animate in the sense of head of an organization, are likely to be problematic. Intuitively, by taking context or semantically motivated features into account, such cases ought to be resolved quite trivially.

This classifier performs well, as it reaches an accuracy of 96.8% for nouns, compared to a baseline of 90.5% when always picking the most common class (Øvrelid, 2009). Furthermore, it has been shown that including the binary distinction from this classifier as a feature in dependency parsing can significantly improve its labelled attachment score (Øvrelid and Nivre, 2007).

A more language-specific system for animacy classification has also been developed for Japanese (Baker and Brew, 2010).
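The type-based nature of such frequency-count approaches can be made concrete with a small sketch (ours; not a re-implementation of the cited systems). Per-lemma relative frequencies of syntactic functions are computed once from a parsed corpus, here represented as hypothetical (lemma, function) pairs, so that a polysemous lemma such as head receives a single feature vector regardless of context:

    from collections import Counter, defaultdict

    def type_based_features(parsed_corpus):
        """Per-lemma relative frequencies of syntactic functions.

        parsed_corpus: iterable of (lemma, function) pairs, e.g. drawn
        from a dependency-parsed treebank (hypothetical input format).
        """
        counts = defaultdict(Counter)
        for lemma, function in parsed_corpus:
            counts[lemma][function] += 1
        return {lemma: {f: n / sum(c.values()) for f, n in c.items()}
                for lemma, c in counts.items()}

    # One feature vector per lemma, independent of context:
    corpus = [('head', 'SUBJ'), ('head', 'OBJ'), ('head', 'SUBJ')]
    print(type_based_features(corpus)['head'])
    # {'SUBJ': 0.666..., 'OBJ': 0.333...}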

Table 1: Overview of the animacy tag set from Zaenen et al. (2004) with examples from the GMB.

Tag  Description   Examples
HUM  Human         Mr. Calderon said Mexico has become a worldwide leader ...
ORG  Organization  Mr. Calderon said Mexico has become a worldwide leader ...
ANI  Animal        There are only about 1,600 pandas still living in the wild in China.
LOC  Place         There are only about 1,600 pandas still living in the wild in China.
NCN  Non-concrete  There are only about 1,600 pandas still living in the wild in China.
CNC  Concrete      The wind blew so much dust around the field today.
TIM  Time          The wind blew so much dust around the field today.
MAC  Machine       The astronauts attached the robot, called Dextre, to the ...
VEH  Vehicle       Troops fired on the two civilians riding a motorcycle ...
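As noted in Section 2.2, this tag set can be arranged hierarchically into a binary animate/inanimate distinction. A minimal sketch of that grouping (our illustration, following the grouping described in Section 2.2):

    # CNC, NCN, LOC and TIM are grouped as inanimate (Section 2.2);
    # the remaining classes from Table 1 are grouped as animate.
    INANIMATE = {'CNC', 'NCN', 'LOC', 'TIM'}
    ANIMATE = {'HUM', 'ORG', 'ANI', 'MAC', 'VEH'}

    def to_binary(tag):
        """Collapse a ten-class animacy tag into a binary one."""
        if tag in INANIMATE:
            return 'inanimate'
        if tag in ANIMATE:
            return 'animate'
        return 'unknown'  # e.g. mixed animacy or annotator uncertainty

    print(to_binary('HUM'), to_binary('TIM'))  # animate inanimate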

In the work of Baker & Brew (2010), various language-specific heuristics are used to improve coverage of, e.g., loanwords from English. The features used are mainly frequency counts of nouns as subjects or objects of certain verbs. This is then fed to a Bayesian classifier, which yields quite good results on both Japanese and English.

Taking these works into account, it is clear that the use of morphosyntactic features can provide relevant information for the task of animacy classification. However, both of these approaches use binary classification schemes. It is therefore not clear whether acceptably good results could be obtained for more elaborate schemes.

3.2 Exploiting lexico-semantic resources

Orăsan & Evans (2007) present an animacy classifier which is based on knowledge obtained from WordNet (Miller, 1995). In one approach, they base this on the so-called unique beginners at the top of the WordNet hierarchy. The fact that some of these are closely related to animacy is then used to infer the animacy of their hyponyms. The inclusion of the classifications obtained by this system in the task of anaphora resolution is shown to improve its results.

An animacy classifier based on exploiting synonymy relations, in addition to hyponymy and hyperonymy, has been described for Basque (de Illaraza et al., 2002). In this work, a small set consisting of 100 nouns was manually annotated. Using an electronic dictionary from which semantic relations could be inferred, they then further automatically annotated all common nouns in a 1 million word corpus.

An approach to animacy classification for Dutch is presented in Bloem & Bouma (to appear). This approach exploits a lexical semantic resource, from which word senses were obtained and merged per lemma. This is done as they postulate that ambiguity in animacy per lemma ought to be relatively rare. Each lemma was then assigned a simplified animacy class depending on its animacy category – either human, non-human or inanimate. Similarly to Baker & Brew (2010), they also use dependency features obtained from an automatically parsed corpus for Dutch. This type-based approach obtains accuracies in the low 90% range, compared to a most frequent class baseline of about 81%.

Based on the three aforementioned works, it is clear that semantic relations obtained from lexico-semantic resources such as WordNet are particularly informative for the classification of animacy.

3.3 Multi-class animacy classification

An animacy classifier which distinguishes between ten different classes of animacy has been developed by Bowman & Chopra (2012). They use a simple logistic regression classifier and quite straight-forward bag-of-words and PoS features, as well as subject, object and PP dependencies. These are obtained from the aforementioned Switchboard corpus, for which they obtain quite good results.

A quite involved system for animacy classification, based on using an ensemble of voters, is presented by Moore et al. (2013). This system draws its strength from the fact that, rather than defining a large number of features and training one complex classifier, it uses more interpretable voting models which differ depending on the class in question. They distinguish between three categories, namely person, animal and inanimate. The voters comprise a variety of systems, based on the n-gram list method of Ji and Lin (2009), a WordNet-based approach similar to Orăsan & Evans (2007), and several others. Their results yield animacy detection rates in the mid-90% range, and can therefore be seen as an improvement upon the state of the art. However, comparison between animacy classification systems is not all that straight-forward, considering the disparity between the data sets and classification schemes used.

These two works show that multi-class animacy classification can be done successfully both with syntactic and with semantic features.

4 Data

Two annotated corpora are used in this work. A further data source is concreteness ratings obtained through manual annotation (Brysbaert et al., 2013), which are used as a feature in the classifier. These ratings were obtained for approximately 40,000 English words and two-word expressions, through the use of internet crowd-sourcing. The rating was given on a five-point scale, ranging from abstract, or language based, to concrete, or experience based (Brysbaert et al., 2013).

4.1 The NXT Switchboard Corpus

Firstly, the classifier is trained and evaluated on the Switchboard corpus, as this allows for direct comparison of results with at least one previous approach (i.e. Bowman & Chopra (2012)).

4.1.1 Pre-processing of spoken data

The fact that the Switchboard corpus consists of transcribed spoken data presents challenges for some of the tools used in the feature extraction process. The primary concern identified, apart from the differing form of spoken language as compared to written language, is the presence of disfluency markers in the transcribed texts. As a preprocessing step, all disfluencies were removed using a simple automated script (a minimal sketch of this step is given at the end of this section). Essentially, this consisted of removing all words tagged as interjections (labelled with the tag UH), as this is the tag assigned to disfluencies in the Switchboard corpus. Although interjections can generally be informative, the occurrence of interjections within NPs was restricted to usage as disfluencies.

4.2 The Groningen Meaning Bank

There are several corpora of reasonable size which include semantic annotation on some level, such as PropBank (Palmer et al., 2005), FrameNet (Baker et al., 1998), and the Penn Discourse TreeBank (Prasad et al., 2005). The combination of several levels of semantic annotation into one formalism is not common, however. Although some efforts exist, they tend to lack a level of formally grounded "deep" semantic representation which combines these layers.

The Groningen Meaning Bank (GMB) contains a substantial collection of English texts with such deep semantic annotation (Basile et al., 2012a). One of its goals is to combine semantic phenomena into a single formalism, as opposed to dealing with single phenomena in isolation. This provides a better handle on explaining dependencies between various ambiguous linguistic phenomena.

Manually annotating a comprehensive corpus with gold-standard semantic representations is obviously a hard and time-consuming task. Therefore, a sophisticated bootstrapping approach is used. Existing NLP tools are used to get a reasonable approximation of the target annotations to start with. Pieces of information coming from both experts (linguists) and crowd-sourcing methods are then added in to improve the annotation. The addition of animacy annotation is done in the same manner. First, the animacy classifier is incorporated into this toolchain. We then correct the tags for a subset of the corpus, which is also used to evaluate the classifier. Note that the classifier used in the toolchain uses a different model from the conditions where we evaluate on the Switchboard corpus. For the GMB, we include training data obtained through the crowd-sourcing game Wordrobe, which uses a subset of the data from the GMB (Venhuizen et al., 2013).

4.2.1 Annotation

So as to allow for evaluation of the classifier on a widely used semantically annotated corpus, one part (p00) of the GMB was semi-manually annotated for animacy, although this might lead to a bias, with potentially overly good results for our classifier, if annotators are affected by its output. We use the tagset presented by Zaenen et al. (2004), which is given in Table 1. This tagset was chosen for the addition of animacy annotation to the GMB. Including this level of annotation in a resource which already contains other semantic annotation should prove particularly useful, as this allows animacy to be used in conjunction with other semantically based features in NLP tools and algorithms.
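Returning to the pre-processing of Section 4.1.1: a minimal version of the disfluency-removal script, assuming the input is a list of (token, PoS tag) pairs, could look as follows (an illustrative sketch, not the original script):

    def remove_disfluencies(tagged_tokens):
        """Drop interjections, the tag assigned to disfluencies in
        the Switchboard corpus (Section 4.1.1).

        tagged_tokens: list of (token, pos) pairs; tokens tagged UH
        are removed.
        """
        return [(tok, pos) for tok, pos in tagged_tokens if pos != 'UH']

    sentence = [('well', 'UH'), ('the', 'DT'), ('dog', 'NN'),
                ('uh', 'UH'), ('barked', 'VBD')]
    print(remove_disfluencies(sentence))
    # [('the', 'DT'), ('dog', 'NN'), ('barked', 'VBD')]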

The animacy annotation of p00 was done using the GMB's interface for expert annotation (Basile et al., 2012b). A total of 102 documents, containing approximately 15,000 tokens, were annotated by an expert annotator, who corrected the tags assigned by the classifier. We assign animacy tags to all nouns and pronouns. Similarly to our tagging convention for named entities, we assign the same tag to the whole NP, so that wagon driver is tagged with HUM, although wagon in isolation would be tagged with CNC. This has the added advantage that this is the manner in which NPs are annotated in the Switchboard corpus, making evaluation and comparison with Bowman & Chopra (2012) somewhat more straight-forward. An example of a tagged document can be seen in Figure 1. Table 2 shows the amount of annotated nouns per class. In order to verify the integrity of this annotation, two other experts annotated a random selection of ten documents. Inter-annotator agreement was calculated using Fleiss' kappa on this selection, yielding a score of κ = .596.

Figure 1: A tagged document in the GMB.

Table 2: Annotation statistics for p00 of the GMB

HUM   NCN   CNC  TIM  ORG  LOC  ANI  VEH  MAC
1436  2077  79   500  887  512  67   28   0

5 Method

5.1 Classifiers

We experiment using four different classifiers (see Table 3). All classifiers used are obtained from the implementations provided by SciKit-learn (Pedregosa et al., 2011). For each type of classifier, we train one classifier for each class in a one-versus-all fashion (a sketch of this setup is given at the end of this subsection). For source code, trained models and scripts to run the experiments in this paper, please see https://github.com/bjerva/animacy.

The classifiers are trained on a combination of the Switchboard corpus and data gathered from Wordrobe, depending on the experimental condition. In addition to the features explained below, the classifier exploits named entity tags, in that these override the proposed animacy tag where applicable. That is to say, if a named entity has already been identified and tagged as, e.g., a person, this is reflected in the animacy layer with the HUM tag.

Considering that the balance between samples per class is quite skewed, an attempt was made at placing lower weights on the samples from the majority classes. Although this did lead to a marginal increase in accuracy for the minority classes, overall accuracy dropped to such an extent that this weighting was not used for the results presented in this work.
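A minimal sketch of this one-versus-all setup, using the SciKit-learn implementations listed in Table 3 (the feature dictionaries and training examples below are toy placeholders, not our actual training pipeline):

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline

    # Toy NPs described by features of the kind discussed in
    # Section 5.2; real training data comes from Switchboard
    # (and, for the GMB model, Wordrobe).
    X = [{'lemma': 'driver', 'pos': 'NN'},
         {'lemma': 'motorcycle', 'pos': 'NN'},
         {'lemma': 'monday', 'pos': 'NNP'}]
    y = ['HUM', 'VEH', 'TIM']

    # One binary logistic regression classifier per animacy class,
    # with l2 regularization as in Table 3.
    clf = make_pipeline(
        DictVectorizer(),
        OneVsRestClassifier(LogisticRegression(penalty='l2')))
    clf.fit(X, y)
    print(clf.predict([{'lemma': 'driver', 'pos': 'NN'}]))  # ['HUM']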

Table 3: Overview of the classifiers used in the experiments.

Classifier                         Reference                Parameter settings
Logistic Regression (MaxEnt)       (Berger et al., 1996)    l2 regularization
Support Vector Machine (SVM)       (Joachims, 1998)         linear kernel
Stochastic Gradient Descent (SGD)  (Tsuruoka et al., 2009)  l2 regularization, hinge loss
Bernoulli Naive Bayes (B-NB)       (McCallum et al., 1998)  –

5.2 Features

In this section, an overview of the features used by the classifiers is given.

5.2.1 Bag-of-words feature

The simplest feature used consists of looking at each lemma in the NP to be classified, and the corresponding PoS tags. We also experimented with using whole sentences as context for classification, but as this worsened results on our development data, it was not used for the evaluations later in the paper.

5.2.2 Concreteness ratings

Considering that two of the categories in our tag set discriminate between concrete and non-concrete entities, we include concreteness ratings as a feature in the classifier (Brysbaert et al., 2013). In their original form, these ratings are quite fine-grained, as they are provided as the average concreteness score given by annotators on a scale. We experimented with using different granularities of these scores as a feature. A simple binary distinction, where anything with a score of c > 2.5 is represented as concrete and anything with c ≤ 2.5 as non-concrete, yielded the best results, and is used in the evaluations in this paper.

5.2.3 WordNet distances

We also include a feature based on WordNet distances. In this work, we use the path distance similarity measure provided in NLTK (Bird, 2006). In essence, this measure provides a score based on the shortest path that connects the senses in a hypernym/hyponym taxonomy. First, we calculate the distance to each hypernym of every given word. These distances are then summed together for each animacy class. Taking the most frequent hypernym for each animacy class gives us the following hypernyms: person.n.01, abstraction.n.06, city.n.01, time_period.n.01, car.n.01, organization.n.01, artifact.n.01, animal.n.01, machine.n.01, buddy.n.01. The classifier then uses whichever of these words is closest as its WordNet feature.

5.2.4 Thematic roles

The use of thematic roles for animacy annotation constitutes a novel contribution of this work. Intuitively this makes sense, as e.g. agents tend to be animate. Although the GMB contains an annotation layer with thematic roles, the Switchboard corpus does not. In order to use this feature, we therefore preprocessed the latter using Boxer (Bos, 2008). We use the protoroles obtained from Boxer, namely agent, theme and patient. Although automatic annotation does not provide 100% accuracy, especially on such a particular data set, this feature proved somewhat useful (see Section 6.1.2).

6 Results

6.1 Evaluation on the Switchboard corpus

We employ 10-fold cross validation for the evaluations on the Switchboard corpus. All NPs were automatically extracted from the pre-processed corpus, put into random order and divided into ten equally-sized folds. In each of the ten cross validation iterations, one of these folds was left out and used for evaluation. For the sake of conciseness, averaged results over all classes are given in the comparisons of Section 6.1.1 and Section 6.1.2, whereas detailed results are only given for the best performing classifier. Note that the training data from Wordrobe is not used for the evaluations on the Switchboard corpus, as this would prohibit fair comparison with previous work.

6.1.1 Classifier evaluation

We first ran experiments to evaluate which of the classifiers performed best on this task. Figure 2 shows the average accuracy for each classifier, using 10-fold cross validation on the Switchboard corpus. Table 4 contains the per-class results from the cross validation performed with the best performing classifier, namely the Logistic Regression classifier. The remaining evaluations in this paper are all carried out with this classifier. Average accuracy over the 10 folds was 85.8%. This is well above the baseline of always picking the most common class (HUM), which results in an accuracy of 45.3%. More interestingly, it is somewhat higher than the best result for this dataset reported in the literature (84.9% without cross validation (Bowman and Chopra, 2012)).

[Figure 2: bar chart of classifier accuracies – MaxEnt 85.8, SGD 84.2, SVM 82.0, B-NB 80.8.]

Figure 2: Accuracy of the classifiers, using 10-fold cross validation on the Switchboard corpus. The dashed line represents the most frequent class baseline.

6.1.2 Feature evaluation

Using the best performing classifier, we ran experiments to evaluate how different features affect the results. These experiments were also performed using 10-fold cross validation on the Switchboard corpus. Table 5 shows scores from using only one feature in addition to the lemma and PoS of the head of the NP to be classified. Although none of the features in isolation add much to the performance of the classifier, some marginal gains can be observed.

Table 5: Comparison of the effect of including single features, from cross validation on the Switchboard corpus. All conditions consist of the feature named in the condition column in addition to Lemma+PoS.

Condition       Precision  Recall  F-score
Lemma+PoS       0.846      0.850   0.848
Bag of Words    0.851      0.856   0.853
Concreteness    0.847      0.851   0.849
WordNet         0.849      0.855   0.852
Thematic Roles  0.847      0.851   0.849
All features    0.851      0.857   0.854

6.1.3 Performance on unknown words

For a task such as animacy classification, where many words can be reliably classified based solely on their lemma and PoS tag, it is particularly interesting to investigate performance on unknown words. As in all other conditions, this was evaluated using 10-fold cross validation on the Switchboard corpus. It should come as no surprise that the results are substantially below those for known words, for every single class. The average accuracy for this condition was 59.2%, which can be compared to the most frequent class (NCN) baseline at 43.0%.

6.2 Evaluation on the GMB

Since one of the purposes of the development of this classifier was to include it in the tools used in the tagging of the GMB, we also present the first results in the literature for the animacy annotation of this corpus. Due to the limited size of the portion of this corpus for which animacy tags have been manually corrected, no cross-validation was performed. However, due to the large differences between the training data from the Switchboard corpus and the evaluation data in the GMB, the results can be seen as a lower bound for this classifier on this data set. Table 4 contains the results from this evaluation. The accuracy on this dataset was 79.4%, which can be compared to a most frequent class baseline of 37.2%.

6.3 Excluding pronouns

The discrepancy between the results obtained on the Switchboard corpus and on the GMB calls for some investigation. Considering that the Switchboard corpus consists of spoken language, it contains a relatively large amount of personal pronouns compared to, e.g., news text. Taking into account that these pronouns are rarely ambiguous as far as animacy is concerned, it seems feasible that this may be why the results for the Switchboard corpus are better than those for the GMB.

To evaluate this, a separate experiment was run in which all pronouns were excluded. As a large proportion of pronouns are tagged as HUM, the F-scores for this class dropped by 8% and 5% for the Switchboard corpus and the GMB respectively. For the GMB, results for the other classes remained fairly stable, most likely due to there not being many pronouns present which affect the remaining classes. For the Switchboard corpus, however, an increase in F-score was observed for several classes. This might be explained by the exclusion of pronouns lowering the classifier's pre-existing bias towards the HUM class, as the number of annotated examples was lowered from approximately 85,000 to 15,000.

Animacy classification of pronouns can be considered trivial, as there is little or no ambiguity in the fact that the referent of e.g. he is HUM. Even so, pronouns were included in the main results provided here, as this is the standard manner of reporting results in prior work.
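Before turning to the per-class results in Table 4, the two lexical features of Sections 5.2.2 and 5.2.3 can be illustrated as follows. The sketch is a simplification: the paper sums distances over all hypernyms per class, whereas here only the single best path similarity to each class anchor is used, and the pairing of anchor synsets with animacy classes is our own inference from the list in Section 5.2.3:

    from nltk.corpus import wordnet as wn

    # Anchor synsets from Section 5.2.3; the class pairing is
    # inferred. The paper also lists buddy.n.01, whose class is less
    # obvious and which is omitted here.
    CLASS_ANCHORS = {
        'HUM': 'person.n.01', 'NCN': 'abstraction.n.06',
        'LOC': 'city.n.01', 'TIM': 'time_period.n.01',
        'VEH': 'car.n.01', 'ORG': 'organization.n.01',
        'CNC': 'artifact.n.01', 'ANI': 'animal.n.01',
        'MAC': 'machine.n.01',
    }

    def binarize_concreteness(score, threshold=2.5):
        """Binary concreteness feature (Section 5.2.2)."""
        return 'concrete' if score > threshold else 'non-concrete'

    def wordnet_feature(lemma):
        """Animacy class whose anchor synset is closest to the lemma,
        by NLTK path distance similarity (higher means closer)."""
        best_class, best_sim = None, 0.0
        for cls, anchor in CLASS_ANCHORS.items():
            target = wn.synset(anchor)
            for sense in wn.synsets(lemma, pos=wn.NOUN):
                sim = sense.path_similarity(target) or 0.0
                if sim > best_sim:
                    best_class, best_sim = cls, sim
        return best_class

    print(binarize_concreteness(4.1))  # concrete
    print(wordnet_feature('panda'))    # ANI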

Table 4: Results from 10-fold cross validation on the Switchboard corpus and evaluation on the GMB.

               Switchboard                          GMB
Class  Count  Precision  Recall  F-score   Count  Precision  Recall  F-score
HUM    82596  0.91       0.97    0.94      1436   0.82       0.79    0.80
NCN    62740  0.82       0.94    0.88      2077   0.76       0.88    0.82
CNC    12425  0.75       0.43    0.55      79     0.48       0.13    0.20
TIM    7179   0.88       0.85    0.87      500    0.77       0.95    0.85
ORG    6847   0.71       0.26    0.38      887    0.85       0.68    0.75
LOC    5592   0.71       0.66    0.69      512    0.89       0.71    0.79
ANI    2362   0.89       0.36    0.51      67     0.63       0.22    0.33
VEH    1840   0.89       0.45    0.59      28     1.00       0.39    0.56
MAC    694    0.80       0.34    0.47      -      -          -       -
MIX    34     0.00       0.00    0.00      -      -          -       -
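Per-class precision, recall and F-scores of the kind shown in Table 4 above can be produced directly with SciKit-learn; a minimal sketch with placeholder labels:

    from sklearn.metrics import classification_report

    # Gold and predicted animacy tags for a handful of NPs
    # (placeholder data; in the paper these come from 10-fold
    # cross validation).
    y_true = ['HUM', 'HUM', 'NCN', 'TIM', 'LOC', 'HUM']
    y_pred = ['HUM', 'NCN', 'NCN', 'TIM', 'LOC', 'HUM']

    print(classification_report(y_true, y_pred, digits=2))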

6.4 Summary of results

Table 6 contains a brief overview of the most essential results from this work. For the Switchboard corpus, this constitutes the current best result in the literature. As for the GMB, this constitutes the first results in the literature for animacy classification.

Table 6: Main results from all conditions. B&C (2012) refers to Bowman & Chopra (2012).

Corpus       Condition      Accuracy
Switchboard  B&C (2012)     0.849
             Unknown words  0.592
             Known words    0.860
             All words      0.858
GMB          Unknown words  0.764
             Known words    0.831
             All words      0.794

7 Discussion

The work presented in this paper constitutes a minor improvement over the previously best results for multi-class animacy classification on the Switchboard corpus (Bowman and Chopra, 2012). Additionally, we also present the first results in the literature for animacy classification on the GMB, allowing future research to use this work as a point of comparison. It is, however, important to note that the results obtained for the GMB in this paper are prone to bias, as the annotation procedure was done in a semi-automatic fashion. If annotators were affected by the output of the classifier, this is likely to have improved the results presented here.

A striking factor when observing the results is the high discrepancy in performance between the GMB and the Switchboard corpus. This is, however, not all that surprising. Considering that the Switchboard corpus consists of spoken language, and the GMB contains written language, one can easily draw the conclusion that the domain differences pose a substantial obstacle. This can, for instance, be seen in the differing vocabulary. In the cross-validation conditions for the Switchboard corpus, approximately 1% of the words to be classified in each fold are unknown to the classifier. As for the GMB, approximately 10% of the words are unknown. As mentioned in Section 6.1.2, the lemma of the head noun in an NP is a very strong feature, which naturally cannot be used in the case of unknown words. As seen in Table 6, performance on known words in the GMB is not far from that on known words in the Switchboard corpus.

Although a fairly good selection of classifiers was tested in this work, there is room for improvement in this area. The fact that the Logistic Regression classifier outperformed all other classifiers is likely to have been caused by not enough effort being put into parameter selection for the other classifiers. More sophisticated classifiers, such as Artificial Neural Networks, ought to at the very least replicate the results achieved here. Quite likely, results should even improve, seeing that the added computational power of ANNs allows us to capture more interesting and deeper statistical patterns, if they exist in the data.

The features used in this paper mainly revolved around semantically oriented ones, such as semantic relations from WordNet, thematic roles and, arguably, concreteness ratings. Better results could most likely be achieved if one also incorporated more syntactically oriented features, such as frequency counts from a dependency-parsed corpus, as done by e.g. Bowman & Chopra (2012) and Øvrelid (2009). Other options include the use of more linguistically motivated features, such as exploiting relative pronouns (i.e. who vs. which).
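Such a relative-pronoun cue could, for instance, be realized as a simple categorical feature; a hypothetical sketch (the NP-boundary interface is an assumption, not part of our system):

    def relative_pronoun_feature(tokens, np_end):
        """Relativizer following an NP: 'who' points towards animate
        referents, 'which' towards inanimate ones.

        tokens: sentence as a list of lowercased tokens;
        np_end: index of the last token of the NP (assumed interface).
        """
        nxt = tokens[np_end + 1] if np_end + 1 < len(tokens) else None
        if nxt in ('who', 'which'):
            return 'rel=' + nxt
        return 'rel=none'

    print(relative_pronoun_feature(
        ['the', 'author', 'who', 'wrote', 'it'], 1))  # rel=who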

8 Conclusions and future work

At the beginning of this paper, we set out three aims. Firstly, we wanted to improve upon the state of the art in multi-class animacy classification. A conclusive statement to that effect is hard to make, considering that direct comparison was only made to one previous work. However, as our performance was somewhat higher than in that work, this certainly marks some sort of improvement. Secondly, we aimed at investigating whether a corpus of spoken language containing animacy annotation could be used to annotate a corpus of written language. As our results for the GMB are well above the baseline, we conclude that this is indeed feasible, in spite of the disparities between language form and vocabulary. Lastly, we aimed at using the resulting classifier as a part of the toolchain used to annotate the GMB. This goal has also been met.

As for future work, the fact that animacy is marked explicitly in many languages presents a golden opportunity to alleviate the annotation of this semantic property for languages in which it is not explicitly marked. By identifying these markers, the annotation of animacy in such a language should be relatively trivial through the use of parallel texts. Alternatively, one could look at using existing annotated corpora, such as Talbanken05 (Nivre et al., 2006), as a source of annotation. One could then look at transferring this annotation to a second language. Although intuitively promising, this approach has some potential issues, as animacy is not represented universally across languages. For instance, fluid containers (e.g. cups, spoons) represent a class of nouns which are considered grammatically animate in Algonquian (Quinn, 2001). Annotating such items as animate in English would most likely not be considered correct, neither by native speakers nor by most experts. Nevertheless, if a sufficiently large number of languages have some manner of consensus as to where a given entity is in an animacy hierarchy, this problem ought to be solvable by simply hand-picking such languages.

References

Kirk Baker and Chris Brew. 2010. Multilingual animacy classification by sparse logistic regression. OSUWPL, 59:52–75.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics: Proceedings of the Conference, pages 86–90, Université de Montréal, Montreal, Quebec, Canada.

Valerio Basile, Johan Bos, Kilian Evang, and Noortje Venhuizen. 2012a. Developing a large semantically annotated corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pages 3196–3200, Istanbul, Turkey.

Valerio Basile, Johan Bos, Kilian Evang, and Noortje Venhuizen. 2012b. A platform for collaborative semantic annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 92–96, Avignon, France.

Adam L. Berger, Vincent J. Della Pietra, and Stephen A. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.

Steven Bird. 2006. NLTK: the Natural Language Toolkit. In Proceedings of the COLING/ACL on Interactive Presentation Sessions, pages 69–72. Association for Computational Linguistics.

Jelke Bloem and Gosse Bouma. To appear. Automatic animacy classification for Dutch. Computational Linguistics in the Netherlands Journal, 3, 12/2013.

Johan Bos. 2008. Wide-Coverage Semantic Analysis with Boxer. In J. Bos and R. Delmonte, editors, Semantics in Text Processing. STEP 2008 Conference Proceedings, volume 1 of Research in Computational Semantics, pages 277–286. College Publications.

Samuel R. Bowman and Harshit Chopra. 2012. Automatic animacy classification. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop, pages 7–10. Association for Computational Linguistics.

Joan Bresnan, Anna Cueni, Tatiana Nikitina, R. Harald Baayen, et al. 2007. Predicting the dative alternation. Cognitive Foundations of Interpretation, pages 69–94.

Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. 2013. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, pages 1–8.

Sasha Calhoun, Jean Carletta, Jason M. Brenier, Neil Mayo, Dan Jurafsky, Mark Steedman, and David Beaver. 2010. The NXT-format Switchboard corpus: a rich resource for investigating the syntax, semantics, pragmatics, and prosody of dialogue. Language Resources and Evaluation, 44(4):387–419.

Bernard Comrie. 1989. Language Universals and Linguistic Typology: Syntax and Morphology. University of Chicago Press.

Östen Dahl and Kari Fraurud. 1996. Animacy in grammar and discourse. Pragmatics and Beyond New Series, pages 47–64.

Arantza Díaz de Illaraza, Aingeru Mayor, and Kepa Sarasola. 2002. Semiautomatic labelling of semantic features. In Proceedings of the 19th International Conference on Computational Linguistics.

Felice Dell'Orletta, Alessandro Lenci, Simonetta Montemagni, and Vito Pirrelli. 2005. Climbing the path to grammar: A maximum entropy model of subject/object learning. In Proceedings of the Workshop on Psychocomputational Models of Human Language Acquisition, pages 72–81. Association for Computational Linguistics.

Heng Ji and Dekang Lin. 2009. Gender and animacy knowledge discovery from web-scale n-grams for unsupervised person mention detection. In PACLIC, pages 220–229.

Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. Springer.

Heeyoung Lee, Angel Chang, Yves Peirsman, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2013. Deterministic coreference resolution based on entity-centric, precision-ranked rules.

Andrew McCallum, Kamal Nigam, et al. 1998. A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, volume 752, pages 41–48. Citeseer.

George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

Joshua L. Moore, Christopher J.C. Burges, Erin Renshaw, and Wen-tau Yih. 2013. Animacy detection with voting models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 55–60. Association for Computational Linguistics.

Joakim Nivre, Jens Nilsson, and Johan Hall. 2006. Talbanken05: A Swedish treebank with phrase structure and dependency annotation. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), pages 1392–1395.

Constantin Orăsan and Richard Evans. 2007. NP animacy identification for anaphora resolution. Journal of Artificial Intelligence Research (JAIR), 29:79–103.

Lilja Øvrelid and Joakim Nivre. 2007. When word order and part-of-speech tags are not enough – Swedish dependency parsing with rich linguistic features. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP), pages 447–451.

Lilja Øvrelid. 2005. Animacy classification based on morphosyntactic corpus frequencies: some experiments with Norwegian nouns. In Proc. of the Workshop on Exploring Syntactically Annotated Corpora.

Lilja Øvrelid. 2006. Towards robust animacy classification using morphosyntactic distributional features. In Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 47–54. Association for Computational Linguistics.

Lilja Øvrelid. 2009. Empirical evaluations of animacy annotation. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 630–638. Association for Computational Linguistics.

Martha Palmer, Paul Kingsbury, and Daniel Gildea. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Rashmi Prasad, Aravind Joshi, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, and Bonnie Webber. 2005. The Penn Discourse TreeBank as a resource for natural language generation. In Proc. of the Corpus Linguistics Workshop on Using Corpora for Natural Language Generation, pages 25–32.

Conor Quinn. 2001. A preliminary survey of animacy categories in Penobscot. In Papers of the 32nd Algonquian Conference, pages 395–426.

Anette Rosenbach. 2008. Animacy and grammatical variation – findings from English genitive variation. Lingua, 118(2):151–171.

Anatol Stefanowitsch. 2003. Constructional semantics as a limit to grammatical alternation: The two genitives of English. Topics in English Linguistics, 43:413–444.

Yoshimasa Tsuruoka, Jun'ichi Tsujii, and Sophia Ananiadou. 2009. Stochastic gradient descent training for L1-regularized log-linear models with cumulative penalty. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, volume 1, pages 477–485. Association for Computational Linguistics.

Noortje J. Venhuizen, Valerio Basile, Kilian Evang, and Johan Bos. 2013. Gamification for word sense labeling. In Proc. 10th International Conference on Computational Semantics (IWCS-2013), pages 397–403.

Annie Zaenen, Jean Carletta, Gregory Garretson, Joan Bresnan, Andrew Koontz-Garboden, Tatiana Nikitina, M. Catherine O'Connor, and Tom Wasow. 2004. Animacy encoding in English: why and how. In Proceedings of the 2004 ACL Workshop on Discourse Annotation, pages 118–125. Association for Computational Linguistics.