Learning to Identify Animate References
Constantin Orăsan
School of Humanities, Languages and Social Sciences
University of Wolverhampton

Richard Evans
School of Humanities, Languages and Social Sciences
University of Wolverhampton

Abstract

Information about the animacy of nouns is important for a wide range of tasks in NLP. In this paper, we present a method for determining the animacy of English nouns using WordNet and machine learning techniques. Our method firstly categorises the senses from WordNet using an annotated corpus and then uses this information in order to classify nouns for which the sense is not known. Our evaluation results show that the accuracy of the classification of a noun is around 97% and that animate entities are more difficult to identify than inanimate ones.

1 Introduction

Information on the gender of noun phrase (NP) referents can be exploited in a range of NLP tasks including anaphora resolution and the applications that can benefit from it such as coreference resolution, information retrieval, information extraction, machine translation, etc. The gender of NP referents is explicitly realised morphologically in languages such as Romanian, French, Russian, etc. in which the head of the NP or the NP's determiner undergoes predictable morphological transformation or affixation to reflect its referent's gender. In the English language, the gender of NPs' referents is not predictable from the surface morphology.

Moreover, in (Evans and Orăsan, 2000) it was argued that it is not always desirable to obtain information concerning the specific gender of a NP's referent in English. Instead, it is more effective to obtain the animacy of each NP. We define animacy as the property of a NP whereby its referent, in singular rather than plural number, can be referred to using a pronoun in the set {he, him, his, himself, she, her, hers, herself}. During the course of this paper, we will discuss animate and inanimate senses of nouns and verbs. We use these expressions to denote the senses of nouns that are the heads of NPs referring to animate/inanimate entities and the senses of verbs whose agents are typically animate/inanimate entities.

In our previous work, we investigated the use of WordNet in order to determine the animacy of entities in discourse. There, we used the fact that each noun and verb sense is derived from unique classes called unique beginners. We classified each unique beginner as being a hypernym of a set of senses that were for the most part either animate or inanimate (in the case of nouns) or indicative of animacy/inanimacy in their subjects (in the case of verbs). In classifying a noun, the number of its senses that belong to an animate class is compared with the number belonging to an inanimate class, and this information is used to make the final classification. In addition, if the noun is the head of a subject, the same information is computed for the verb. Our assumption was that a noun with many animate senses is likely to be used to refer to an animate entity. For subjects, the information from the main verb was used to take into consideration the context of the sentence. That system, referred to in this paper as the previous system, also used a proper name gazetteer and some simple rules which mainly assisted in the classification of named entities. For reasons explained in Section 4.2, these additions to the basic algorithm were ignored in the comparative evaluation described there.

Experiments with that algorithm showed it to be useful. Applied to a system for automatic pronominal anaphora resolution, it led to a substantial improvement in the ratio of suitable and unsuitable candidates in the sets considered by the anaphora resolver (Evans and Orăsan, 2000).

However, the previous system has two main weaknesses. The first one comes from the fact that the classes used to determine the number of animate/inanimate senses are too general, and in most cases they do not reliably indicate the animacy of each sense in the class. The second weakness is due to the naive nature of the rules that decide if a NP is animate or not. Their application is simple and involves a comparison of values obtained for a NP with threshold values that were determined on the basis of a relatively small number of experiments.

In this paper, we present a new method for animacy identification which uses WordNet and machine learning techniques. The remainder of the paper is structured as follows. Section 2 briefly describes some concepts concerning WordNet that are used in this paper. In Section 3, our two step method is described. An evaluation of the method and discussion of the results is presented in Section 4. We end the paper by reviewing previous related work and drawing some conclusions.

2 Background information

As previously mentioned, in this research WordNet (Fellbaum, 1998) is used to identify the animacy of a noun. In this section several important concepts from WordNet are explained.

WordNet is an electronic lexical resource organized hierarchically by relations between sets of synonyms or near-synonyms called synsets. Each of the four primary classes of content-words, nouns, verbs, adjectives and adverbs are arranged under a small set of so-called unique beginners. In the case of nouns and verbs, which are the concern of the present paper, the unique beginners are the most general concepts under which the entire set of entries is organized on the basis of hyponymy/hypernymy relations. Hypernymy is the relation that holds between such word senses as vehicle_1 - ship_1 or human_1 - politician_1, in which the first items in the pairs are more general than the second. Conversely, the second items are more specific than the first, and are their hyponyms.

It is usual to regard hypernymy as a vertically arranged relationship, with general senses positioned higher than more specific ones in an ontology. In WordNet, the top-most senses are called unique beginners. Senses at the same vertical level in the ontology are also clustered horizontally through the synonymy relation in synsets. In this paper, the term node is used interchangeably with synset.

As explained in Section 3.1, our method requires that the nodes in WordNet are classified according to their animacy. Given the size of WordNet, this task cannot be done manually and a corpus where words are annotated with their senses was necessary. A corpus that meets these requirements is SEMCOR (Landes et al., 1998), a subset of the Brown Corpus in which the nouns and the verbs have been manually annotated with their senses from WordNet.

3 The method

In this section a two step method used to classify words according to their animacy is presented. In Section 3.1, we present an automatic method for determining the animacy of senses from WordNet on the basis of an annotated corpus. Once the senses from WordNet have been classified, a classical machine learning technique uses this information to determine the animacy of a noun for which the sense is not known. This technique is presented in Section 3.2.

3.1 The classification of the senses

As previously mentioned, the unique beginners are too general to be satisfactorily classified as animate or inanimate.
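To illustrate why a very general node is hard to label, the following minimal Python sketch sums animate and inanimate counts over a toy hierarchy. The node names, the counts and the helper functions `totals` and `animate_share` are all hypothetical and chosen only for illustration; they are not taken from WordNet or SEMCOR, and this is not the procedure evaluated in this paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A toy sense node with hypothetical animate/inanimate corpus counts."""
    name: str
    ani: int = 0
    inani: int = 0
    children: list = field(default_factory=list)


def totals(node):
    """Return (animate, inanimate) counts summed over a node and its hyponyms."""
    ani, inani = node.ani, node.inani
    for child in node.children:
        child_ani, child_inani = totals(child)
        ani += child_ani
        inani += child_inani
    return ani, inani


def animate_share(node):
    """Fraction of animate occurrences under a node, or None if it has no counts."""
    ani, inani = totals(node)
    total = ani + inani
    return ani / total if total else None


# Hypothetical hierarchy: one very general node over two more specific ones.
person = Node("person", children=[Node("politician", ani=12), Node("teacher", ani=8)])
artifact = Node("artifact", children=[Node("ship", inani=15), Node("hammer", inani=5)])
entity = Node("entity", children=[person, artifact])

for node in (entity, person, artifact):
    print(node.name, totals(node), animate_share(node))
```

In this toy data the general node mixes animate and inanimate occurrences in equal parts, whereas each of its more specific hyponyms is uniform.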
However, this does not mean that it is not possible to uniquely classify more specific senses as animate or inanimate. In this section, we present a corpus-based method which classifies the synsets from WordNet according to their animacy.

The NPs in a 52 file subset of the SEMCOR corpus were manually annotated with animacy information and then used by an automatic system to classify the nodes. These 52 files contain 2512 animate entities and 17514 inanimate entities.

The system attempts to classify the senses from WordNet that explicitly appear in the corpus directly, on the basis of their frequency.¹ However, our goal is to design a procedure which is also able to classify senses that are not found in the corpus. To this end, we decided to use a bottom up procedure which starts by classifying the terminal nodes and then continues with more general nodes.

[Figure: a HYPERNYM node with counts $ani_h = \sum_i ani_i$ and $inani_h = \sum_i inani_i$, placed above its hyponyms Sense_1, Sense_2, Sense_3, ..., Sense_n, each carrying its own counts $ani_i$ and $inani_i$.]
Figure 1: Example of hypernymy relation between senses in WordNet

            Sense_1            Sense_2            Sense_3            ...   Sense_n
Observed    ani_1              ani_2              ani_3              ...   ani_n
Expected    ani_1 + inani_1    ani_2 + inani_2    ani_3 + inani_3    ...   ani_n + inani_n
Table 1: Contingency table for testing if a hypernym is animate

[...] hyponyms of a sense are animate, then the sense itself is animate". However, this does not always hold because of annotation errors or rare uses of a sense and instead, a statistical measure must be used to test the animacy of a more general node. Several measures were considered and the most appropriate one seemed to be chi-square.

Chi-square is a non-parametric test which can be used for estimating whether or not there is any difference between the frequencies of items in frequency tables (Oakes, 1998). The formula used to calculate chi-square is:

$$\chi^2 = \sum \frac{(O - E)^2}{E} \tag{1}$$

where O is the observed number of cases and E the expected number of cases. If $\chi^2$ is less than or equal to a critical level, we may conclude that