Unsupervised and Semi-Supervised Learning of English Noun Gender

Glen, Glenda or Glendale: Unsupervised and Semi-supervised Learning of English Noun Gender

Shane Bergsma
Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada, T6G 2E8

Dekang Lin
Google, Inc., 1600 Amphitheatre Parkway, Mountain View, California, 94301

Randy Goebel
Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada, T6G 2E8

Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL), pages 120–128, Boulder, Colorado, June 2009. © 2009 Association for Computational Linguistics

Abstract

English pronouns like he and they reliably reflect the gender and number of the entities to which they refer. Pronoun resolution systems can use this fact to filter noun candidates that do not agree with the pronoun gender. Indeed, broad-coverage models of noun gender have proved to be the most important source of world knowledge in automatic pronoun resolution systems.

Previous approaches predict gender by counting the co-occurrence of nouns with pronouns of each gender class. While this provides useful statistics for frequent nouns, many infrequent nouns cannot be classified using this method. Rather than using co-occurrence information directly, we use it to automatically annotate training examples for a large-scale discriminative gender model. Our model collectively classifies all occurrences of a noun in a document using a wide variety of contextual, morphological, and categorical gender features. By leveraging large volumes of unlabeled data, our full semi-supervised system reduces error by 50% over the existing state-of-the-art in gender classification.

1 Introduction

Pronoun resolution is the process of determining which preceding nouns are referred to by a particular pronoun in text. Consider the sentence:

(1) Glen told Glenda that she was wrong about Glendale.

A pronoun resolution system should determine that the pronoun she refers to the noun Glenda. Pronoun resolution is challenging because it requires a lot of world knowledge (general knowledge of word types). If she is replaced with the pronoun he in (1), Glen becomes the antecedent. Pronoun resolution systems need the knowledge of noun gender that advises that Glen is usually masculine (and thus referred to by he) while Glenda is feminine.

English third-person pronouns are grouped in four gender/number categories: masculine (he, his, him, himself), feminine (she, her, herself), neutral (it, its, itself), and plural (they, their, them, themselves). We broadly refer to these gender and number classes simply as gender. The objective of our work is to correctly assign gender to English noun tokens, in context; to determine which class of pronoun will refer to a given noun.

One successful approach to this problem is to build a statistical gender model from a noun’s association with pronouns in text. For example, Ge et al. (1998) learn Ford has a 94% chance of being neutral, based on its frequent co-occurrence with neutral pronouns in text. Such estimates are noisy but useful. Both Ge et al. (1998) and Bergsma and Lin (2006) show that learned gender is the most important feature in their pronoun resolution systems.

English differs from other languages like French and German in that gender is not an inherent grammatical property of an English noun, but rather a property of a real-world entity that is being referred to. A common noun like lawyer can be (semantically) masculine in one document and feminine in another. While previous statistical gender models learn gender for noun types only, we use document context to correctly determine the current gender class of noun tokens, making dynamic decisions on common nouns like lawyer and ambiguous names like Ford. Furthermore, if a noun type has not yet been observed (an unknown word), previous approaches cannot estimate the gender. Our system, on the other hand, is able to correctly determine that unknown words corroborators and propeller-heads are plural, while Pope Formosus is masculine, using learned contextual and morphological cues.

Our approach is based on the key observation that while gender information from noun-pronoun co-occurrence provides imperfect noun coverage, it can nevertheless provide rich and accurate training data for a large-scale discriminative classifier. The classifier leverages a wide variety of noun properties to generalize from the automatically-labeled examples. The steps in our approach are:

1. Training:
(a) Automatically extract a set of seed (noun,gender) pairs from high-quality instances in a statistical gender database.
(b) In a large corpus of text, find documents containing these nouns.
(c) For all instances of each noun in each document, create a single, composite feature vector representing all the contexts of the noun in the document, as well as encoding other selected properties of the noun type.
(d) Label each feature vector with the seed noun’s corresponding gender.
(e) Train a 4-way gender classifier (masculine, feminine, neutral, plural) from the automatically-labeled vectors.

2. Testing:
(a) Given a new document, create a composite feature vector for all occurrences of each noun.
(b) Use the learned classifier to assign gender to each feature vector, and thus all occurrences of all nouns in the document.

This algorithm achieves significantly better performance than the existing state-of-the-art statistical gender classifier, while requiring no manually-labeled examples to train. Furthermore, by training on a small number of manually-labeled examples, we can combine the predictions of this system with the counts from the original gender database. This semi-supervised extension achieves 95.5% accuracy on final unseen test data, an impressive 50% reduction in error over previous work.

2 Path-based Statistical Noun Gender

Seed (noun,gender) examples can be extracted reliably and automatically from raw text, providing the training data for our discriminative classifier. We call these examples pseudo-seeds because they are created fully automatically, unlike the small set of manually-created seeds used to initialize other bootstrapping approaches (cf. the bootstrapping approaches discussed in Section 6).

We adopt a statistical approach to acquire the pseudo-seed (noun,gender) pairs. All previous statistical approaches rely on a similar observation: if a noun like Glen is often referred to by masculine pronouns, like he or his, then Glen is likely a masculine noun. But for most nouns we have no annotated data recording their coreference with pronouns, and thus no data from which we can extract the co-occurrence statistics. Thus previous approaches rely on either hand-crafted coreference-indicating patterns (Bergsma, 2005), or iteratively guess and improve gender models through expectation maximization of pronoun resolution (Cherry and Bergsma, 2005; Charniak and Elsner, 2009). In statistical approaches, the more frequent the noun, the more accurate the assignment of gender.

We use the approach of Bergsma and Lin (2006), both because it achieves state-of-the-art gender classification performance, and because a database of the obtained noun genders is available online.¹ Bergsma and Lin (2006) use an unsupervised algorithm to identify syntactic paths along which a noun and pronoun are highly likely to corefer. To extract gender information, they processed a large corpus of news text, and obtained co-occurrence counts for nouns and pronouns connected with these paths in the corpus. In their database, each noun is listed with its corresponding masculine, feminine, neutral, and plural pronoun co-occurrence counts, e.g.:

glen         555   42   32   34
glenda         8  102    0   11
glendale      24    2  167   18
glendalians    0    0    0    1
glenn       3182  207   95   54
glenna         0    6    0    0

¹ Available at http://www.cs.ualberta.ca/~bergsma/Gender/

This sample of the gender data shows that the noun glenda, for example, occurs 8 times with masculine pronouns, 102 times with feminine pronouns, 0 times with neutral pronouns, and 11 times with plural pronouns; 84% of the time glenda co-occurs with a feminine pronoun. Note that all nouns in the data have been converted to lower-case.²

There are gender counts for 3.1 million English nouns in the online database. These counts form the basis for the state-of-the-art gender classifier. We can either take the most-frequent pronoun-gender (MFPG) as the class (e.g. feminine for glenda), or we can supply the logarithm of the counts as features in a 4-way multi-class classifier. We implement the latter approach as a comparison system and refer to it as PATHGENDER in our experiments.

In our approach, rather than using these counts directly, we process the database to automatically extract a high-coverage but also high-quality set of pseudo-seed (noun,gender) pairs. First, we filter nouns that occur less than fifty times and whose MFPG accounts for less than 85% of counts. Next, we note that the most reliable nouns should occur relatively often in a coreferent path. For example, note that importance occurs twice as often on the web as Clinton, but has twenty-four times less counts in the gender database.

... following section. Figure 1 provides a portion of the ordered feminine seed nouns that we extracted.

...
stefanie
steffi graf
steinem
stella mccartney
stellar jayne
stepdaughter
stephanie
stephanie herseth
stephanie white
stepmother
stewardess
...

Figure 1: Sample feminine seed nouns

3 Discriminative Learning of Gender

Once we have extracted a number of pseudo-seed (noun,gender) pairs, we use them to automatically-label nouns (in context) in raw text. The auto-labeled examples provide training data for discriminative learning of noun gender.
