Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Philadelphia, July 2002, pp. 214-221. Association for Computational Linguistics.

A Bootstrapping Method for Learning Semantic Lexicons using Extraction Pattern Contexts

Michael Thelen and Ellen Riloff
School of Computing, University of Utah
Salt Lake City, UT 84112 USA
{thelenm,riloff}@cs.utah.edu

Abstract

This paper describes a bootstrapping algorithm called Basilisk that learns high-quality semantic lexicons for multiple categories. Basilisk begins with an unannotated corpus and seed words for each semantic category, which are then bootstrapped to learn new words for each category. Basilisk hypothesizes the semantic class of a word based on collective information over a large body of extraction pattern contexts. We evaluate Basilisk on six semantic categories. The semantic lexicons produced by Basilisk have higher precision than those produced by previous techniques, with several categories showing substantial improvement.

1 Introduction

In recent years, several algorithms have been developed to acquire semantic lexicons automatically or semi-automatically using corpus-based techniques. For our purposes, the term semantic lexicon will refer to a dictionary of words labeled with semantic classes (e.g., "bird" is an animal and "truck" is a vehicle). Semantic class information has proven to be useful for many natural language processing tasks, including information extraction (Riloff and Schmelzenbach, 1998; Soderland et al., 1995), anaphora resolution (Aone and Bennett, 1996), question answering (Moldovan et al., 1999; Hirschman et al., 1999), and prepositional phrase attachment (Brill and Resnik, 1994). Although some semantic dictionaries do exist (e.g., WordNet (Miller, 1990)), these resources often do not contain the specialized vocabulary and jargon that is needed for specific domains. Even for relatively general texts, such as the Wall Street Journal (Marcus et al., 1993) or terrorism articles (MUC-4 Proceedings, 1992), Roark and Charniak (1998) reported that 3 of every 5 terms generated by their semantic lexicon learner were not present in WordNet. These results suggest that automatic semantic lexicon acquisition could be used to enhance existing resources such as WordNet, or to produce semantic lexicons for specialized domains.

We have developed a weakly supervised bootstrapping algorithm called Basilisk that automatically generates semantic lexicons. Basilisk hypothesizes the semantic class of a word by gathering collective evidence about semantic associations from extraction pattern contexts. Basilisk also learns multiple semantic classes simultaneously, which helps constrain the bootstrapping process.

First, we present Basilisk's bootstrapping algorithm and explain how it differs from previous work on semantic lexicon induction. Second, we present empirical results showing that Basilisk outperforms a previous algorithm. Third, we explore the idea of learning multiple semantic categories simultaneously by adding this capability to Basilisk as well as another bootstrapping algorithm. Finally, we present results showing that learning multiple semantic categories simultaneously improves performance.

2 Bootstrapping using Collective Evidence from Extraction Patterns

Basilisk (Bootstrapping Approach to SemantIc Lexicon Induction using Semantic Knowledge) is a weakly supervised bootstrapping algorithm that automatically generates semantic lexicons. Figure 1 shows the high-level view of Basilisk's bootstrapping process. The input to Basilisk is an unannotated text corpus and a few manually defined seed words for each semantic category. Before bootstrapping begins, we run an extraction pattern learner over the corpus which generates patterns to extract every noun phrase in the corpus.

Figure 1: Basilisk Algorithm. (Flowchart: seed words initialize the semantic lexicon; the best patterns are selected into the pattern pool; the extractions of those patterns fill the candidate word pool; the 5 best candidate words are added back into the lexicon.)

The bootstrapping process begins by selecting a subset of the extraction patterns that tend to extract the seed words. We call this the pattern pool. The nouns extracted by these patterns become candidates for the lexicon and are placed in a candidate word pool. Basilisk scores each candidate word by gathering all patterns that extract it and measuring how strongly those contexts are associated with words that belong to the semantic category. The five best candidate words are added to the lexicon, and the process starts over again. In this section, we describe Basilisk's bootstrapping algorithm in more detail and discuss related work.

Generate all extraction patterns in the corpus and record their extractions.
lexicon = {seed words}
i := 0

BOOTSTRAPPING
1. Score all extraction patterns.
2. pattern pool = top ranked 20+i patterns.
3. candidate word pool = extractions of patterns in pattern pool.
4. Score candidate words in candidate word pool.
5. Add top 5 candidate words to lexicon.
6. i := i + 1
7. Go to Step 1.

Figure 2: Basilisk's bootstrapping algorithm
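To make the control flow concrete, the loop in Figure 2 can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the authors' code: the corpus is assumed to be preprocessed into a map from each pattern to the set of head nouns it extracts, and score_pattern and score_word stand in for the RlogF and AvgLog metrics defined in Sections 2.1.1 and 2.1.2.

    from typing import Callable, Dict, Set

    def basilisk_loop(
        extractions: Dict[str, Set[str]],   # pattern -> head nouns it extracts
        seeds: Set[str],                    # seed words for one category
        score_pattern: Callable[[Set[str], Set[str]], float],  # e.g., RlogF
        score_word: Callable[[str, Dict[str, Set[str]], Set[str]], float],  # e.g., AvgLog
        iterations: int = 200,
    ) -> Set[str]:
        """Single-category bootstrapping, following Figure 2."""
        lexicon = set(seeds)
        for i in range(iterations):
            # Steps 1-2: score every pattern and keep the top 20+i in the
            # pattern pool, skipping "depleted" patterns (footnote 1) whose
            # extractions are all in the lexicon already.
            ranked = sorted(
                (p for p, nouns in extractions.items() if nouns - lexicon),
                key=lambda p: score_pattern(extractions[p], lexicon),
                reverse=True,
            )
            pool = ranked[:20 + i]
            # Step 3: unclassified extractions of pool patterns become candidates.
            candidates = {n for p in pool for n in extractions[p]} - lexicon
            if not candidates:
                break
            # Steps 4-5: rank candidates using evidence from *all* patterns,
            # not just the pool, and add the best five to the lexicon.
            best = sorted(candidates,
                          key=lambda w: score_word(w, extractions, lexicon),
                          reverse=True)[:5]
            lexicon.update(best)
        return lexicon

Note the asymmetry the algorithm relies on: only the growing top-(20+i) pool decides which words become candidates, while candidate scoring consults every pattern that extracts a word.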

2.1 Basilisk

The input to Basilisk is a text corpus and a set of seed words. We generated seed words by sorting the words in the corpus by frequency and manually identifying the 10 most frequent nouns that belong to each category. These seed words form the initial semantic lexicon. In this section we describe the learning process for a single semantic category. In Section 3 we will explain how the process is adapted to handle multiple categories simultaneously.

To identify new lexicon entries, Basilisk relies on extraction patterns to provide contextual evidence that a word belongs to a semantic class. As our representation for extraction patterns, we used the AutoSlog system (Riloff, 1996). AutoSlog's extraction patterns represent linguistic expressions that extract a noun phrase in one of three syntactic roles: subject, direct object, or prepositional phrase object. For example, three patterns that would extract people are: "<subject> was arrested", "murdered <direct object>", and "collaborated with <pp object>". Extraction patterns represent linguistic contexts that often reveal the meaning of a word by virtue of syntax and lexical semantics. Extraction patterns are typically designed to capture role relationships. For example, consider the verb "robbed" when it occurs in the active voice. The subject of "robbed" identifies the perpetrator, while the direct object of "robbed" identifies the victim or target.

Before bootstrapping begins, we run AutoSlog exhaustively over the corpus to generate an extraction pattern for every noun phrase that appears. The patterns are then applied to the corpus and all of their extracted noun phrases are recorded. Figure 2 shows the bootstrapping process that follows, which we explain in the following sections.

2.1.1 The Pattern Pool and Candidate Pool

The first step in the bootstrapping process is to score the extraction patterns based on their tendency to extract known category members. All words that are currently defined in the semantic lexicon are considered to be category members. Basilisk scores each pattern using the RlogF metric that has been used for extraction pattern learning (Riloff, 1996). The score for each pattern is computed as:

    RlogF(pattern_i) = (F_i / N_i) * log2(F_i)    (1)

where F_i is the number of category members extracted by pattern_i and N_i is the total number of nouns extracted by pattern_i. Intuitively, the RlogF metric is a weighted conditional probability; a pattern receives a high score if a high percentage of its extractions are category members, or if a moderate percentage of its extractions are category members and it extracts a lot of them.

The top N extraction patterns are put into a pattern pool. Basilisk uses a value of N=20 for the first iteration, which allows a variety of patterns to be considered, yet is small enough that all of the patterns are strongly associated with the category.[1]

[1] "Depleted" patterns are not included in this set. A pattern is depleted if all of its extracted nouns are already defined in the lexicon, in which case it has no unclassified words to contribute.
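For reference, Equation (1) translates directly into code. The sketch below uses our own naming, not the paper's; it assumes a pattern's extractions are available as a set of head nouns, and it returns negative infinity when a pattern extracts no category members so that such patterns sort last. It fits the score_pattern slot in the loop sketch above.

    import math
    from typing import Set

    def rlogf(pattern_extractions: Set[str], lexicon: Set[str]) -> float:
        """RlogF(pattern) = (F / N) * log2(F), Equation (1).

        F = extracted nouns that are known category members,
        N = all nouns the pattern extracts.
        """
        n = len(pattern_extractions)
        f = len(pattern_extractions & lexicon)
        if f == 0 or n == 0:
            return float("-inf")  # no evidence for this category
        return (f / n) * math.log2(f)

    # A pattern extracting 4 nouns, 2 of them known category members:
    print(rlogf({"Peru", "Colombia", "clashes", "shootout"}, {"Peru", "Colombia"}))
    # (2/4) * log2(2) = 0.5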
The purpose of the pattern pool is to narrow down the field of candidates for the lexicon. Basilisk collects all noun phrases (NPs) extracted by patterns in the pattern pool and puts the head noun of each NP into the candidate word pool. Only these nouns are considered for addition to the lexicon.

As the bootstrapping progresses, using the same value N=20 causes the candidate pool to become stagnant. For example, let's assume that Basilisk performs perfectly, adding only valid category words to the lexicon. After some number of iterations, all of the valid category members extracted by the top 20 patterns will have been added to the lexicon, leaving only non-category words left to consider. For this reason, the pattern pool needs to be infused with new patterns so that more nouns (extractions) become available for consideration. To achieve this effect, we increment the value of N by one after each bootstrapping iteration. This ensures that there is always at least one new pattern contributing words to the candidate word pool on each successive iteration.

2.1.2 Selecting Words for the Lexicon

The next step is to score the candidate words. For each word, Basilisk collects every pattern that extracted the word. All extraction patterns are used during this step, not just the patterns in the pattern pool. Initially, we used a scoring function that computes the average number of category members extracted by the patterns. The formula is:

    score(word_i) = ( sum_{j=1..P_i} F_j ) / P_i    (2)

where P_i is the number of patterns that extract word_i, and F_j is the number of distinct category members extracted by pattern j. A word receives a high score if it is extracted by patterns that also have a tendency to extract known category members.

As an example, suppose the word "Peru" is in the candidate word pool as a possible location. Basilisk finds all patterns that extract "Peru" and computes the average number of known locations extracted by those patterns. Let's assume that the three patterns shown below extract "Peru", and that they extract two, three, and two known location words, respectively. "Peru" would receive a score of (2+3+2)/3 = 2.3. Intuitively, this means that patterns that extract "Peru" also extract, on average, 2.3 known location words.

    "was killed in <pp object>"
    Extractions: Peru, clashes, a shootout, El Salvador, Colombia

    "<subject> was divided"
    Extractions: the country, the Medellin cartel, Colombia, Peru, the army, Nicaragua

    "ambassador to <pp object>"
    Extractions: Nicaragua, Peru, the UN, Panama

Unfortunately, this scoring function has a problem. The average can be heavily skewed by one pattern that extracts a large number of category members. For example, suppose word w is extracted by 10 patterns, 9 of which do not extract any category members but the tenth extracts 50 category members. The average number of category members extracted by these patterns will be 5. This is misleading because the only evidence linking word w with the semantic category is a single, high-frequency extraction pattern (which may extract words that belong to other categories as well).

To alleviate this problem, we modified the scoring function to compute the average logarithm of the number of category members extracted by each pattern. The logarithm reduces the influence of any single pattern. We will refer to this scoring metric as the AvgLog function, which is defined below. Since log2(1) = 0, we add one to each frequency count so that patterns which extract a single category member contribute a positive value.

    AvgLog(word_i) = ( sum_{j=1..P_i} log2(F_j + 1) ) / P_i    (3)

Using this scoring metric, all words in the candidate word pool are scored and the top five words are added to the semantic lexicon. The pattern pool and the candidate word pool are then emptied, and the bootstrapping process starts over again.
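A sketch of the AvgLog score from Equation (3), again with illustrative names only; it matches the score_word slot in the loop sketch of Section 2. Replacing log2(F_j + 1) with F_j in the sum would recover the skew-prone plain average of Equation (2).

    import math
    from typing import Dict, Set

    def avglog(word: str, extractions: Dict[str, Set[str]],
               lexicon: Set[str]) -> float:
        """AvgLog(word) = (1/P) * sum_j log2(F_j + 1), Equation (3).

        The sum runs over the P patterns that extract `word`; F_j counts
        the distinct known category members extracted by pattern j. The
        +1 lets single-member patterns contribute a positive value.
        """
        pats = [nouns for nouns in extractions.values() if word in nouns]
        if not pats:
            return 0.0
        return sum(math.log2(len(nouns & lexicon) + 1) for nouns in pats) / len(pats)

In the "Peru" example above, the three patterns contribute log2(3), log2(4), and log2(3) rather than 2, 3, and 2, so no single high-frequency pattern can dominate the average.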
2.1.3 Related Work

Several weakly supervised learning algorithms have previously been developed to generate semantic lexicons from text corpora. Riloff and Shepherd (1997) developed a bootstrapping algorithm that exploits lexical co-occurrence statistics, and Roark and Charniak (1998) refined this algorithm to focus more explicitly on certain syntactic structures. Hale, Ge, and Charniak (Ge et al., 1998) devised a technique to learn the gender of words. Caraballo (1999) and Hearst (1992) created techniques to learn hypernym/hyponym relationships. None of these previous algorithms used extraction patterns or similar contexts to infer semantic class associations.

Several learning algorithms have also been developed for named entity recognition (e.g., (Collins and Singer, 1999; Cucerzan and Yarowsky, 1999)). Collins and Singer (1999) used contextual information of a different sort than we do. Furthermore, our research aims to learn general nouns (e.g., "artist") rather than proper nouns, so many of the features commonly used to great advantage for named entity recognition (e.g., capitalization and title words) are not applicable to our task.

The algorithm most closely related to Basilisk is meta-bootstrapping (Riloff and Jones, 1999), which also uses extraction pattern contexts for semantic lexicon induction. Meta-bootstrapping identifies a single extraction pattern that is highly correlated with a semantic category and then assumes that all of its extracted noun phrases belong to the same category. However, this assumption is often violated, which allows incorrect terms to enter the lexicon. Riloff and Jones acknowledged this issue and used a second level of bootstrapping (the "Meta" bootstrapping level) to alleviate this problem. While meta-bootstrapping trusts individual extraction patterns to make unilateral decisions, Basilisk gathers collective evidence from a large set of extraction patterns. As we will demonstrate in Section 2.2, Basilisk's approach produces better results than meta-bootstrapping and is also considerably more efficient because it uses only a single bootstrapping loop (meta-bootstrapping uses nested bootstrapping). However, meta-bootstrapping produces category-specific extraction patterns in addition to a semantic lexicon, while Basilisk focuses exclusively on semantic lexicon induction.

2.2 Single Category Results

To evaluate Basilisk's performance, we ran experiments with the MUC-4 corpus (MUC-4 Proceedings, 1992), which contains 1700 texts associated with terrorism. We used Basilisk to learn semantic lexicons for six semantic categories: building, event, human, location, time, and weapon. Before we ran these experiments, one of the authors manually labeled every head noun in the corpus that was found by an extraction pattern. These manual annotations were the gold standard. Table 1 shows the breakdown of semantic categories for the head nouns. These numbers represent a baseline: an algorithm that randomly selects words would be expected to get accuracies consistent with these numbers.

    Category   Total   Percentage
    building    188       2.2%
    event       501       5.9%
    human      1856      21.9%
    location   1018      12.0%
    time        112       1.3%
    weapon      147       1.7%
    other      4638      54.8%

Table 1: Breakdown of semantic categories

Three semantic lexicon learners have previously been evaluated on the MUC-4 corpus (Riloff and Shepherd, 1997; Roark and Charniak, 1998; Riloff and Jones, 1999), and of these meta-bootstrapping achieved the best results. So we implemented the meta-bootstrapping algorithm ourselves to directly compare its performance with that of Basilisk. A difference between the original implementation and ours is that our version learns individual nouns (as does Basilisk) instead of noun phrases. We believe that learning individual nouns is a more conservative approach because noun phrases often overlap (e.g., "high-power bombs" and "incendiary bombs" would count as two different lexicon entries in the original meta-bootstrapping algorithm). Consequently, our meta-bootstrapping results differ from those reported in (Riloff and Jones, 1999).

Figure 3 shows the results for Basilisk (ba-1) and meta-bootstrapping (mb-1). We ran both algorithms for 200 iterations, so that 1000 words were added to the lexicon (5 words per iteration). The X axis shows the number of words learned, and the Y axis shows how many were correct. The Y axes have different ranges because some categories are more prolific than others.
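The accuracy curves in Figure 3 amount to checking each learned word against the gold-standard labels as the lexicon grows. A minimal sketch of that bookkeeping, with a hypothetical gold dictionary mapping head nouns to their annotated category:

    from typing import Dict, List, Tuple

    def accuracy_curve(learned: List[str], gold: Dict[str, str],
                       category: str) -> List[Tuple[int, int]]:
        """One (total, correct) point per learned word, as plotted in Figure 3."""
        points, correct = [], 0
        for total, word in enumerate(learned, start=1):
            if gold.get(word) == category:
                correct += 1
            points.append((total, correct))
        return points

    gold = {"Peru": "location", "Bogota": "location", "rifle": "weapon"}
    print(accuracy_curve(["Peru", "rifle", "Bogota"], gold, "location"))
    # [(1, 1), (2, 1), (3, 2)]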
Basilisk outperforms meta-bootstrapping for every category, often substantially. For the human and location categories, Basilisk learned hundreds of words, with accuracies in the 80-89% range through much of the bootstrapping. It is worth noting that Basilisk's performance held up well on the human and location categories even at the end, achieving 79.5% (795/1000) accuracy for humans and 53.2% (532/1000) accuracy for locations.

Figure 3: Basilisk and Meta-Bootstrapping Results, Single Category. (Six panels, one per category, plot correct lexicon entries against total lexicon entries for ba-1 and mb-1.)

3 Learning Multiple Semantic Categories Simultaneously

We also explored the idea of bootstrapping multiple semantic classes simultaneously. Our hypothesis was that errors of confusion[2] between semantic categories can be lessened by using information about multiple categories. This hypothesis makes sense only if a word cannot belong to more than one semantic class. In general, this is not true because words are often polysemous. But within a limited domain, a word usually has a dominant word sense. Therefore we make a "one sense per domain" assumption (similar to the "one sense per discourse" observation (Gale et al., 1992)) that a word belongs to a single semantic category within a limited domain. All of our experiments involve the MUC-4 terrorism domain and corpus, for which this assumption seems appropriate.

[2] We use the term confusion to refer to errors where a word is labeled as category X when it really belongs to category Y.

3.1 Motivation

Figure 4 shows one way of viewing the task of semantic lexicon induction. The set of all words in the corpus is visualized as a search space. Each category owns a certain territory within the space (demarcated with a dashed line), representing the words that are true members of that category. Not all territories are the same size, since some categories have more members than others.

Figure 4 illustrates what happens when a semantic lexicon is generated for a single category. The seed words for the category (in this case, category C) are represented by the solid black area in category C's territory. The hypothesized words in the growing lexicon are represented by a shaded area. The goal of the bootstrapping algorithm is to expand the area of hypothesized words so that it exactly matches the category's true territory. If the shaded area expands beyond the category's true territory, then incorrect words have been added to the lexicon. In Figure 4, category C has claimed a significant number of words that belong to categories B and E. When generating a lexicon for one category at a time, these confusion errors are impossible to detect because the learner has no knowledge of the other categories.

Figure 4: Bootstrapping a Single Category. (A search space partitioned into territories for categories A-F; category C's seed words and hypothesized words appear as solid and shaded regions.)

Figure 5 shows the same search space when lexicons are generated for six categories simultaneously. If the lexicons cannot overlap, then we constrain the ability of a category to overstep its bounds. Category C is stopped when it begins to encroach upon the territories of categories B and E because words in those areas have already been claimed.

Figure 5: Bootstrapping Multiple Categories. (The same search space, with all six categories growing lexicons at once.)

3.2 Simple Conflict Resolution

The easiest way to take advantage of multiple categories is to add simple conflict resolution that enforces the "one sense per domain" constraint. If more than one category tries to claim a word, then we use conflict resolution to decide which category should win. We incorporated a simple conflict resolution procedure into Basilisk, as well as the meta-bootstrapping algorithm. For both algorithms, the conflict resolution procedure works as follows. (1) If a word is hypothesized for category A but has already been assigned to category B during a previous iteration, then the category A hypothesis is discarded. (2) If a word is hypothesized for both category A and category B during the same iteration, then it is assigned to the category for which it receives the highest score. In Section 3.4, we will present empirical results showing how this simple conflict resolution scheme affects performance.
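A minimal sketch of this procedure in Python, assuming each category proposes scored words on every iteration; the data shapes and names are ours, not the paper's:

    from typing import Dict, List, Tuple

    def resolve_conflicts(
        proposals: Dict[str, List[Tuple[str, float]]],  # category -> (word, score)
        assigned: Dict[str, str],                        # word -> category so far
    ) -> Dict[str, str]:
        """Apply rules (1) and (2) of Section 3.2 to one iteration's hypotheses."""
        winners: Dict[str, Tuple[str, float]] = {}
        for category, hypotheses in proposals.items():
            for word, score in hypotheses:
                if word in assigned:
                    continue  # rule (1): earlier assignments are final
                if word not in winners or score > winners[word][1]:
                    winners[word] = (category, score)  # rule (2): best score wins
        return {word: cat for word, (cat, score) in winners.items()}

    # "zone" is contested this iteration; "Peru" was claimed earlier.
    assigned = {"Peru": "location"}
    proposals = {"location": [("zone", 2.1), ("Peru", 3.0)],
                 "building": [("zone", 1.4)]}
    print(resolve_conflicts(proposals, assigned))  # {'zone': 'location'}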

3.3 A Smarter Scoring Function for Multiple Categories

Simple conflict resolution helps the algorithm recognize when it has encroached on another category's territory, but it does not actively steer the bootstrapping in a more promising direction. A more intelligent way to handle multiple categories is to incorporate knowledge about other categories directly into the scoring function. We modified Basilisk's scoring function to prefer words that have strong evidence for one category but little or no evidence for competing categories. Each word w_i in the candidate word pool receives a score for category c_a based on the following formula:

    diff(w_i, c_a) = AvgLog(w_i, c_a) - max_{b≠a} AvgLog(w_i, c_b)

where AvgLog is the candidate scoring function used previously by Basilisk (see Equation 3) and the max function returns the maximum AvgLog value over all competing categories. For example, the score for each candidate location word will be its AvgLog score for the location category minus its maximum AvgLog score for all other categories. A word is ranked highly only if it has a high score for the targeted category and there is little evidence that it belongs to a different category. This has the effect of steering the bootstrapping process away from ambiguous parts of the search space.
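The diff score follows directly from Equation (3); here is a sketch assuming an avglog(word, category) callable that closes over each category's lexicon and the pattern extractions (signatures are illustrative, not from the paper):

    from typing import Callable, Iterable

    def diff_score(
        word: str,
        category: str,
        categories: Iterable[str],
        avglog: Callable[[str, str], float],  # AvgLog(word, category), Equation (3)
    ) -> float:
        """diff(w, c_a) = AvgLog(w, c_a) - max over b != a of AvgLog(w, c_b)."""
        competitors = [avglog(word, c) for c in categories if c != category]
        # With no competing categories the score reduces to plain AvgLog.
        return avglog(word, category) - max(competitors, default=0.0)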

3.4 Multiple Category Results

We will use the abbreviation 1CAT to indicate that only one semantic category was bootstrapped, and MCAT to indicate that multiple semantic categories were simultaneously bootstrapped. Figure 6 compares the performance of Basilisk-MCAT with conflict resolution (ba-M) against Basilisk-1CAT (ba-1). Most categories show small performance gains, with the building, location, and weapon categories benefitting the most. However, the improvement usually doesn't kick in until many bootstrapping iterations have passed. This phenomenon is consistent with the visualization of the search space in Figure 5. Since the seed words for each category are not generally located near each other in the search space, the bootstrapping process is unaffected by conflict resolution until the categories begin to encroach on each other's territories.

Figure 6: Basilisk, MCAT vs. 1CAT. (Six per-category panels plot correct lexicon entries against total lexicon entries for ba-M and ba-1.)

Figure 7 compares the performance of Meta-Bootstrapping-MCAT with conflict resolution (mb-M) against Meta-Bootstrapping-1CAT (mb-1). Learning multiple categories improves the performance of meta-bootstrapping dramatically for most categories. We were surprised that the improvement for meta-bootstrapping was much more pronounced than for Basilisk. It seems that Basilisk was already doing a better job with errors of confusion, so meta-bootstrapping had more room for improvement.

Figure 7: Meta-Bootstrapping, MCAT vs. 1CAT. (Six per-category panels plot correct lexicon entries against total lexicon entries for mb-M and mb-1.)

Finally, we evaluated Basilisk using the diff scoring function to handle multiple categories. Figure 8 compares all three MCAT algorithms, with the smarter diff version of Basilisk labeled as ba-M+. Overall, this version of Basilisk performs best, showing a small improvement over the version with simple conflict resolution. Both multiple category versions of Basilisk also consistently outperform the multiple category version of meta-bootstrapping.

Figure 8: MetaBoot-MCAT vs. Basilisk-MCAT vs. Basilisk-MCAT+. (Six per-category panels plot correct lexicon entries against total lexicon entries for ba-M+, ba-M, and mb-M.)

Table 2 summarizes the improvement of the best version of Basilisk over the original meta-bootstrapping algorithm. The left-hand column represents the number of words learned, and each cell indicates how many of those words were correct. These results show that Basilisk produces substantially better accuracy and coverage than meta-bootstrapping.

    Category   Total Words   MetaBoot 1CAT   Basilisk MCAT+
    building       100        21 (21.0%)       39 (39.0%)
                   200        28 (14.0%)       72 (36.0%)
                   500        33 (6.6%)       100 (20.0%)
                   800        39 (4.9%)       109 (13.6%)
                  1000        43 (4.3%)       n/a
    event          100        61 (61.0%)       61 (61.0%)
                   200        89 (44.5%)      114 (57.0%)
                   500       146 (29.2%)      186 (37.2%)
                   800       172 (21.5%)      240 (30.0%)
                  1000       190 (19.0%)      266 (26.6%)
    human          100        36 (36.0%)       84 (84.0%)
                   200        53 (26.5%)      173 (86.5%)
                   500       143 (28.6%)      431 (86.2%)
                   800       224 (28.0%)      681 (85.1%)
                  1000       278 (27.8%)      829 (82.9%)
    location       100        54 (54.0%)       84 (84.0%)
                   200        99 (49.5%)      175 (87.5%)
                   500       237 (47.4%)      371 (74.2%)
                   800       302 (37.8%)      509 (63.6%)
                  1000       310 (31.0%)      n/a
    time           100         9 (9.0%)        30 (30.0%)
                   200        13 (6.5%)        33 (16.5%)
                   500        21 (4.2%)        37 (7.4%)
                   800        25 (3.1%)        43 (5.4%)
                  1000        26 (2.6%)        45 (4.5%)
    weapon         100        23 (23.0%)       42 (42.0%)
                   200        24 (12.0%)       62 (31.0%)
                   500        29 (5.8%)        85 (17.0%)
                   800        33 (4.1%)        88 (11.0%)
                  1000        33 (3.3%)        n/a

Table 2: Lexicon Results

Figure 9 shows examples of words learned by Basilisk. Inspection of the lexicons reveals many unusual words that could be easily overlooked by someone building a dictionary by hand. For example, the words "deserter" and "narcoterrorists" appear in a variety of terrorism articles but they are not commonly used words in general.

    Building: theatre store cathedral temple palace penitentiary academy houses school mansions
    Event: ambush assassination uprisings sabotage takeover incursion kidnappings clash shoot-out
    Human: boys snipers detainees commandoes extremists deserter narcoterrorists demonstrators cronies missionaries
    Location: suburb Soyapango capital Oslo regions cities neighborhoods Quito corregimiento
    Time: afternoon evening decade hour March weeks Saturday eve anniversary Wednesday
    Weapon: cannon grenade launchers firebomb car-bomb rifle pistol machineguns firearms

Figure 9: Example Semantic Lexicon Entries

We also measured the recall of Basilisk's lexicons after 1000 words had been learned, based on the gold standard data shown in Table 1. The recall results range from 40-60%, which indicates that a good percentage of the category words are being found, although there are clearly more category words lurking in the corpus.

4 Conclusions

Basilisk's bootstrapping algorithm exploits two ideas: (1) collective evidence from extraction patterns can be used to infer semantic category associations, and (2) learning multiple semantic categories simultaneously can help constrain the bootstrapping process. The accuracy achieved by Basilisk is substantially higher than that of previous techniques for semantic lexicon induction on the MUC-4 corpus, and empirical results show that both of Basilisk's ideas contribute to its performance. We also demonstrated that learning multiple semantic categories simultaneously improves the meta-bootstrapping algorithm, which suggests that this idea may carry over to other bootstrapping algorithms as well.
In Proceedings of cities neighborhoods Quito corregimiento the Fourteenth International Conference on Computa- Time: afternoon evening decade hour March tional Linguistics (COLING-92). weeks Saturday eve anniversary Wednesday Lynette Hirschman, Marc Light, Eric Breck, and John D. Weapon: cannon grenade launchers firebomb Burger. 1999. Deep Read: A reading comprehension car-bomb rifle pistol machineguns firearms system. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. Figure 9: Example Semantic Lexicon Entries M. Marcus, B. Santorini, and M. Marcinkiewicz. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, strated that learning multiple semantic categories si- 19(2):313–330. multaneously improves the meta-bootstrapping algo- George Miller. 1990. Wordnet: An on-line lexical rithm, which suggests that this is a general observa- database. In International Journal of Lexicography. tion which may improve other bootstrapping algo- rithms as well. Dan Moldovan, Sanda Harabagiu, Marius Pa¸sca, Rada Mihalcea, Richard Goodrum, Roxana Gˆirju, and Vasile Rus. 1999. Lasso: A tool for surfing the an- 5 Acknowledgments swer net. In Proceedings of the Eighth Text REtrieval Conference (TREC-8). This research was supported by the National Science Foundation under award IRI-9704240. MUC-4 Proceedings. 1992. Proceedings of the Fourth Message Understanding Conference (MUC-4).Mor- gan Kaufmann, San Mateo, CA. References E. Riloff and R. Jones. 1999. Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. Chinatsu Aone and Scott William Bennett. 1996. Ap- In Proceedings of the Sixteenth National Conference on plying machine learning to anaphora resolution. In Artificial Intelligence. Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Understanding, pages E. Riloff and M. Schmelzenbach. 1998. An Empirical 302–314. Springer-Verlag, Berlin. Approach to Conceptual Case Frame Acquisition. In Proceedings of the Sixth Workshop on Very Large Cor- E. Brill and P. Resnik. 1994. A Transformation- pora, pages 49–56. based Approach to Prepositional Phrase Attachment Disambiguation. In Proceedings of the Fifteenth In- E. Riloff and J. Shepherd. 1997. A Corpus-Based Ap- ternational Conference on Computational Linguistics proach for Building Semantic Lexicons. In Proceed- (COLING-94). ings of the Second Conference on Empirical Methods in Natural Language Processing, pages 117–124. S. Caraballo. 1999. Automatic Acquisition of a Hypernym-Labeled Noun Hierarchy from Text. In E. Riloff. 1996. Automatically Generating Extraction Proceedings of the 37th Annual Meeting of the Associ- Patterns from Untagged Text. In Proceedings of the ation for Computational Linguistics, pages 120–126. Thirteenth National Conference on Artificial Intelli- gence, pages 1044–1049. The AAAI Press/MIT Press. M. Collins and Y. Singer. 1999. Unsupervised Models B. Roark and E. Charniak. 1998. Noun-phrase Co- for Named Entity Classification. In Proceedings of the occurrence Statistics for Semi-automatic Semantic Joint SIGDAT Conference on Empirical Me thods in Lexicon Construction. In Proceedings of the 36th An- Natural Language Processing and Very Large Corpora nual Meeting of the Association for Computational (EMNLP/VLC-99). Linguistics, pages 1110–1116. S. Cucerzan and D. Yarowsky. 1999. 
Language Inde- Stephen Soderland, David Fisher, Jonathan Aseltine, pendent Named Entity Recognition Combining Morph and Wendy Lehnert. 1995. CRYSTAL: Inducing a ological and Contextual Evidence. In Proceedings of conceptual dictionary. In Proceedings of the Four- the Joint SIGDAT Conference on Empirical Me thods teenth International Joint Conference on Artificial In- in Natural Language Processing and Very Large Cor- telligence, pages 1314–1319. pora (EMNLP/VLC-99).