Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Philadelphia, July 2002, pp. 214-221. Association for Computational Linguistics.
A Bootstrapping Method for Learning Semantic Lexicons using Extraction Pattern Contexts
Michael Thelen and Ellen Riloff
School of Computing
University of Utah
Salt Lake City, UT 84112 USA
{thelenm,riloff}@cs.utah.edu
Abstract

This paper describes a bootstrapping algorithm called Basilisk that learns high-quality semantic lexicons for multiple categories. Basilisk begins with an unannotated corpus and seed words for each semantic category, which are then bootstrapped to learn new words for each category. Basilisk hypothesizes the semantic class of a word based on collective information over a large body of extraction pattern contexts. We evaluate Basilisk on six semantic categories. The semantic lexicons produced by Basilisk have higher precision than those produced by previous techniques, with several categories showing substantial improvement.

1 Introduction

In recent years, several algorithms have been developed to acquire semantic lexicons automatically or semi-automatically using corpus-based techniques. For our purposes, the term semantic lexicon will refer to a dictionary of words labeled with semantic classes (e.g., "bird" is an animal and "truck" is a vehicle). Semantic class information has proven to be useful for many natural language processing tasks, including information extraction (Riloff and Schmelzenbach, 1998; Soderland et al., 1995), anaphora resolution (Aone and Bennett, 1996), question answering (Moldovan et al., 1999; Hirschman et al., 1999), and prepositional phrase attachment (Brill and Resnik, 1994). Although some semantic dictionaries do exist (e.g., WordNet (Miller, 1990)), these resources often do not contain the specialized vocabulary and jargon that is needed for specific domains. Even for relatively general texts, such as the Wall Street Journal (Marcus et al., 1993) or terrorism articles (MUC-4 Proceedings, 1992), Roark and Charniak (1998) reported that 3 of every 5 terms generated by their semantic lexicon learner were not present in WordNet. These results suggest that automatic semantic lexicon acquisition could be used to enhance existing resources such as WordNet, or to produce semantic lexicons for specialized domains.

We have developed a weakly supervised bootstrapping algorithm called Basilisk that automatically generates semantic lexicons. Basilisk hypothesizes the semantic class of a word by gathering collective evidence about semantic associations from extraction pattern contexts. Basilisk also learns multiple semantic classes simultaneously, which helps constrain the bootstrapping process.

First, we present Basilisk's bootstrapping algorithm and explain how it differs from previous work on semantic lexicon induction. Second, we present empirical results showing that Basilisk outperforms a previous algorithm. Third, we explore the idea of learning multiple semantic categories simultaneously by adding this capability to Basilisk as well as another bootstrapping algorithm. Finally, we present results showing that learning multiple semantic categories simultaneously improves performance.

2 Bootstrapping using Collective Evidence from Extraction Patterns

Basilisk (Bootstrapping Approach to SemantIc Lexicon Induction using Semantic Knowledge) is a weakly supervised bootstrapping algorithm that automatically generates semantic lexicons. Figure 1 shows the high-level view of Basilisk's bootstrapping process.

[Figure 1 (diagram omitted): Basilisk Algorithm. Seed words initialize the semantic lexicon; the best patterns are selected into the pattern pool; the extractions of those patterns fill the candidate word pool; the 5 best candidate words are added to the lexicon.]

The input to Basilisk is an unannotated text corpus and a few manually defined seed words for each semantic category. Before bootstrapping begins, we run an extraction pattern learner over the corpus, which generates an extraction pattern for every noun phrase that appears. The patterns are then applied to the corpus and all of their extracted noun phrases are recorded.

The bootstrapping process begins by selecting a subset of the extraction patterns that tend to extract the seed words. We call this the pattern pool. The nouns extracted by these patterns become candidates for the lexicon and are placed in a candidate word pool. Basilisk scores each candidate word by gathering all patterns that extract it and measuring how strongly those contexts are associated with words that belong to the semantic category. The five best candidate words are added to the lexicon, and the process starts over again. Figure 2 shows the bootstrapping process that follows, which we explain in the following sections. In this section, we describe Basilisk's bootstrapping algorithm in more detail and discuss related work.

    Generate all extraction patterns in the corpus
    and record their extractions.
    lexicon = {seed words}
    i := 0

    BOOTSTRAPPING
    1. Score all extraction patterns
    2. pattern pool = top ranked 20+i patterns
    3. candidate word pool = extractions of patterns in pattern pool
    4. Score candidate words in candidate word pool
    5. Add top 5 candidate words to lexicon
    6. i := i + 1
    7. Go to Step 1.

Figure 2: Basilisk's bootstrapping algorithm

2.1 Basilisk

The input to Basilisk is a text corpus and a set of seed words. We generated seed words by sorting the words in the corpus by frequency and manually identifying the 10 most frequent nouns that belong to each category. These seed words form the initial semantic lexicon. In this section we describe the learning process for a single semantic category. In Section 3 we will explain how the process is adapted to handle multiple categories simultaneously.

To identify new lexicon entries, Basilisk relies on extraction patterns to provide contextual evidence that a word belongs to a semantic class. As our representation for extraction patterns, we used the AutoSlog system (Riloff, 1996). AutoSlog's extraction patterns represent linguistic expressions that extract a noun phrase in one of three syntactic roles: subject, direct object, or prepositional phrase object. For example, three patterns that would extract people are: "..."

2.1.1 The Pattern Pool and Candidate Pool

The first step in the bootstrapping process is to score the extraction patterns based on their tendency to extract known category members. All words that are currently defined in the semantic lexicon are considered to be category members. Basilisk scores each pattern using the RlogF metric that has been used for extraction pattern learning (Riloff, 1996). The score for each pattern is computed as:

    RlogF(pattern_i) = (F_i / N_i) * log2(F_i)    (1)

where F_i is the number of category members extracted by pattern_i and N_i is the total number of nouns extracted by pattern_i. Intuitively, the RlogF metric is a weighted conditional probability; a pattern receives a high score if a high percentage of its extractions are known category members.
[Figure 3 (plots omitted): Basilisk and Meta-Bootstrapping Results, Single Category; correct lexicon entries vs. total lexicon entries for each category, comparing Basilisk (ba-1) against Meta-Bootstrapping (mb-1).]

Our approach assumes (akin to the "one sense per discourse" observation (Gale et al., 1992)) that a word belongs to a single semantic category within a limited domain. All of our experiments involve the MUC-4 terrorism domain and corpus, for which this assumption seems appropriate.

3.1 Motivation

Figure 4 shows one way of viewing the task of semantic lexicon induction. The set of all words in the corpus is visualized as a search space. Each category owns a certain territory within the space (demarcated with a dashed line), representing the words that are true members of that category. Not all territories are the same size, since some categories have more members than others. A single-category learner can drift out of its own territory: in Figure 4, category C has acquired words that belong to categories B and E. When generating a lexicon for one category at a time, these confusion errors are impossible to detect because the learner has no knowledge of the other categories.

[Figure 4 (diagram omitted): Bootstrapping a Single Category; categories A-F occupy territories in the search space.]

Figure 5 shows the same search space when lexicons are generated for six categories simultaneously. If the lexicons cannot overlap, then we constrain the ability of a category to overstep its bounds. Category C is stopped when it begins to encroach upon the territories of categories B and E because words in those areas have already been claimed.

[Figure 5 (diagram omitted): Bootstrapping Multiple Categories.]

3.2 Simple Conflict Resolution

The easiest way to take advantage of multiple categories is to add simple conflict resolution that enforces the "one sense per domain" constraint. If more than one category tries to claim a word, then we use conflict resolution to decide which category should win. We incorporated a simple conflict resolution procedure into Basilisk, as well as the meta-bootstrapping algorithm. For both algorithms, the conflict resolution procedure works as follows. (1) If a word is hypothesized for category A but has already been assigned to category B during a previous iteration, then the category A hypothesis is discarded. (2) If a word is hypothesized for both category A and category B during the same iteration, then it is assigned to the category for which it receives the highest score.
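The two-rule procedure above can be sketched in a few lines of Python. This is a minimal illustration; the function and variable names are ours, not from the paper.

```python
def resolve_conflicts(hypotheses, assigned):
    """Decide which category wins each newly hypothesized word.

    hypotheses: dict mapping category -> {word: score} for the
                current bootstrapping iteration
    assigned:   dict mapping word -> category for words claimed in
                previous iterations
    Returns a dict mapping each newly claimed word to its winning category.
    """
    winners = {}
    for category, scored_words in hypotheses.items():
        for word, score in scored_words.items():
            # Rule (1): a word assigned in an earlier iteration keeps its
            # original category; the new hypothesis is discarded.
            if word in assigned:
                continue
            # Rule (2): among same-iteration claims, the category giving
            # the word the highest score wins it.
            if word not in winners or score > winners[word][1]:
                winners[word] = (category, score)
    return {word: cat for word, (cat, _) in winners.items()}
```

For example, if "sniper" was assigned to the human category in an earlier iteration, a new location hypothesis for it is dropped by Rule (1), while a word claimed by two categories in the same iteration goes to the higher-scoring one by Rule (2).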
In Section 3.4, we will present empirical results showing how this simple conflict resolution scheme affects performance.

3.3 A Smarter Scoring Function for Multiple Categories

Simple conflict resolution helps the algorithm recognize when it has encroached on another category's territory, but it does not actively steer the bootstrapping in a more promising direction. A more intelligent way to handle multiple categories is to incorporate knowledge about other categories directly into the scoring function. We modified Basilisk's scoring function to prefer words that have strong evidence for one category but little or no evidence for competing categories. Each word w_i in the candidate word pool receives a score for category c_a based on the following formula:

    diff(w_i, c_a) = AvgLog(w_i, c_a) - max_{b != a}(AvgLog(w_i, c_b))

where AvgLog is the candidate scoring function used previously by Basilisk (see Equation 3) and the max function returns the maximum AvgLog value over all competing categories. For example, the score for each candidate location word will be its AvgLog score for the location category minus its maximum AvgLog score for all other categories. A word is ranked highly only if it has a high score for the targeted category and there is little evidence that it belongs to a different category. This has the effect of steering the bootstrapping process away from ambiguous parts of the search space.

3.4 Multiple Category Results

We will use the abbreviation 1CAT to indicate that only one semantic category was bootstrapped, and MCAT to indicate that multiple semantic categories were simultaneously bootstrapped. Figure 6 compares the performance of Basilisk-MCAT with conflict resolution (ba-M) against Basilisk-1CAT (ba-1). Most categories show small performance gains, with the building, location, and weapon categories benefitting the most. However, the improvement usually does not kick in until many bootstrapping iterations have passed. This phenomenon is consistent with the visualization of the search space in Figure 5. Since the seed words for each category are not generally located near each other in the search space, the bootstrapping process is unaffected by conflict resolution until the categories begin to encroach on each other's territories.

[Figure 6 (plots omitted): Basilisk, MCAT vs. 1CAT; correct lexicon entries vs. total lexicon entries per category for ba-M and ba-1.]

Figure 7 compares the performance of Meta-Bootstrapping-MCAT with conflict resolution (mb-M) against Meta-Bootstrapping-1CAT (mb-1).

[Figure 7 (plots omitted): Meta-Bootstrapping, MCAT vs. 1CAT; correct lexicon entries vs. total lexicon entries per category for mb-M and mb-1.]
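The diff scoring function of Section 3.3 reduces to a few lines of Python. In this sketch, AvgLog is passed in as a function, since its definition (Equation 3) belongs to the single-category algorithm; the fallback for a lone category is our assumption, not from the paper.

```python
def diff_score(word, category, categories, avglog):
    """diff(w, c_a) = AvgLog(w, c_a) - max over b != a of AvgLog(w, c_b).

    avglog: Basilisk's per-category candidate scorer (Equation 3),
            called as avglog(word, category).
    """
    competing = [avglog(word, other) for other in categories
                 if other != category]
    if not competing:
        # With a single category there are no competitors; fall back to
        # the plain AvgLog score (an assumption for this sketch).
        return avglog(word, category)
    return avglog(word, category) - max(competing)
```

A word scoring 3.0 for human but 2.0 for weapon gets a diff of only 1.0, so ambiguous candidates sink in the ranking even when their raw category score is high.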
Learning multiple categories improves the performance of meta-bootstrapping dramatically for most categories. We were surprised that the improvement for meta-bootstrapping was much more pronounced than for Basilisk. It seems that Basilisk was already doing a better job with errors of confusion, so meta-bootstrapping had more room for improvement.

Finally, we evaluated Basilisk using the diff scoring function to handle multiple categories. Figure 8 compares all three MCAT algorithms, with the smarter diff version of Basilisk labeled as ba-M+. Overall, this version of Basilisk performs best, showing a small improvement over the version with simple conflict resolution. Both multiple category versions of Basilisk also consistently outperform the multiple category version of meta-bootstrapping.

[Figure 8 (plots omitted): MetaBoot-MCAT vs. Basilisk-MCAT vs. Basilisk-MCAT+; correct lexicon entries vs. total lexicon entries per category for mb-M, ba-M, and ba-M+.]

Table 2 summarizes the improvement of the best version of Basilisk over the original meta-bootstrapping algorithm. The left-hand column represents the number of words learned and each cell indicates how many of those words were correct. These results show that Basilisk produces substantially better accuracy and coverage than meta-bootstrapping.

    Category  Total Words  MetaBoot 1CAT  Basilisk MCAT+
    building      100       21 (21.0%)     39 (39.0%)
    building      200       28 (14.0%)     72 (36.0%)
    building      500       33 (6.6%)     100 (20.0%)
    building      800       39 (4.9%)     109 (13.6%)
    building     1000       43 (4.3%)     n/a
    event         100       61 (61.0%)     61 (61.0%)
    event         200       89 (44.5%)    114 (57.0%)
    event         500      146 (29.2%)    186 (37.2%)
    event         800      172 (21.5%)    240 (30.0%)
    event        1000      190 (19.0%)    266 (26.6%)
    human         100       36 (36.0%)     84 (84.0%)
    human         200       53 (26.5%)    173 (86.5%)
    human         500      143 (28.6%)    431 (86.2%)
    human         800      224 (28.0%)    681 (85.1%)
    human        1000      278 (27.8%)    829 (82.9%)
    location      100       54 (54.0%)     84 (84.0%)
    location      200       99 (49.5%)    175 (87.5%)
    location      500      237 (47.4%)    371 (74.2%)
    location      800      302 (37.8%)    509 (63.6%)
    location     1000      310 (31.0%)    n/a
    time          100        9 (9.0%)      30 (30.0%)
    time          200       13 (6.5%)      33 (16.5%)
    time          500       21 (4.2%)      37 (7.4%)
    time          800       25 (3.1%)      43 (5.4%)
    time         1000       26 (2.6%)      45 (4.5%)
    weapon        100       23 (23.0%)     42 (42.0%)
    weapon        200       24 (12.0%)     62 (31.0%)
    weapon        500       29 (5.8%)      85 (17.0%)
    weapon        800       33 (4.1%)      88 (11.0%)
    weapon       1000       33 (3.3%)      n/a

Table 2: Lexicon Results

Figure 9 shows examples of words learned by Basilisk.

    Building: theatre store cathedral temple palace penitentiary academy houses school mansions
    Event: ambush assassination uprisings sabotage takeover incursion kidnappings clash shoot-out
    Human: boys snipers detainees commandoes extremists deserter narcoterrorists demonstrators cronies missionaries
    Location: suburb Soyapango capital Oslo regions cities neighborhoods Quito corregimiento
    Time: afternoon evening decade hour March weeks Saturday eve anniversary Wednesday
    Weapon: cannon grenade launchers firebomb car-bomb rifle pistol machineguns firearms

Figure 9: Example Semantic Lexicon Entries

Inspection of the lexicons reveals many unusual words that could be easily overlooked by someone building a dictionary by hand. For example, the words "deserter" and "narcoterrorists" appear in a variety of terrorism articles but they are not commonly used words in general.

We also measured the recall of Basilisk's lexicons after 1000 words had been learned, based on the gold standard data shown in Table 1. The recall results range from 40-60%, which indicates that a good percentage of the category words are being found, although there are clearly more category words lurking in the corpus.

4 Conclusions

Basilisk's bootstrapping algorithm exploits two ideas: (1) collective evidence from extraction patterns can be used to infer semantic category associations, and (2) learning multiple semantic categories simultaneously can help constrain the bootstrapping process. The accuracy achieved by Basilisk is substantially higher than that of previous techniques for semantic lexicon induction on the MUC-4 corpus, and empirical results show that both of Basilisk's ideas contribute to its performance. We also demonstrated that learning multiple semantic categories simultaneously improves the meta-bootstrapping algorithm, which suggests that this is a general observation that may improve other bootstrapping algorithms as well.

5 Acknowledgments

This research was supported by the National Science Foundation under award IRI-9704240.

References

Chinatsu Aone and Scott William Bennett. 1996. Applying machine learning to anaphora resolution. In Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Understanding, pages 302-314. Springer-Verlag, Berlin.

E. Brill and P. Resnik. 1994. A Transformation-based Approach to Prepositional Phrase Attachment Disambiguation. In Proceedings of the Fifteenth International Conference on Computational Linguistics (COLING-94).

S. Caraballo. 1999. Automatic Acquisition of a Hypernym-Labeled Noun Hierarchy from Text. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 120-126.

M. Collins and Y. Singer. 1999. Unsupervised Models for Named Entity Classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-99).

S. Cucerzan and D. Yarowsky. 1999. Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-99).

W. Gale, K. Church, and D. Yarowsky. 1992. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26:415-439.

Niyu Ge, John Hale, and Eugene Charniak. 1998. A statistical approach to anaphora resolution. In Proceedings of the Sixth Workshop on Very Large Corpora.

M. Hearst. 1992. Automatic Acquisition of Hyponyms from Large Text Corpora. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING-92).

Lynette Hirschman, Marc Light, Eric Breck, and John D. Burger. 1999. Deep Read: A reading comprehension system. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics.

M. Marcus, B. Santorini, and M. Marcinkiewicz. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330.

George Miller. 1990. WordNet: An on-line lexical database. International Journal of Lexicography.

Dan Moldovan, Sanda Harabagiu, Marius Paşca, Rada Mihalcea, Richard Goodrum, Roxana Gîrju, and Vasile Rus. 1999. Lasso: A tool for surfing the answer net. In Proceedings of the Eighth Text REtrieval Conference (TREC-8).

MUC-4 Proceedings. 1992. Proceedings of the Fourth Message Understanding Conference (MUC-4). Morgan Kaufmann, San Mateo, CA.

E. Riloff and R. Jones. 1999. Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. In Proceedings of the Sixteenth National Conference on Artificial Intelligence.

E. Riloff and M. Schmelzenbach. 1998. An Empirical Approach to Conceptual Case Frame Acquisition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 49-56.

E. Riloff and J. Shepherd. 1997. A Corpus-Based Approach for Building Semantic Lexicons. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, pages 117-124.

E. Riloff. 1996. Automatically Generating Extraction Patterns from Untagged Text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 1044-1049. The AAAI Press/MIT Press.

B. Roark and E. Charniak. 1998. Noun-phrase Co-occurrence Statistics for Semi-automatic Semantic Lexicon Construction. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, pages 1110-1116.

Stephen Soderland, David Fisher, Jonathan Aseltine, and Wendy Lehnert. 1995. CRYSTAL: Inducing a conceptual dictionary. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1314-1319.