A Corpus-Based Approach for Building Semantic Lexicons
Ellen Riloff and Jessica Shepherd
Department of Computer Science
University of Utah
Salt Lake City, UT 84112
riloff@cs.utah.edu

Abstract

Semantic knowledge can be a great asset to natural language processing systems, but it is usually hand-coded for each application. Although some semantic information is available in general-purpose knowledge bases such as WordNet and Cyc, many applications require domain-specific lexicons that represent words and categories for a particular topic. In this paper, we present a corpus-based method that can be used to build semantic lexicons for specific categories. The input to the system is a small set of seed words for a category and a representative text corpus. The output is a ranked list of words that are associated with the category. A user then reviews the top-ranked words and decides which ones should be entered in the semantic lexicon. In experiments with five categories, users typically found about 60 words per category in 10-15 minutes to build a core semantic lexicon.

1 Introduction

Semantic information can be helpful in almost all aspects of natural language understanding, including word sense disambiguation, selectional restrictions, attachment decisions, and discourse processing. Semantic knowledge can add a great deal of power and accuracy to natural language processing systems. But semantic information is difficult to obtain. In most cases, semantic knowledge is encoded manually for each application.

There have been a few large-scale efforts to create broad semantic knowledge bases, such as WordNet (Miller, 1990) and Cyc (Lenat, Prakash, and Shepherd, 1986). While these efforts may be useful for some applications, we believe that they will never fully satisfy the need for semantic knowledge. Many domains are characterized by their own sublanguage containing terms and jargon specific to the field. Representing all sublanguages in a single knowledge base would be nearly impossible. Furthermore, domain-specific semantic lexicons are useful for minimizing ambiguity problems. Within the context of a restricted domain, many polysemous words have a strong preference for one word sense, so knowing the most probable word sense in a domain can strongly constrain the ambiguity.

We have been experimenting with a corpus-based method for building semantic lexicons semi-automatically. Our system uses a text corpus and a small set of seed words for a category to identify other words that also belong to the category. The algorithm uses simple statistics and a bootstrapping mechanism to generate a ranked list of potential category words. A human then reviews the top words and selects the best ones for the dictionary. Our approach is geared toward fast semantic lexicon construction: given a handful of seed words for a category and a representative text corpus, one can build a semantic lexicon for a category in just a few minutes.

In the first section, we describe the statistical bootstrapping algorithm for identifying candidate category words and ranking them. Next, we describe experimental results for five categories. Finally, we discuss our experiences with additional categories and seed word lists, and summarize our results.

2 Generating a Semantic Lexicon

Our work is based on the observation that category members are often surrounded by other category members in text, for example in conjunctions (lions and tigers and bears), lists (lions, tigers, bears...), appositives (the stallion, a white Arabian), and nominal compounds (Arabian stallion; tuna fish). Given a few category members, we wondered whether it would be possible to collect surrounding contexts and use statistics to identify other words that also belong to the category. Our approach was motivated by Yarowsky's word sense disambiguation algorithm (Yarowsky, 1992) and the notion of statistical salience, although our system uses somewhat different statistical measures and techniques.
We begin with a small set of seed words for a category. We experimented with different numbers of seed words, but were surprised to find that only 5 seed words per category worked quite well. As an example, the seed word lists used in our experiments are shown below.

Energy: fuel gas gasoline oil power
Financial: bank banking currency dollar money
Military: army commander infantry soldier troop
Vehicle: airplane car jeep plane truck
Weapon: bomb dynamite explosives gun rifle

Figure 1: Initial Seed Word Lists

The input to our system is a text corpus and an initial set of seed words for each category. Ideally, the text corpus should contain many references to the category. Our approach is designed for domain-specific text processing, so the text corpus should be a representative sample of texts for the domain and the categories should be semantic classes associated with the domain. Given a text corpus and an initial seed word list for a category C, the algorithm for building a semantic lexicon is as follows:

1. We identify all sentences in the text corpus that contain one of the seed words. Each sentence is given to our parser, which segments the sentence into simple noun phrases, verb phrases, and prepositional phrases. For our purposes, we do not need any higher level parse structures.

2. We collect small context windows surrounding each occurrence of a seed word as a head noun in the corpus. Restricting the seed words to be head nouns ensures that the seed word is the main concept of the noun phrase. Also, this reduces the chance of finding different word senses of the seed word (though multiple noun word senses may still be a problem). We use a very narrow context window consisting of only two words, the first noun to the word's right and the first noun to its left. We collected only nouns under the assumption that most, if not all, true category members would be nouns.[1] The context windows do not cut across sentence boundaries. Note that our context window is much narrower than those used by other researchers (Yarowsky, 1992). We experimented with larger window sizes and found that the narrow windows more consistently included words related to the target category.

3. Given the context windows for a category, we compute a category score for each word, which is essentially the conditional probability that the word appears in a category context. The category score of a word W for category C is defined as:

   score(W, C) = freq. of W in C's context windows / freq. of W in corpus

   Note that this is not exactly a conditional probability because a single word occurrence can belong to more than one context window. For example, consider the sentence: I bought an AK-47 gun and an M-16 rifle. The word M-16 would be in the context windows for both gun and rifle even though there was just one occurrence of it in the sentence. Consequently, the category score for a word can be greater than 1.

4. Next, we remove stopwords, numbers, and any words with a corpus frequency < 5. We used a stopword list containing about 30 general nouns, mostly pronouns (e.g., I, he, she, they) and determiners (e.g., this, that, those). The stopwords and numbers are not specific to any category and are common across many domains, so we felt it was safe to remove them. The remaining nouns are sorted by category score and ranked so that the nouns most strongly associated with the category appear at the top.

5. The top five nouns that are not already seed words are added to the seed word list dynamically. We then go back to Step 1 and repeat the process. This bootstrapping mechanism dynamically grows the seed word list so that each iteration produces a larger category context. In our experiments, the top five nouns were added automatically without any human intervention, but this sometimes allows non-category words to dilute the growing seed word list. A few inappropriate words are not likely to have much impact, but many inappropriate words or a few highly frequent words can weaken the feedback process. One could have a person verify that each word belongs to the target category before adding it to the seed word list, but this would require human interaction at each iteration of the feedback cycle. We decided to see how well the technique could work without this additional human interaction, but the potential benefits of human feedback still need to be investigated.

[1] Of course, this may depend on the target categories.
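To make Steps 2 through 4 concrete, the sketch below shows one way the scoring could be implemented. It is only an illustration of the technique, not the authors' code: it assumes sentences that are already tokenized and part-of-speech tagged (lists of (word, tag) pairs), and it checks only that a seed occurrence is tagged as a noun rather than verifying that it is the head noun of a simple noun phrase, as the paper's parser does. All function and variable names here are our own.

```python
from collections import Counter

def nearest_nouns(sentence, i):
    """Return the first noun to the left and the first noun to the right of
    position i, without crossing the sentence boundary (the two-word window)."""
    window = []
    for j in range(i - 1, -1, -1):                 # scan left
        if sentence[j][1].startswith("NN"):
            window.append(sentence[j][0].lower())
            break
    for j in range(i + 1, len(sentence)):          # scan right
        if sentence[j][1].startswith("NN"):
            window.append(sentence[j][0].lower())
            break
    return window

def category_scores(sentences, seeds, stopwords, min_freq=5):
    """Rank nouns by: freq. of W in the category's context windows
    divided by freq. of W in the corpus (this ratio can exceed 1, as noted above)."""
    corpus_freq = Counter(w.lower() for s in sentences for w, _ in s)
    window_freq = Counter()
    for s in sentences:
        for i, (word, tag) in enumerate(s):
            if tag.startswith("NN") and word.lower() in seeds:
                window_freq.update(nearest_nouns(s, i))
    scores = {}
    for w, freq_in_windows in window_freq.items():
        if w in stopwords or w.isdigit() or corpus_freq[w] < min_freq:
            continue                               # Step 4: filtering
        scores[w] = freq_in_windows / corpus_freq[w]
    return sorted(scores.items(), key=lambda item: -item[1])
```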
After several iterations, the seed word list typically contains many relevant category words. But more importantly, the ranked list contains many additional category words, especially near the top. The number of iterations can make a big difference in the quality of the ranked list. Since new seed words are generated dynamically without manual review, the quality of the ranked list can deteriorate rapidly when too many non-category words become seed words.

3 Experimental Results

We performed experiments with five categories to evaluate the effectiveness and generality of our approach: energy, financial, military, vehicles, and weapons. The MUC-4 development corpus (1700 texts) was used as the text corpus (MUC-4 Proceedings, 1992). We chose these five categories because they represented relatively different semantic classes, they were prevalent in the MUC-4 corpus, and they seemed to be useful categories.

For each category, we began with the seed word lists shown in Figure 1. We ran the bootstrapping algorithm for eight iterations, adding five new words to the seed word list after each cycle. After the final iteration, we had ranked lists of potential category words for each of the five categories.
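The experimental setup just described (eight bootstrapping cycles, five new seed words per cycle) corresponds to an outer loop around the scoring sketch given earlier. The fragment below is again a hypothetical illustration under the same assumptions, reusing our category_scores helper and a hypothetical tagged_sentences corpus; it is not the authors' implementation.

```python
def bootstrap(sentences, seeds, stopwords, iterations=8, new_per_cycle=5):
    """Grow the seed list for a fixed number of cycles, then return the
    final ranked list of candidate category words for human review."""
    seeds = set(seeds)
    ranked = []
    for _ in range(iterations):
        ranked = category_scores(sentences, seeds, stopwords)
        # Top-ranked nouns not already in the seed list are added automatically;
        # with no human check, non-category words can creep in and dilute it.
        new_words = [w for w, _ in ranked if w not in seeds][:new_per_cycle]
        seeds.update(new_words)
    return ranked

# Example (Energy category, seeds from Figure 1):
# ranked = bootstrap(tagged_sentences,
#                    ["fuel", "gas", "gasoline", "oil", "power"],
#                    stopwords=STOPWORDS)
```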