
Usage of WordNet in Natural Language Generation

Hongyan Jing
Department of Computer Science, Columbia University
New York, NY 10027, USA
[email protected]

Abstract

WordNet has rarely been applied to natural language generation, despite its wide application in other fields. In this paper, we address three issues in the usage of WordNet in generation: adapting a general resource like WordNet to a specific application domain, determining how the information in WordNet can be used in generation, and augmenting WordNet with other types of knowledge that are helpful for generation. We propose a three-step procedure to tailor WordNet to a specific domain, and we report experiments on a basketball corpus (1,015 game reports, 1.7MB).

1 Introduction

WordNet (Miller et al., 1990) has been successfully applied in many human language related applications, such as word sense disambiguation, information retrieval, and text categorization; yet generation is among the fields in which the application of WordNet has rarely been explored. We demonstrate in this paper that, as a rich semantic net, WordNet is indeed a valuable resource for generation. We propose a corpus-based technique to adapt WordNet to a specific domain and present experiments in the basketball domain. We also discuss possible ways to use WordNet knowledge in the generation task and to augment WordNet with other types of knowledge.

In Section 2, we answer the question of why WordNet is useful for generation. In Section 3, we discuss the problems that must be solved to apply WordNet to generation successfully. In Section 4, we present techniques to solve these problems. Finally, we present future work and conclude.

2 Why a valuable resource for generation?

WordNet is a potentially valuable resource for generation for four reasons. First, the synonym sets in WordNet (synsets) can provide a large number of lexical paraphrases. One major shortcoming of current generation systems is their poor expressive capability: a generation system usually provides no or very limited paraphrases, due to the cost of hand-coding them in the lexicon. Synsets, however, make it possible to generate lexical paraphrases without tedious hand-coding in individual systems. For example, for the output sentence "Jordan hit a jumper", we can generate the paraphrase "Jordan hit a jump shot" simply by replacing the word jumper with its synonym jump shot, listed in the same WordNet synset. Such replacements, however, are not always appropriate. For example, tally and rack up are listed as synonyms of the word score, yet while sentences like "Jordan scored 22 points" are common in newspaper sports reports, sentences like "Jordan tallied 22 points" or "Jordan racked up 22 points" seldom occur. To successfully apply WordNet for paraphrasing, we need to develop techniques that can correctly identify the interchangeability of synonyms in a given context.

Secondly, as a semantic net linked by lexical relations, WordNet can be used for lexicalization in generation. Lexicalization maps the semantic concepts to be conveyed to appropriate words. Usually it is achieved by step-wise refinement based on syntactic, semantic, and pragmatic constraints while traversing a semantic net (Danlos, 1987). Currently most generation systems acquire their semantic net for lexicalization by building their own, while WordNet offers the possibility of acquiring such knowledge automatically from an existing resource.

Next, WordNet can be used for building the domain ontology. Most current generation systems build their domain ontology manually from scratch. The process is time and labor intensive, and errors are easily introduced. The WordNet ontology has wide coverage, so it can potentially serve as a basis for building a domain ontology.

Finally, the fact that WordNet is indexed by concepts rather than merely by words makes it especially desirable for generation. Unlike language interpretation, generation takes as input the semantic concepts to be conveyed and maps them to appropriate words. Thus an ideal generation lexicon should be indexed by semantic concepts rather than by words. Most available linguistic resources are not directly usable in generation because they lack this mapping between concepts and words. WordNet is by far the richest and largest database among the resources that are indexed by concepts. Other relatively large, concept-based resources, such as the PENMAN ontology (Bateman et al., 1990), usually include only hyponymy relations, compared to the rich set of lexical relations present in WordNet.

3 Problems to be solved

Despite the above advantages, some problems must be solved for the application of WordNet in a generation system to be successful.

The first problem is how to adapt WordNet to a particular domain. With 121,962 unique words, 99,642 synsets, and 173,941 word senses as of version 1.6, WordNet represents the largest publicly available lexical database to date. The wide coverage is on one hand beneficial, since, as a general resource, it allows WordNet to provide information for many different applications. On the other hand, it is also problematic, since it is very difficult for an application to handle such a large database efficiently. Therefore, the first step towards utilizing WordNet in generation is to prune unrelated information from the general database so as to tailor it to the domain. Conversely, domain-specific knowledge that is not covered by the general database needs to be added to it.

Once WordNet is tailored to the domain, the main problem is how to use its knowledge in the generation process. As mentioned in Section 2, WordNet can potentially benefit generation in three respects: producing a large number of lexical paraphrases, providing the semantic net for lexicalization, and providing a basis for building the domain ontology. A number of problems must be solved at this stage: (a) when using synsets to produce paraphrases, how do we determine whether two synonyms are interchangeable in a particular context? (b) while WordNet can provide the semantic net for lexicalization, the constraints for choosing a particular node during lexical choice still need to be established; (c) how should the WordNet ontology be used?

The last problem concerns augmenting WordNet with other types of information. Although WordNet is a rich lexical database, it cannot contain all the types of information needed for generation; for example, its syntactic information is weak. It is therefore worthwhile to investigate the possibility of combining it with other resources.

In the following section, we address the above issues in order and present our experimental results in the basketball domain.
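Problem (a) above — deciding when a synonym substitution is safe — can be sketched in a few lines of Python. The mini-synsets and frequency counts below are invented stand-ins for WordNet and the basketball corpus, not real data:

```python
from collections import defaultdict

# Hypothetical mini-synsets standing in for WordNet (toy data).
SYNSETS = {
    "jumper": ["jumper", "jump shot"],
    "score": ["score", "tally", "rack up"],
}

# Invented domain-corpus frequencies; tally and rack up never occur
# in the basketball reports, mirroring the example in the text.
DOMAIN_FREQ = defaultdict(int, {"jumper": 120, "jump shot": 45, "score": 3141})

def paraphrase_candidates(word, min_freq=1):
    """Synonyms of `word` that are attested in the domain corpus."""
    return [syn for syn in SYNSETS.get(word, [])
            if syn != word and DOMAIN_FREQ[syn] >= min_freq]

print(paraphrase_candidates("jumper"))  # ['jump shot']
print(paraphrase_candidates("score"))   # [] -- tally/rack up are unattested
```

A corpus-frequency filter like this is only a first cut; Section 4.2 adds selectional constraints on top of it.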

4 Solutions

4.1 Adapting WordNet to a domain

We propose a corpus-based method to automatically adapt a general resource like WordNet to a domain. Most generation systems still use hand-coded lexicons and ontologies; however, corpus-based automatic techniques are in demand as natural language generation is used in more ambitious applications and as large corpora in various domains become available. The proposed method involves three steps of processing.

Step 1: Prune unused words and synsets.

We first prune words and synsets that are listed in WordNet but not used in the domain. This is accomplished by tagging the domain corpus with part-of-speech information; then, for each word in WordNet, if it appears in the domain corpus with the same part of speech, the word is kept in the result, otherwise it is eliminated. For each synset in WordNet, if none of the words in the synset appears in the domain corpus, the synset as a whole is deleted. The only exception is that a synset which is the closest common ancestor of two synsets in the domain corpus is always kept in the result. The reason for keeping such synsets is to generalize the semantic category of verb arguments, as we illustrate in Step 2; the frequency of words in these synsets is marked as zero so that they will not be used in output. Figure 1 shows two example pruning operations: (A) is the general case, and (B) is the case involving an ancestor synset. In this step, words are not yet disambiguated, so all the senses of a word remain in the result; the pruning of unlikely senses is carried out in Step 2, where verb argument clusters are utilized. Words that occur in the corpus but are not covered by WordNet are also identified at this stage; later, in Step 3, we guess the meanings of these unknown words and place them into the domain ontology.

   before:                      after:

         { A }                     { A }
         /   \                     /   \
      { B }  ...       ===>     { D }  ...
      /   \
   { C } { F }
   /   \
 { D } { E }

   (A) Synsets A and D appear in the corpus,
       while B, C, E, and F do not.

   before:                      after:

      { A }                        { A }
      /   \                        /   \
   { B } { C }         ===>     { B } { D }
         /   \
      { D } { E }

   (B) Synsets B and D appear in the corpus;
       A, C, and E do not. Note that synset A
       is not removed, since it is the closest
       common ancestor of B and D.

   Figure 1: Examples of corpus-based pruning

A total of 1,015 news reports on basketball games (1.7MB, ClariNet news, 1990-1991) were collected. The frequency count found 1,414 unique nouns (proper names excluded) and 993 unique verbs in the corpus. Compared to the 94,473 nouns and 10,318 verbs in WordNet 1.6, only 1.5% of the nouns and 9.6% of the verbs are used in the domain. As we can see, this first pruning operation results in a significant reduction of entries.
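The pruning rule of Step 1 can be sketched as follows, with toy dictionaries in place of the real POS-tagged corpus and WordNet database (all entries and counts are invented for illustration):

```python
from collections import Counter

# (word, part-of-speech) counts from the POS-tagged domain corpus (toy data).
corpus_counts = Counter({("jumper", "n"): 57, ("shot", "n"): 210,
                         ("score", "v"): 3141})

# Toy synsets: id -> member words and part of speech (not real WordNet data).
synsets = {
    "jump_shot.n": {"words": ["jumper", "jump shot"], "pos": "n"},
    "natator.n":   {"words": ["swimmer", "natator"], "pos": "n"},
    "score.v":     {"words": ["score", "tally", "rack up"], "pos": "v"},
}

def prune(synsets, corpus_counts, ancestor_ids=frozenset()):
    """Keep a synset if one of its words occurs in the corpus with the same
    part of speech; within a kept synset, drop unattested words. Ancestor
    synsets flagged by the caller always survive (they are needed in Step 2
    to generalize argument categories)."""
    kept = {}
    for sid, s in synsets.items():
        attested = [w for w in s["words"] if corpus_counts[(w, s["pos"])] > 0]
        if attested or sid in ancestor_ids:
            kept[sid] = {**s, "words": attested or s["words"]}
    return kept

print(sorted(prune(synsets, corpus_counts)))  # ['jump_shot.n', 'score.v']
```

Here `natator.n` is deleted because none of its members occurs in the corpus, matching case (A) of Figure 1; the `ancestor_ids` parameter covers case (B).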
Among the words in the domain corpus, some appear very often (such as the verb score, which appears 3,141 times in the 1,015 reports, an average of 3.1 times per article), while some appear rarely (for example, the verb atone occurs only once in all the reports). In practical applications, low-frequency words are usually not handled by a generation system, so the effective reduction rate should be even higher.

47 (3.3%) of the nouns and 22 (2.2%) of the verbs in the corpus are not covered by WordNet. These are domain-specific words such as layup and layin. The small proportion of such words shows that WordNet is an appropriate general resource to use as a basis for building domain lexicons and ontologies, since it will probably cover most words in a specific domain. The situation might be different, however, if the domain is very specialized, for example astronomy, in which case technical terms that are heavily used in the domain might not be included in WordNet.

Step 2: Prune irrelevant senses using verb argument clusters.

Our study in the basketball domain shows that a word is typically used uniformly in a specific domain: it often has one or a few predominant senses, and the arguments of a verb tend to be semantically close to each other, belonging to one or a few more general semantic categories. In the following, we show by example how the uniform usage of words in a domain can help to identify predominant senses and to obtain semantic constraints on verb arguments.

In our basketball corpus, the verb add takes the following set of words as objects: (rebound, assist, throw, shot, basket, points). Based on the assumption that a verb typically takes arguments belonging to the same semantic category, we identify, for each word, the sense that keeps it connected to the largest number of other words in the set. For example, for the word rebound, only one of its three senses is linked to other words in the set, so that sense is marked as the predominant sense of the word in the domain.
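This sense selection can be sketched as follows. The sense inventories and ancestor sets below are hypothetical miniatures of the WordNet hierarchy, invented for illustration; real code would walk the actual hypernym links:

```python
# Each candidate sense is mapped to the set of its hypernym ancestors
# (toy data standing in for the WordNet hierarchy).
SENSES = {
    "rebound": {"rebound.n.1": {"action", "accomplishment"},
                "rebound.n.2": {"event", "bounce"},
                "rebound.n.3": {"state", "recovery"}},
    "assist":  {"assist.n.1": {"action", "accomplishment"}},
    "shot":    {"shot.n.1": {"action", "accomplishment"},
                "shot.n.2": {"artifact", "projectile"}},
}

def predominant_sense(word, cluster):
    """Return the sense of `word` that is linked, via shared hypernym
    ancestors, to the most other words in the argument cluster."""
    others = [w for w in cluster if w != word]
    def n_links(ancestors):
        # Count cluster words that have at least one sense sharing an ancestor.
        return sum(any(ancestors & anc for anc in SENSES[o].values())
                   for o in others)
    return max(SENSES[word], key=lambda s: n_links(SENSES[word][s]))

cluster = ["rebound", "assist", "shot"]
print(predominant_sense("rebound", cluster))  # rebound.n.1
```

Only `rebound.n.1` shares ancestors with the other cluster members, so it wins, mirroring the rebound example in the text.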
Table 1 shows some frequent verbs predominant sense of the word in the domain. in the domain and their selectional constraints. The algorithm we used to identify the predom- inant senses is similar to the algorithm we in- troduced in (Jing et al., 1997), which identi- {action} fies predominant senses of words using domain- / \ dependent semantic classifications and Word- {accomplishment, achievement} ... Net. In this case, the set of arguments for a / | | \ verb is considered as a semantic cluster. The {rebound} {assist} {throw} {basket} algorithm can be briefly summarized as follows: | {shot} Construct the set of arguments for a verb • Figure 2: Argument cluster Traverse the WordNet hierarchy and lo- for the verb ‘‘add’’ • cate all the possible links between senses of words in the set. ======The predominant sense of a word is the • WORD FREQ SUBJ OBJ sense which has the most number of links ======to other words in the set. score 789 player points (771) (789) basket ( 18) In this example, the words (rebound, assist, ------throw, shot, basket) will be disambiguated into add 329 player points the sense that will make all of them fall into the (accomplishment) same semantic subtree in WordNet hierarchy, as |-rebounds shown in Figure 2. The word points, however, | throws does not belong to the same category and is | shots not disambiguated. As we can see, the result is | assists much further pruned compared to result from - baskets step 1, with 5 out of 6 words are now disam------biguated into a single sense. At the mean while, hit 237 player (accomplishment) we have also obtained semantic constraints on |-jumper verb arguments. For this example, the object of | throws the verb add can be classified into two semantic | shots categories: either points or the semantic cate- - baskets gory (accomplishment, achievement). 
The clos------est common ancestor (accomplishment, achieve- outscore 45 team team ment) is used to generalize the semantic cate------gory of the arguments for a verb, even though beat 11 team team the word accomplishment and achievement are ======not used in the domain. This explains why in step 1 pruning, synsets that are the closest com- Table 1: Selectional Constraints mon ancestor of two synsets in the domain are in Basketball Domain always kept in the result. A simple parser is developed to extract sub- ject, object, and the main verb of a sentence. Note, the existing of predominant senses for We then ran the algorithm described above a word in a domain does not mean every occur- and obtained selectional constraints for frequent rence of the word must have the predominant verbs in the domain. The results show that, sense. For example, although the verb hit is for most of frequent verbs, majority of its argu- used mainly in the sense as in hitting a jumper, ments can be categorized into one or a few se- hitting a free throw in basketball domain, sen- mantic categories, with only a small number of tences like “The player fell and hit the floor” do appear in the corpus, although rarely. Such 4.2 Using the result for generation usage is not represented in our generalized se- The result we obtained after applying step 1 to lectional constraints on the verb arguments due step 3 of the above method is a reduced Word- to its low frequency. Net hierarchy, integrated with domain specific Step 3. Guessing unknown words and ontology. In addition, it is augmented with se- merging with domain specific ontologies. lection constraints and word frequency informa- tion acquired from corpus. Now we discuss the The grouping of verb arguments can also help usage of the result for generation. us to guess the meaning of unknown words. For example, the word layup is often used as Lexical Paraphrases. 
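The guessing heuristic of Step 3 can be sketched like this; the constraint table and parsed (verb, object) tuples are invented for illustration:

```python
from collections import Counter

# Object categories per verb, as produced by Step 2 (toy values).
OBJ_CATEGORY = {"hit": "(accomplishment, achievement)", "grab": "(rebound)"}

# (verb, object) pairs extracted by the parser; layup is unknown to WordNet.
observations = [("hit", "layup"), ("hit", "layup"), ("grab", "layup")]

def guess_category(word, observations):
    """Vote for the semantic category of an unknown word using the
    selectional constraints of the verbs that take it as an object."""
    votes = Counter(OBJ_CATEGORY[v] for v, obj in observations if obj == word)
    return votes.most_common(1)[0][0] if votes else None

print(guess_category("layup", observations))  # (accomplishment, achievement)
```

Voting over all observed verbs, rather than trusting a single occurrence, makes the guess more robust to parsing noise.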
We have discussed above how to prune WordNet; the other part of the work in adapting WordNet to a domain is to integrate domain-specific ontologies with the pruned WordNet ontology. There are a few possible operations for doing this. (1) Insertion. For example, if in the basketball domain we have an ontology adapted from WordNet by following Steps 1 and 2, and we also have a specific hierarchy of basketball team names, a good way to combine them is to place the hierarchy of team names under an appropriate node in the WordNet hierarchy, such as the node (basketball team). (2) Replacement. For example, in the medical domain we need an ontology of medical disorders. WordNet includes some information under the node "medical disorder", but it might not be enough to satisfy the application's needs. If such information can be obtained from a medical dictionary, however, we can substitute the subtree under "medical disorder" in WordNet with the more complete and reliable hierarchy from the dictionary. (3) Merging. If WordNet and the domain ontology contain information on the same topic, but the knowledge on either side is incomplete, we need to combine the two to obtain a better ontology. We studied the ontologies of five generation systems, in the medical, telephone network planning, web log, basketball, and business domains. Generally, a domain-specific ontology can easily be merged with WordNet by either the insertion or the replacement operation.

4.2 Using the result for generation

The result obtained after applying Steps 1 to 3 of the above method is a reduced WordNet hierarchy, integrated with the domain-specific ontology. In addition, it is augmented with selectional constraints and word frequency information acquired from the corpus. We now discuss the use of this result in generation.

• Lexical paraphrases. As mentioned in Section 2, synsets can provide lexical paraphrases; the problem to be solved is determining which words are interchangeable in a particular context. In our result, the words that appear in a synset but are not used in the domain have already been eliminated by the corpus analysis, so the words left in the synsets are basically all applicable to the domain. They can, however, be further distinguished by the selectional constraints: if A and B are in the same synset but have different constraints on their arguments, they are not interchangeable. Frequency can also be taken into account; a low-frequency word should be avoided if there are other choices. Words left after these restrictions can be considered interchangeable synonyms and used for paraphrasing.

• Discrimination net for lexicalization. The reduced WordNet hierarchy, together with the selectional and frequency constraints, makes up a discrimination net for lexicalization. Selection can be based on the generality of the words; for example, a jumper is a kind of throw, so if the user wants the output to be as detailed as possible we can say "He hit a jumper", and otherwise "He hit a throw." Selectional constraints can also be used in selecting words. For example, both the words win and score can convey the meaning of obtaining advantages, gaining points, etc., and win is a hypernym of score. In the basketball domain, win is mainly used as win(team, game), while score is mainly used as score(player, points), so depending on the categories of the input arguments, we can choose between score and win. Frequency can be used in a similar way. Although selectional constraints and frequency are useful criteria for lexical selection, there are many other constraints that a generation system can use in selecting words, for example syntactic constraints, discourse, and focus. Those constraints are usually coded in individual systems, not obtained from WordNet.

• Domain ontology. From Step 3, we acquire a unified ontology by integrating the pruned WordNet hierarchy with domain-specific ontologies. The unified ontology can then be used by the planning and lexicalization components. How the different modules use the ontology is a generation issue that we do not address in this paper.
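The win/score choice described above reduces to a lookup against the extracted constraints; the usage table here is a toy stand-in for the real output of Section 4.1:

```python
# Recorded argument categories for each verb (invented toy table).
USAGE = {
    "win":   {"subj": "team",   "obj": "game"},
    "score": {"subj": "player", "obj": "points"},
}

def choose_verb(candidates, subj_cat, obj_cat):
    """Prefer the candidate whose selectional constraints match the input
    argument categories; fall back to the first (most general) candidate."""
    for verb in candidates:
        u = USAGE[verb]
        if (u["subj"], u["obj"]) == (subj_cat, obj_cat):
            return verb
    return candidates[0]

print(choose_verb(["win", "score"], "player", "points"))  # score
print(choose_verb(["win", "score"], "team", "game"))      # win
```

Listing candidates from hypernym to hyponym makes the fallback pick the most general word when no constraint matches.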
4.3 Combining other types of knowledge for generation

Although WordNet contains rich lexical knowledge, its information on verb argument structure is relatively weak. Also, while WordNet can provide lexical paraphrases through its synsets, it cannot provide syntactic paraphrases for generation. Other resources, such as the COMLEX syntax dictionary (Grishman et al., 1994) and English Verb Classes and Alternations (EVCA) (Levin, 1993), can provide verb subcategorization information and syntactic paraphrases, but they are indexed by words and thus not directly usable in generation.

To augment WordNet with syntactic information, we combined three other resources with WordNet: COMLEX, EVCA, and the tagged Brown corpus. The resulting database contains not only rich lexical knowledge, but also substantial syntactic knowledge and language usage information. The combined database can be adapted to a specific domain using techniques similar to those introduced in this paper. We applied the combined lexicon to PLANDoc (McKeown et al., 1994), a practical generation system for telephone network planning. Together with a flexible architecture we designed, the lexicon effectively improves the system's paraphrasing power, minimizes the chance of grammatical errors, and substantially simplifies the development process. A detailed description of the combination process and the application of the lexicon is presented in (Jing and McKeown, 1998).

5 Future work and conclusion

In this paper, we have demonstrated that WordNet is a valuable resource for generation: it can produce a large number of paraphrases, provide the semantic net for lexicalization, and serve as a basis for building domain ontologies.

The main problem we discussed is adapting WordNet to a specific domain. We propose a three-step procedure based on corpus analysis to solve this problem: first, the general WordNet ontology is pruned based on a domain corpus; then verb argument clusters are used to further prune the result; and finally, the pruned WordNet hierarchy is integrated with a domain-specific ontology to build a unified ontology. The other problems we discussed are how WordNet knowledge can be used in generation and how to augment WordNet with other types of knowledge. In the future, we would like to test our techniques in domains other than basketball, and to apply them in practical generation systems.

Acknowledgment

This material is based upon work supported by the National Science Foundation under Grants No. IRI 96-19124 and IRI 96-18797, and by a grant from Columbia University's Strategic Initiative Fund. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References

J.A. Bateman, R.T. Kasper, J.D. Moore, and R.A. Whitney. 1990. A general organization of knowledge for natural language processing: the PENMAN upper-model. Technical report, ISI, Marina del Rey, CA.

Laurence Danlos. 1987. The Linguistic Basis of Text Generation. Cambridge University Press.

Ralph Grishman, Catherine Macleod, and Adam Meyers. 1994. COMLEX syntax: Building a computational lexicon. In Proceedings of COLING-94, Kyoto, Japan.

Hongyan Jing and Kathleen McKeown. 1998. Combining multiple, large-scale resources in a reusable lexicon for natural language generation. In Proceedings of COLING-ACL'98, University of Montreal, Montreal, Canada, August.

Hongyan Jing, Vasileios Hatzivassiloglou, Rebecca Passonneau, and Kathleen McKeown. 1997. Investigating complementary methods for verb sense pruning. In Proceedings of the ANLP'97 Workshop, pages 58–65, Washington, D.C., April.

Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, Chicago, Illinois.

Kathleen McKeown, Karen Kukich, and James Shaw. 1994. Practical issues in automatic documentation generation. In Proceedings of the Applied Natural Language Processing Conference 94, pages 7–14, Stuttgart, Germany, October.

George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller. 1990. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography (special issue), 3(4):235–312.