Statistical Acquisition of Content Selection Rules for Natural Language Generation
Statistical Acquisition of Content Selection Rules for Natural Language Generation

Pablo A. Duboue and Kathleen R. McKeown
Department of Computer Science, Columbia University
{pablo,kathy}@cs.columbia.edu

Abstract

A Natural Language Generation system produces text from semantic data given as input. One of its very first tasks is to decide which pieces of information to convey in the output. This task, called Content Selection, is quite domain dependent, requiring considerable re-engineering to port the system from one scenario to another. In (Duboue and McKeown, 2003), we presented a method to acquire content selection rules automatically from a corpus of text and associated semantics. Our proposed technique was evaluated by comparing its output with the information selected by human authors in unseen texts, where we were able to filter out half the input data set without loss of recall. This report contains additional technical information about our system.

1 Introduction

CONTENT SELECTION is the task of choosing the right information to communicate in the output of a Natural Language Generation (NLG) system, given semantic input and a communicative goal. In general, Content Selection is a highly domain-dependent task; new rules must be developed for each new domain, and typically this is done manually. Moreover, it has been argued (Sripada et al., 2001) that Content Selection is the most important task from a user's standpoint (i.e., users may tolerate errors in wording, as long as the information being sought is present in the text).

Designing content selection rules manually is a tedious task. A realistic knowledge base contains a large amount of information that could potentially be included in a text, and a designer must examine a sizable number of texts, produced in different situations, to determine the specific constraints for the selection of each piece of information.

Our goal is to develop a system that can automatically acquire constraints for the content selection task. Our algorithm learns from a corpus of desired outputs for the system (i.e., human-produced text) aligned against related semantic data (i.e., the type of data the system will use as input). It produces constraints on every piece of the input, dictating whether that piece should appear in the output at all and, if so, under what conditions. This process provides a filter on the information to be included in a text, identifying all information that is potentially relevant (previously termed global focus (McKeown, 1985) or viewpoints (Acker and Porter, 1994)). The resulting information can later be further filtered, ordered, and augmented by subsequent stages in the generation pipeline (e.g., see the spreading activation algorithm used in ILEX (Cox et al., 1999)).

The research described here is part of the effort toward the automatic construction of the content planning module of PROGENIE, a biography generator (Duboue et al., 2003).

We focus on descriptive texts that realize a single, purely informative, communicative goal, as opposed to cases where more knowledge about speaker intentions is needed. In particular, we present experiments on biographical descriptions, where the planned system will generate short, paragraph-length texts summarizing important facts about famous people. The kind of text that we aim to generate is shown in Figure 1. The rules that we aim to acquire will specify the kind of information that is typically included in any biography. In some cases, whether the information is included or not may be conditioned on the particular values of known facts (e.g., the occupation of the person being described: we may need different content selection rules for artists than for politicians). To proceed with the experiments described here, we acquired a set of semantic information and related biographies from the Internet and used this corpus to learn Content Selection rules. The resource is introduced in Section 1.2; details of its construction are contained in Section 3.1.

    Actor, born Thomas Connery on August 25, 1930, in Fountainbridge, Edinburgh, Scotland, the son of a truck driver and charwoman. He has a brother, Neil, born in 1938. Connery dropped out of school at age fifteen to join the British Navy. Connery is best known for his portrayal of the suave, sophisticated British spy, James Bond, in the 1960s.

Figure 1: Sample Target Biography.

Our main contribution is to analyze how variations in the data influence changes in the text. We perform this analysis by splitting the semantic input into clusters and then comparing the language models of the clusters induced on the text side by the alignment between semantics and text in the corpus. By doing so, we gain insight into the relative importance of the different pieces of data and, thus, find out which data to include in the generated text.

The objective of this report is to explain at length the system and experiments summarized in (Duboue and McKeown, 2003). It is divided into two parts: details regarding the algorithm (Section 2) and details regarding the experiments (Section 3).

1.1 Task and Methods

We thus work on the Content Selection task, defined as choosing the right information to communicate in an NLG system. An example is shown in Figure 2. Its input is a set of attribute-value pairs (Figure 2 (a)), and its output is a subset of the input attribute-value pairs (Figure 2 (b)).

We simplify Content Selection to better fit a learning approach. In particular, we focus on descriptive texts (a single, informative, communicative goal) and on mining high-level content selection rules to filter the input.

Our approach is thus the learning of content selection rules. The input to the system is a text and knowledge resource, or TKR (Figure 3 (a)). A TKR is a set of human-written text and knowledge base pairs. The knowledge base given as input will include the type of data that a purported generator would use to generate a text satisfying the same pragmatic (i.e., communicative) goals conveyed in the human text. Nevertheless, it does not constitute an exact description of the semantics of the text. For the task at hand, a TKR therefore constitutes only weak evidence for learning, as more processing is required to determine whether a given piece of information appears in the text. Direct evidence for learning a Content Selection module is shown in Figure 3 (b), in the form of knowledge and labels (selected or unselected) over the knowledge.

The output of our system is a set of Content Selection rules, constrained by what is in the data. We mine rules expressed in two different languages: the class-based rule language and the full-fledged content selection rule language. In the class-based language, the decision of whether to include each piece of information is made solely on its class, as determined by its path from the root of the semantic representation to the actual piece of data (its data path). In the full-fledged language, the values of all data in the knowledge base can be used to decide whether or not to include each particular piece.

Central to our approach is the notion of data paths in the semantic network (an example is shown in Figure 4). Given a frame-based representation of knowledge, we need to identify particular pieces of knowledge inside the graph. We do so by selecting a particular frame as the root of the graph (in our case, the person whose biography we are generating, doubly circled in the figure) and considering the paths in the graph as identifiers for the different pieces of data. We call these data paths. Each path identifies a class of values, given the fact that some attributes are list-valued (e.g., the relative attribute in the figure).
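The notion of data paths described above can be sketched in a few lines of code. This is an illustrative sketch only: the frame names, attributes, and values below are hypothetical stand-ins (loosely modeled on the person frame of Figure 4), not the paper's actual data format.

```python
# Each frame maps attribute names to atomic values or to other frames.
# All identifiers here are illustrative assumptions.
FRAMES = {
    "person-1": {"name": "name-1",
                 "occupation": "occupation-1",
                 "relative": ["relative-1", "relative-2"]},
    "name-1": {"first": "John", "last": "Doe"},
    "occupation-1": {"TYPE": "c-actor"},
    "relative-1": {"TYPE": "c-son", "first": "Jason"},
    "relative-2": {"TYPE": "c-grandson", "first": "Dashiel"},
}

def data_paths(frame_id, prefix=()):
    """Yield (path, value) pairs for every atomic value reachable
    from the given root frame. Because list-valued attributes
    (e.g., relative) reuse the same attribute name, a single path
    identifies a class of values rather than one value."""
    for attr, val in FRAMES[frame_id].items():
        targets = val if isinstance(val, list) else [val]
        for target in targets:
            if target in FRAMES:          # pointer to another frame
                yield from data_paths(target, prefix + (attr,))
            else:                         # atomic value: emit its path
                yield prefix + (attr,), target

for path, value in data_paths("person-1"):
    print(".".join(path), "=", value)
```

Note that the two relative frames produce the same path `relative.TYPE` with different values, which is exactly why a data path names a class of values rather than a single datum.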
We use the notation …

    (a) ⟨name first, John⟩ ⟨name last, Doe⟩ ⟨weight, 150Kg⟩ ⟨height, 160cm⟩ ⟨occupation, c-writer⟩ ⟨occupation, c-producer⟩ ⟨award title, BAFTA⟩ ⟨award year, 1999⟩ ⟨relative type, c-grandson⟩ ⟨rel. firstN, Dashiel⟩ ⟨rel. lastN, Doe⟩ ⟨rel. birthD, 1990⟩

    (b) ⟨name first, John⟩ ⟨name last, Doe⟩ ⟨occupation, c-writer⟩ ⟨occupation, c-producer⟩

    (c) John Doe is a writer, producer, …

Figure 2: Content Selection Example. (a) Input: Set of Attribute-Value Pairs. (b) Output: Selected Attribute-Value Pairs. (c) Verbalization of the selected knowledge.

    (a) ⟨name first, John⟩ ⟨name last, Doe⟩ ⟨weight, 150Kg⟩ ⟨height, 160cm⟩ → "John Doe, American writer, born in Maryland in 1967, famous for his strong prose and sharp remarks, …"

    (b) ⟨name first, John⟩ ⟨name last, Doe⟩ ⟨weight, 150Kg⟩ ⟨height, 160cm⟩ → ⟨name first, John⟩ ⟨name last, Doe⟩ ⟨weight, 150Kg⟩ ⟨height, 160cm⟩

Figure 3: Input to our learning system. (a) Actual input, a set of associated knowledge base and text pairs (weak evidence for learning). (b) Fully supervised input (obtained afterwards as a byproduct of the processing done by our system).

    [Figure 4: a frame-based semantic network rooted at the doubly circled frame person-2654, with frames name-1 (first "Sean", middle "Thomas", last "Connery"), birth-1 (date-1, year 1930), occupation-1 (TYPE c-actor), and list-valued relative frames relative-1 (TYPE c-son, first "Jason") and relative-2 (TYPE c-grand-son).]

Biography generation has attracted practitioners of NLG in the past (Kim et al., 2002; Schiffman et al., 2001; Radev and McKeown, 1997; Teich and Bateman, 1994). It has the advantages of being a constrained domain amenable to current generation approaches, while at the same time offering more possibilities than many constrained domains, given the variety of styles that biographies exhibit, as well as the possibility of ultimately generating relatively long biographies.
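The class-based rule language described in Section 1.1 can be illustrated with a minimal sketch: each rule keys on a data path alone, and selection ignores the actual values. The rule set below is hypothetical, not one learned by the system, and the choice to exclude unknown paths by default is an assumption made for this sketch.

```python
# Hypothetical class-based rules: True means "always include this
# data path class", False means "never include it".
CLASS_RULES = {
    ("name", "first"): True,
    ("name", "last"): True,
    ("occupation",): True,
    ("weight",): False,
    ("height",): False,
}

def select(pairs):
    """Keep only attribute-value pairs whose data path class is
    selected; paths without a rule default to excluded (an
    assumption of this sketch, not the paper's policy)."""
    return [(p, v) for p, v in pairs if CLASS_RULES.get(p, False)]

input_pairs = [
    (("name", "first"), "John"),
    (("name", "last"), "Doe"),
    (("weight",), "150Kg"),
    (("height",), "160cm"),
    (("occupation",), "c-writer"),
    (("occupation",), "c-producer"),
]
print(select(input_pairs))
```

Run on the input above, the filter reproduces the shape of Figure 2: name and occupation pairs survive, while weight and height are discarded regardless of their values.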
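The cluster-and-compare analysis described in the introduction (split the semantic input into clusters, then compare the language models of the texts aligned with each cluster) can be sketched roughly as follows. The tiny corpus, the unigram models, the add-one smoothing, and the use of KL divergence are all illustrative choices, not the paper's actual formulation.

```python
from collections import Counter
import math

def unigram_model(texts, vocab):
    """Add-one smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(w for t in texts for w in t.split())
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def kl_divergence(p, q):
    """KL(p || q); positive when the two distributions differ."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

# Toy corpus: texts clustered by the value of a semantic attribute
# (here, occupation). A large divergence between the induced models
# suggests that attribute conditions what the texts say.
actors = ["he starred in many films", "an actor known for films"]
writers = ["she wrote several novels", "a writer of novels"]
vocab = {w for t in actors + writers for w in t.split()}

p, q = unigram_model(actors, vocab), unigram_model(writers, vocab)
print(f"KL(actors || writers) = {kl_divergence(p, q):.3f}")
```

In this toy setting the divergence is clearly positive, signaling that the occupation split is reflected on the text side and is therefore a candidate condition for content selection rules.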