From: AAAI-93 Proceedings. Copyright © 1993, AAAI (www.aaai.org). All rights reserved.

Jacques Robin and Kathleen McKeown
Department of Computer Science
Columbia University
New York, NY 10027
{robin,kathy}@cs.columbia.edu

Abstract

The complex sentences of newswire reports contain floating content units that appear to be opportunistically placed where the form of the surrounding text allows. We present a corpus analysis that identified precise semantic and syntactic constraints on where and how such information is realized. The result is a set of revision tools that form the rule base for a report generation system, allowing incremental generation of complex sentences.

1. Draft sentence:
"San Antonio, TX - David Robinson scored 32 points Friday night lifting the San Antonio Spurs to a 127 111 victory over the Denver Nuggets."

2. Clause coordination with reference adjustment:
"San Antonio, TX - David Robinson scored 32 points Friday night LIFTING THE SAN ANTONIO SPURS TO A 127 111 VICTORY OVER DENVER and handing the Nuggets their seventh straight loss."

3. Embedded nominal apposition:
"San Antonio, TX - David Robinson scored 32 points Friday night lifting the San Antonio Spurs to a 127 111 victory over THE DENVER NUGGETS, losers of seven in a row."

Figure 1: Attaching a floating content unit onto different draft sentence subconstituents

Introduction

Generating reports that summarize quantitative data raises several challenges for language generation systems. First, sentences in such reports are very complex (e.g., in newswire game summaries the lead sentence ranges from 21 to 46 words in length). Second, while some content units consistently appear in fixed locations across reports (e.g., game results are always conveyed in the lead sentence), others float, appearing anywhere in a report and at different linguistic ranks within a given sentence. Floating content units appear to be opportunistically placed where the form of the surrounding text allows. For example, in Fig. 1, sentences 2 and 3 result from adding the same streak information (i.e., data about a series of similar outcomes) to sentence 1 using different syntactic categories at distinct structural levels.

Although optional in any given sentence, floating content units cannot be ignored. In our domain, they account for over 40% of lead sentence content, with some content types only conveyed as floating structures. One such type is historical information (e.g., maximums, minimums, or trends over periods of time). Its presence in all reports and a majority of lead sentences is not surprising, since the relevance of any game fact is often largely determined by its historical significance. However, report generators to date [Kukich, 1983], [Bourbeau et al., 1990] are not capable of including this type of information due to its floating nature. The issue of optional, floating content is prevalent in many domains and is receiving growing attention (cf. [Rubinoff, 1990], [Elhadad and Robin, 1992], [Elhadad, 1993]).

These observations suggest a generator design where a draft incorporating fixed content units is produced first and then any floating content units that can be accommodated by the surrounding text are added. Experiments by [Pavard, 1985] provide evidence that only such a revision-based model of complex sentence generation can be cognitively plausible (cf. [Robin, 1992] for discussion).

To determine how floating content units can be incorporated in a draft, we analyzed a corpus of basketball reports, pairing sentences that differ semantically by a single floating content unit and identifying the minimal syntactic transformation between them. The result is a set of revision tools, specifying precise semantic and syntactic constraints on (1) where a particular type of floating content can be added in a draft and (2) what linguistic constructs can be used for the addition. The corpus analysis presented here serves as the basis for the development of the report generation system STREAK (Surface Text Revision Expressing Additional Knowledge).

Natural Language Generation 365

Basic Sentence Example:
"Patrick Ewing scored 41 points Tuesday night to lead the New York Knicks to a 97-79 win over the Hornets"

Complex Sentence:
"Karl Malone scored 28 points Saturday and John Stockton added a season-high 27 points and a league-high 23 assists, leading the Utah Jazz to its fourth straight victory, a 105-95 win over the Los Angeles Clippers"

Figure 2: Similarity of basic and complex sentence structures

The analysis provides not only the knowledge sources for STREAK and motivations for its novel architecture (discussed in [Robin, 1993]), but also means for ultimately evaluating its output. While this paper focuses on the analysis, the on-going system implementation based on functional unification is discussed in [Robin, 1992].

After describing our corpus analysis methodology, we present the resulting revision tools and how they can be used to incrementally generate complex sentences. We conclude by previewing our planned use of the corpus for evaluation and testing.

Corpus analysis methodology

We analyzed the lead sentences of over 800 basketball game summaries from the UPI newswire. We focused on the first sentence after observing that all reports followed the inverted pyramid structure with summary lead [Fensch, 1988], where the most crucial facts are packed in the lead sentence. The lead sentence is thus a self-contained mini-report. We first noted that all 800 lead sentences contained the game result (e.g., "Utah beat Miami 105-95"), its location, date and at least one final game statistic: the most notable statistic of a winning team player. We then semantically restricted our corpus to about 300 lead sentences which contained only these four fixed content units and zero or more floating content units of the most common types, namely:

- Other final game statistics (e.g., "Stockton finished with 27 points").
- Streaks of similar results (e.g., "Utah recorded its fourth straight win").
- Record performances (e.g., "Stockton scored a season-high 27 points").

Complex sentence structure. We noted that basic corpus sentences, containing only the four fixed content units, and complex corpus sentences, which in addition contain up to five floating content units, share a common top-level structure. This structure consists of two main constituents, one containing the notable statistic (the notable statistic cluster) and the other containing the game result (the game result cluster), which are related either paratactically or hypotactically with the notable statistic cluster as head. Hence, the only structural difference is that in the complex sentences additional floating content units are clustered around the notable statistic and/or the game result. For example, the complex sentence at the bottom of Fig. 2 has the same top-level structure as the basic sentence at the top, but four floating content units are clustered around its notable statistic and a fifth one with its game result. Furthermore, we found that when floating elements appear in the lead sentence, their semantics almost always determines in which of the two clusters they appear (e.g., streaks are always in the game result cluster).

These corpus observations show that any complex sentence can indeed be generated in two steps: (1) produce a basic sentence realizing the fixed content units, (2) incrementally revise it to incorporate floating content units. Furthermore, they indicate that floating content units can be attached within a cluster, based on local constraints, thus simplifying both generation and our corpus analysis. When we shifted our attention from whole sentence structure to internal cluster structures, we split the whole sentence corpus into two subsets: one containing notable statistic clusters and the other, game result clusters.

Cluster structure. To identify syntactic and lexical constraints on the attachment of floating content units within each cluster, we analyzed the syntactic form of each cluster in each corpus lead sentence to derive realization patterns. Realization patterns abstract from lexical and syntactic features (e.g., connectives, mood) to represent the different mappings from semantic structure to syntactic structure. Examples of realization patterns are given in Fig. 3. Each column corresponds to a syntactic constituent and each entry provides information about this constituent: (1) semantic content (an empty box corresponds to a syntactic constituent required by English grammar but not in itself conveying any semantic element of the domain representation), (2) grammatical function, (3) structural status (i.e., head, argument, adjunct, etc.) and (4-5) syntactic category (the particular functions and categories are based on [Quirk et al., 1985], [Halliday, 1985] and [Fawcett, 1987]). Below each pattern a corpus example is given.

Realization patterns represent the structure of entire clusters, whether basic or complex. To discover how complex clusters can be derived from basic ones through incremental revision, we carried out a differential analysis of the realization patterns based on the notions of semantic decrement and surface decrement illustrated in Fig. 4.

A cluster Cd is a semantic decrement of cluster Ci if Cd contains all but one of Ci's content unit types. Each cluster has a set of realization patterns. The surface decrement of a realization pattern of Ci is the realization pattern of Cd that is structurally closest. Figure 3 shows a semantic decrement pairing Cd, a single content unit, with Ci, which contains two content units. Both clusters have two realization patterns associated with them as they each can be realized by two different syntactic structures. These four syntactic structure patterns must be compared to find the surface decrements. Since Rd1 is entirely included in Ri1 (remember that we compare patterns, not sentences), it is the surface decrement of Ri1. To identify the surface decrement of Ri2, we need to compare it to Rd1 and Rd2 in turn. All the content units common to Ri2 and Rd2 are realized by identical syntactic categories in both patterns. In particular, the semantic head (game-result) is mapped onto a noun ("victory" in Rd2, "triumph" in Ri2). In contrast, this same semantic head is mapped onto a verb ("to defeat") in Rd1. Rd2, rather than Rd1, is thus the surface decrement of Ri2.

We identified 270 surface decrement pairs in the corpus. For each such pair, we then determined the structural transformations necessary to produce the more complex pattern from the simpler base pattern. We grouped these transformations into classes that we call revision tools.

Revisions for incremental generation

We distinguished two kinds of revision tools. Simple revisions consist of a single transformation which preserves the base pattern and adds in a new constituent. Complex revisions are in contrast non-monotonic; an introductory transformation breaks up the integrity of the base pattern in adding in new content. Subsequent restructuring transformations are then necessary to restore grammaticality. Simple revisions can be viewed as elaborations while complex revisions require true revision.

Simple revisions. We identified four main types of simple revisions: Adjoin, Append, Conjoin and Absorb. (Our Adjoin differs from the adjoin of Tree-Adjoining Grammars: although TAGs could implement three of our revision tools, Adjoin, Conjoin and Append, they could not directly implement non-monotonic revisions.) Each is characterized by the type of base structure to which it applies and the type of revised structure it produces. For example, Adjoin applies only to hypotactic base patterns. It adds an adjunct Aa under the base pattern head Bh, as shown in Fig. 5.

Adjoin is a versatile tool that can be used to insert additional constituents of various syntactic categories at various syntactic ranks. The surface decrement pair <Rd1, Ri1> in Fig. 3 is an example of clause rank PP adjoin. In Fig. 6, the revision of sentence 5 into sentence 6 is an example of nominal rank pre-modifier adjoin: "franchise record" is adjoined to the nominal "sixth straight home defeat". In the same figure, the revision of sentence 2 into sentence 3 is an example of another versatile tool, Conjoin: an additional clause, "Jay Humphries added 24", is coordinated with the draft clause "Karl Malone tied a season high with 39 points". In general, Conjoin groups a new constituent Ac with a base constituent Bc in a new paratactic (either coordinated or appositive) complex. The new complex is then inserted where Bc alone was previously located (cf. Fig. 5). Note how in Fig. 1 paraphrases are obtained by applying Conjoin at different levels in the base sentence structure.

Instead of creating a new complex, Absorb relates the new constituent to the base constituent Bc by demoting Bc under the new constituent's head Ah, which is inserted in the sentence structure in place of Bc, as shown in Fig. 5.
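As a minimal sketch of the two monotonic tools just described, the toy constituent representation below (plain `(label, children)` tuples, an illustrative assumption of ours, not STREAK's functional-unification machinery) shows Adjoin hanging a new adjunct under a hypotactic head, and Conjoin wrapping the base constituent and the new one in a paratactic complex that takes the base's place.

```python
# Toy constituents as (label, children) tuples; an illustrative sketch only.

def adjoin(base, adjunct):
    """Simple revision: add an adjunct under the base pattern head."""
    label, children = base
    return (label, children + [("adjunct", [adjunct])])

def conjoin(base, new):
    """Simple revision: group base and new constituent in a
    paratactic (coordinated or appositive) complex."""
    return ("paratactic-complex", [base, new])

clause = ("clause", [("head", ["beat"]), ("arg", ["Chicago"]), ("arg", ["Phoenix"])])

# Clause-rank PP adjoin, as in the <Rd1, Ri1> pair of Fig. 3:
revised = adjoin(clause, "for its third straight win")
print(revised[1][-1])  # → ('adjunct', ['for its third straight win'])

# Clause conjoin, as in revising sentence 2 into sentence 3 of Fig. 6:
complexed = conjoin(("clause", ["Karl Malone tied a season high with 39 points"]),
                    ("clause", ["Jay Humphries added 24"]))
print(complexed[0])  # → paratactic-complex
```

Both operations are monotonic: the base tuple is reused unchanged inside the result, which is what distinguishes them from the complex revisions discussed later.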

[Figure 3, rendered in compact form: each constituent is glossed as content(function, status, category); below each pattern is its corpus example.]

Ri1(Ci):
winner(agent, arg, proper) | game-result(process, head, verb) | loser(affected, arg, proper) | score(adjunct, number) | streak+aspect+type+length(result, adjunct, PP: prep det ordinal adj noun)
"Chicago beat Phoenix 99-91 for its third straight win"

Rd1(Cd), surface decrement of Ri1(Ci):
winner(agent, arg, proper) | game-result(process, head, verb) | loser(affected, arg, proper) | score(adjunct, number)
"Seattle defeated Sacramento 121-93"

Ri2(Ci):
winner(agent, arg, proper) | (process, head, verb) | streak+aspect+type(affected, arg, NP: det participle noun) | length(location, adjunct, PP) | score+game-result+loser(means, adjunct, PP: prep det number noun PP)
"Utah extended its winning streak to six games with a 118-94 triumph over Denver"

Rd2(Cd), surface decrement of Ri2(Ci):
winner(agent, arg, proper) | (process, head, support-verb) | score+game-result+loser(range, arg, NP: det number noun PP)
"Chicago claimed a 128-94 victory over New Jersey"

Figure 3: Realization pattern examples

Content unit type clusters                 Space of realization patterns

Ci = {U1 ... Un-1, Un}  --linguistic realization-->  {Ri1(Ci), Ri2(Ci), ...}
        |
        | semantic decrement
        v
Cd = {U1 ... Un-1}      --linguistic realization-->  {Rd1(Cd), Rd2(Cd), ...}

Figure 4: Differential analysis of realization patterns
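The two decrement relations of the differential analysis are easy to state over a hypothetical representation: clusters as sets of content-unit types, and realization patterns as mappings from content units to the syntactic categories realizing them. The sketch below is illustrative only (the data structures and the overlap measure standing in for "structurally closest" are our assumptions, not the authors' implementation):

```python
# Sketch of the differential analysis of Fig. 4 (illustrative assumptions only).
# A cluster is a frozenset of content-unit types; a realization pattern maps
# each content unit to the syntactic category that realizes it.

def is_semantic_decrement(cd, ci):
    """Cd contains all but one of Ci's content-unit types."""
    return cd < ci and len(ci - cd) == 1

def surface_decrement(ri, cd_patterns):
    """Pick the structurally closest pattern of Cd: here, the one realizing
    the most shared content units with the same syntactic category (a crude
    stand-in for the paper's informal notion of structural closeness)."""
    def overlap(rd):
        return sum(1 for unit, cat in rd.items() if ri.get(unit) == cat)
    return max(cd_patterns, key=overlap)

ci = frozenset({"winner", "game-result", "loser", "score", "streak"})
cd = frozenset({"winner", "game-result", "loser", "score"})
assert is_semantic_decrement(cd, ci)

# Ri2 realizes game-result as a noun; so does Rd2, while Rd1 uses a verb,
# so Rd2 is selected as Ri2's surface decrement (cf. Fig. 3).
ri2 = {"winner": "proper", "game-result": "noun", "loser": "PP", "score": "number"}
rd1 = {"winner": "proper", "game-result": "verb", "loser": "proper", "score": "number"}
rd2 = {"winner": "proper", "game-result": "noun", "loser": "PP", "score": "number"}
print(surface_decrement(ri2, [rd1, rd2]) is rd2)  # → True
```

Enumerating such pairs over all cluster patterns is what yields the 270 surface decrement pairs from which the revision tools were abstracted.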

[Figure 5 schematizes, for each of the five revision tools Adjoin, Conjoin, Absorb, Adjunctization and Nominalization, its base structure and the revised structure it produces.]

Figure 5: Structural schemas of five revision tools

1. Initial draft (basic sentence pattern):
"Hartford, CT - Karl Malone scored 39 points Friday night as the Utah Jazz defeated the Boston Celtics 118 94."

2. adjunctization:
"Hartford, CT - Karl Malone tied a season high with 39 points Friday night as the Utah Jazz defeated the Boston Celtics 118 94."

3. conjoin:
"Hartford, CT - Karl Malone tied a season high with 39 points and Jay Humphries added 24 Friday night as the Utah Jazz defeated the Boston Celtics 118 94."

4. absorb:
"Hartford, CT - Karl Malone tied a season high with 39 points and Jay Humphries came off the bench to add 24 Friday night as the Utah Jazz defeated the Boston Celtics 118 94."

5. nominalization:
"Hartford, CT - Karl Malone tied a season high with 39 points and Jay Humphries came off the bench to add 24 Friday night as the Utah Jazz handed the Boston Celtics their sixth straight home defeat 118 94."

6. adjoin:
"Hartford, CT - Karl Malone tied a season high with 39 points and Jay Humphries came off the bench to add 24 Friday night as the Utah Jazz handed the Boston Celtics their franchise record sixth straight home defeat 118 94."

Figure 6: Incremental generation of a complex sentence using various revision tools
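The derivation in Fig. 6 can be read as a fold over a sequence of revision rules, each adding one floating content unit to the draft. The schematic driver below is a hypothetical sketch of ours: it uses string-level substitutions as stand-ins for the structural transformations the paper actually performs on syntactic patterns.

```python
# Schematic revision driver mirroring Fig. 6 (string-level stand-ins only).

def apply_revisions(draft, revisions):
    """Apply each revision tool in sequence; each adds one floating unit."""
    for revise in revisions:
        draft = revise(draft)
    return draft

revisions = [
    # adjunctization: "scored 39 points" -> "tied a season high with 39 points"
    lambda s: s.replace("scored 39 points", "tied a season high with 39 points"),
    # conjoin: coordinate a second player-statistic clause
    lambda s: s.replace("points Friday", "points and Jay Humphries added 24 Friday"),
    # nominalization: full verb "defeated" -> support verb + noun "handed ... defeat"
    lambda s: s.replace("defeated the Boston Celtics",
                        "handed the Boston Celtics their sixth straight home defeat,"),
]

draft = ("Karl Malone scored 39 points Friday night as the Utah Jazz "
         "defeated the Boston Celtics 118 94.")
final = apply_revisions(draft, revisions)
print("Jay Humphries" in final and "sixth straight home defeat" in final)  # → True
```

The point of the sketch is only the control structure: a monotonic pipeline in which each rule's output is the next rule's input, exactly the increment-by-increment regime that the corpus analysis licenses.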

For example, in the revision of sentence 3 into sentence 4 in Fig. 6, the base VP "added 24" gets subordinated under the new VP "came off the bench", taking its place in the sentence structure. See [Robin, 1992] for a presentation of Append.

Complex revisions. We identified six main types of complex revisions: Recast, Argument Demotion, Nominalization, Adjunctization, Constituent Promotion and Coordination Promotion. Each is characterized by different changes to the base which displace constituents, alter the argument structure or change the lexical head. Complex revisions tend to be more specialized tools than simple revisions. For example, Adjunctization applies only to clausal base patterns headed by a support verb Vs. A support verb [Gross, 1984] does not carry meaning by itself, but primarily serves to support one of its meaning-bearing arguments. Adjunctization introduces new content by replacing the support verb by a full verb Vf with a new argument Ac. Deprived of its verbal support, the original support verb argument Bc migrates into adjunct position, as shown in Fig. 5.

The surface decrement pair <Rd2, Ri2> of Fig. 3 is an example of Adjunctization: the RANGE argument of Rd2 migrates to become a MEANS adjunct in Ri2 when the head verb is replaced. The revision of sentence 1 into sentence 2 in Fig. 6 is a specific Adjunctization example: "to score" is replaced by "to tie", forcing the NP "39 points" (initially argument of "to score") to migrate as a PP adjunct "with 39 points".

In the same figure, the revision of sentence 4 into sentence 5 is an example of another complex revision tool, Nominalization. As opposed to Adjunctization, Nominalization replaces a full verb Vf by a synonymous collocation <Vs, Nf>, where Nf is a nominalization of Vf. A new constituent Ac can then be attached to Nf as a pre-modifier, as shown in Fig. 5. For example, in revising sentence 4 into sentence 5 (Fig. 6), the full-verb pattern "X defeated Y" is first replaced by the collocation pattern "X handed Y a defeat". Once nominalized, "defeat" can then be pre-modified by the constituents "their", "sixth straight" and "home" providing historical background. See [Robin, 1992] for a presentation of the four remaining types of complex revisions.

Side transformations. Restructuring transformations are not the only type of transformation following the introduction of new content in a revision. Both simple and complex revisions are also sometimes accompanied by side transformations. Orthogonal to restructuring transformations, which affect grammaticality, side transformations make the revised pattern more concise, less ambiguous, or better in its use of collocations. We identified six types of side transformations in the corpus: Reference Adjustment, Ellipsis, Argument Control, Ordering Adjustment, Scope Marking and Lexical Adjustment. The revision of sentence 1 into sentence 2 in Fig. 1 is an example of simple revision with Reference Adjustment. Following the introduction of a second reference to the losing team, "the Nuggets ...", the initial reference is abridged to simply "Denver" to avoid the repetitive form "a 127 111 victory over the Denver Nuggets, handing the Denver Nuggets their seventh straight loss". See [Robin, 1992] for a presentation of the other types of side transformations.

Revision tool usage. Table 1 quantifies the usage of each tool in the corpus, broken down by linguistic rank and by class of floating content units. Occurrences of side transformations are also given.
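The Adjunctization pattern described above can be sketched over a hypothetical flat clause representation (the dict layout, field names and helper are ours, purely for illustration): the support verb is replaced by a full verb with its own argument, and the old support-verb argument is demoted to a PP adjunct as a restructuring step.

```python
# Illustrative sketch of Adjunctization on a flat clause dict (not STREAK code).

def adjunctize(clause, full_verb, new_arg, demoted_prep):
    """Complex revision: replace a support verb by a full verb Vf taking a
    new argument, demoting the old argument into adjunct position."""
    assert clause["head-type"] == "support-verb", "applies only to support-verb clauses"
    demoted = clause["arg"]
    return {
        "head": full_verb,
        "head-type": "full-verb",
        "arg": new_arg,
        # restructuring transformation: former argument becomes a PP adjunct
        "adjunct": f"{demoted_prep} {demoted}",
    }

# Fig. 6, sentence 1 -> 2: "scored 39 points" -> "tied a season high with 39 points"
draft = {"head": "score", "head-type": "support-verb", "arg": "39 points"}
revised = adjunctize(draft, "tie", "a season high", "with")
print(revised["adjunct"])  # → with 39 points
```

The assertion encodes the applicability condition from the corpus analysis: unlike the simple revisions, this tool is licensed only for support-verb clauses, and its output is non-monotonic with respect to the base pattern.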

[Table 1, giving per-tool usage counts in the corpus, broken down by linguistic rank and by class of floating content unit (streak, extrema, etc.), with occurrences of accompanying side transformations. Totals include: adjoin 88, append 10, conjoin 127 (91 at clause rank, 28 as NP coordination, 8 as NP apposition), absorb 11 (10 at clause rank, 1 at nominal rank), adjunctization 20, and 270 revisions overall.]

Table 1: Revision tool usage in the corpus

For example, Adjoin was used 88 times in the corpus, 23 times to attach a streak at the nominal rank in the base sentence.

Figure 6 illustrates how revision tools can be used to incrementally generate very complex sentences. Starting from the draft sentence 1, which realizes only four fixed content units, five revision tools are applied in sequence, each one adding a new floating content unit. Structural transformations undergone by the draft at each increment are highlighted: deleted constituents are underlined, added constituents boldfaced and displaced constituents italicized. Note how displaced constituents sometimes need to change grammatical form (e.g., the finite VP "added 24" of (3) becomes infinitive "to add 24" in (4) after being demoted).

Conclusion and future work

The detailed corpus analysis reported here resulted in a list of revision tools to incrementally incorporate additional content into draft sentences. These tools constitute a new type of linguistic resource which improves on the realization patterns traditionally used in generation systems (e.g., [Kukich, 1983], [Jacobs, 1985], [Hovy, 1988]) due to three distinctive properties:

- They are compositional (concerned with atomic content additions local to sentence subconstituents).
- They incorporate a wide range of contextual constraints (semantic, lexical, syntactic, stylistic).
- They are abstract (capturing common structural relations over sets of sentence pairs).

These properties allow revision tools to opportunistically express floating content units under surface form constraints and to model a sublanguage's structural complexity and diversity with maximal economy and flexibility. Our analysis methodology based on surface decrement pairs can be used with any textual corpus.

Revision tools also bring together incremental generation and revision in a novel way, extending both lines of research. The complex revisions and side transformations we identified show that accommodating new content cannot always be done without modifying the draft content realization. They therefore extend previous work on incremental generation [Joshi, 1987], [De Smedt, 1990] that was restricted to elaborations preserving the linguistic form of the draft content. As content-adding revisions, the tools we identify also extend previous work on revision [Meteer, 1991], [Inui et al., 1992] that was restricted to content-preserving revisions for text editing.

In addition to completing the implementation of the tools we identified as revision rules for the STREAK generator, our plans for future work include the evaluation of these tools. The corpus described in this paper was used for acquisition. For testing, we will use two other corpora. To evaluate completeness, we will look at another season of basketball reports and compute the proportion of sentences in this test corpus whose realization pattern can be produced by applying the tools acquired in the initial corpus. Conversely, to evaluate domain-independence, we will compute, among the tools acquired in the initial corpus, the proportion of those resulting in realization patterns also used in a test corpus of stock market reports.

The example below suggests that the same floating constructions are used across different quantitative domains:

- "Los Angeles - John Paxson hit 12 of 16 shots Friday night to score a season high 26 points helping the Chicago Bulls snap a two game losing streak with a 105 97 victory over the Los Angeles Clippers."
- "New York - Stocks closed higher in heavy trading Thursday, as a late round of computer-guided buy programs tied to triple-witching hour helped the market snap a five session losing streak."

Although the analysis reported here was carried out manually for the most part, we hope to automate most of the evaluation phase using the software tool CREP [Duford, 1993]. CREP retrieves corpus sentences matching an input realization pattern encoded as a regular expression of words and part-of-speech tags.

Acknowledgments

Many thanks to Tony Weida and Judith Klavans for their comments on an early draft of this paper. This research was partially supported by a joint grant from the Office of Naval Research and the Defense Advanced Research Projects Agency under contract N00014-89-J-1782, by National Science Foundation Grants IRT-84-51438 and GER-90-2406, and by New York State Center for Advanced Technology Contracts NYSSTF-CAT(92)-053 and NYSSTF-CAT(91)-053.

References

Bourbeau, L.; Carcagno, D.; Goldberg, E.; Kittredge, R.; and Polguère, A. 1990. Bilingual generation of weather forecasts in an operations environment. In Proceedings of the 13th International Conference on Computational Linguistics. COLING.

De Smedt, K.J.M.J. 1990. IPF: an incremental parallel formulator. In Dale, R.; Mellish, C.S.; and Zock, M., editors, Current Research in Natural Language Generation. Academic Press.

Duford, D. 1993. CREP: a regular expression-matching textual corpus tool. Technical Report CUCS-005-93, Columbia University.

Elhadad, M. and Robin, J. 1992. Controlling content realization with functional unification grammars. In Dale, R.; Hovy, E.; Roesner, D.; and Stock, O., editors, Aspects of Automated Natural Language Generation. Springer-Verlag. 89-104.

Elhadad, M. 1993. Using argumentation to control lexical choice: a unification-based implementation. Ph.D. Dissertation, Computer Science Department, Columbia University.

Fawcett, R.P. 1987. The semantics of clause and verb for relational processes in English. In Halliday, M.A.K. and Fawcett, R.P., editors, New developments in systemic linguistics. Frances Pinter, London and New York.

Fensch, T. 1988. The sports writing handbook. Lawrence Erlbaum Associates, Hillsdale, NJ.

Gross, M. 1984. Lexicon-grammar and the syntactic analysis of French. In Proceedings of the 10th International Conference on Computational Linguistics. COLING. 275-282.

Halliday, M.A.K. 1985. An introduction to functional grammar. Edward Arnold, London.

Hovy, E. 1988. Generating natural language under pragmatic constraints. L. Erlbaum Associates, Hillsdale, NJ.

Inui, K.; Tokunaga, T.; and Tanaka, H. 1992. Text revision: a model and its implementation. In Dale, R.; Hovy, E.; Roesner, D.; and Stock, O., editors, Aspects of Automated Natural Language Generation. Springer-Verlag. 215-230.

Jacobs, P. 1985. PHRED: a generator for natural language interfaces. Computational Linguistics 11(4):219-242.

Joshi, A.K. 1987. The relevance of tree-adjoining grammar to generation. In Kempen, G., editor, Natural Language Generation: New Results in Artificial Intelligence, Psychology and Linguistics. Martinus Nijhoff Publishers.

Kukich, K. 1983. The design of a knowledge-based report generator. In Proceedings of the 21st Conference of the ACL. ACL.

Meteer, M. 1991. The implications of revisions for natural language generation. In Paris, C.; Swartout, W.; and Mann, W.C., editors, Natural Language Generation in Artificial Intelligence and Computational Linguistics. Kluwer Academic Publishers.

Pavard, B. 1985. La conception de systèmes de traitement de texte. Intellectica 1(1):37-67.

Quirk, R.; Greenbaum, S.; Leech, G.; and Svartvik, J. 1985. A comprehensive grammar of the English language. Longman.

Robin, J. 1992. Generating newswire report leads with historical information: a draft and revision approach. Ph.D. Thesis Proposal, Technical Report CUCS-042-92, Computer Science Department, Columbia University, New York, NY.

Robin, J. 1993. A revision-based generation architecture for reporting facts in their historical context. In Horacek, H. and Zock, M., editors, New Concepts in Natural Language Generation: Planning, Realization and Systems. Frances Pinter, London and New York.

Rubinoff, R. 1990. Natural language generation as an intelligent activity. Ph.D. Thesis Proposal, Computer Science Department, University of Pennsylvania.
