<<

RevisionBased Generation of

Natural Language Summaries

Providing Historical Background

CorpusBased Analysis Design Implementation and Evaluation

Jacques Robin

Technical Rep ort CUCS

Submitted in partial fulllment of the

requirements for the degree

of Do ctor of Philosophy

in the Graduate Scho ol of Arts and Sciences

Columbia University

Decemb er

c

Jacques Robin

All Rights Reserved

Abstract

RevisionBased Generation of Natural Language Summaries

Providing Historical Background

CorpusBased Analysis Design Implementation and Evaluation

Jacques Robin

Automatically summarizing vast amounts of online quantitative data with a short natural language para

graph has a wide range of realworld applications However this sp ecic task raises a numb er of dicult

issues that are quite distinct from the generic task of language generation conciseness complex sentences

oating concepts historical background paraphrasing p ower and implicit content

In this thesis I address these sp ecic issues by prop osing a new generation mo del in which a rst pass

builds a draft containing only the essential new facts to rep ort and a second pass incrementally revises this

draft to opportunistical ly add as many background facts as can t within the space limit This mo del requires

a new typ e of linguistic knowledge revision op erations which sp ecifyies the various ways a draft can b e

transformed in order to concisely accommo date a new piece of information I present an indepth corpus

analysis of humanwritten sp orts summaries that resulted in an extensive set of such revision op erations I

also present the implementation based on functional unication grammars of the system streak which

relies on these op erations to incrementally generate complex sentences summarizing games This

thesis also contains two quantitative evaluations The rst shows that the new revisionbased generation

mo del is far more robust than the onepass mo del of previous generators The second evaluation demonstrates

that the revision op erations acquired during the corpus analysis and implemented in streak are for the

most part p ortable to at least one other quantitative domain the sto ck market

streak is the rst rep ort generator that systematically places the facts which it summarizes in their

historical p ersp ective It is more concise than previous systems thanks to its ability to generate more complex

sentences and to opp ortunistically convey facts by adding a few words to carefully chosen draft constituents

The revision op erations on which streak is based constitute the rst set of corpusbased linguistic knowledge

geared towards incremental generation The evaluation presented in this thesis is also the rst attempt to

quantitatively assess the robustness of a new generation mo del and the p ortability of a new typ e of linguistic

knowledge

Contents

Intro duction

Starting p oints

The need for summary rep ort generation

Summarization issues

Asp ects of the research

An opp ortunistic generation mo del

A corpus analysis to acquire revision rules

An implemented revisionbased generator

A quantitative evaluation of the opp ortunistic mo del and revision rules

Contributions

Guide to the rest of the thesis

Corpus analysis of newswire rep orts

Motivation and metho dology

Fo cusing on a subcorpus

Rep ort structure and discursive fo cus

Content typ es and semantic fo cus

Domain ontology

Corpus sentence structures

Toplevel structure concept clustering

Internal cluster structure realization patterns

Revision to ols

Dierential analysis of realization patterns identifying revision to ols

Classifying revision op erations

Summary

A new generation architecture

Text generation subtasks

Motivation for a new architecture

From corpus observations to needed abilities i

From needed abilities to design principles

A revisionbased architecture for incremental generation

Internal utterance representation

Pro cessing comp onents and overall control

Generation subtasks revisited

Summary

Implementation prototyp e the STREAK generator

Goals and scop e of the implementation

The internal representation languages

The DSS language

The SSS language

The draft representation

The pro cessing comp onents

The phrase planner

The lexicalizer

The reviser

STREAK at work two example runs

Chaining revisions to generate a complex sentence from a simple draft

Choice of revision rule constrained by the surface form of the draft

STREAK by the numb ers

Evaluation

Evaluation and language generation

The diculty of the issue

The rising interest in the issue

The diversity of the issue

The evaluation approach of this thesis

Quantitative evaluation of robustness

Review of acquisition corpus results

Evaluation goals and test corp ora

Evaluation parameters

Partially automating the evaluation

Results

Quantitative evaluation of p ortability

Starting p oint

Metho dology

Results ii

Related work

Related work in summary rep ort generation

Ana

Systems based on the MeaningText Theory

Other summary rep ort generators

Comparison with STREAK

Related work in revisionbased generation

Yh

weiveR

At what level to p erform revision

Revising to satisfy what goals

Related work in incremental generation

Incremental generation and Tree Adjoining Grammars

Incremental generation in IPF

Related work in evaluation for generation

Conclusion

Contributions

Contributions to summary rep ort generation

Contributions to generation architecture

Contributions to revisionbased generation and incremental generation

Contributions to evaluation in generation

Contributions to SURGE

Limitations and future work

Limitations linked to the use of FUF

Limitations prop er to STREAK

Implementing fact generation and discourse planning

Further evaluations

A A Complete Inventory of Revision Op erations

A Monotonic revisions

A Adjoin

A Absorb

A Conjoin

A App end

A Nonmonotonic revisions

A Recast

A Adjunctization

A Nominalization iii

A Demotion

A Co ordination Promotion

A Side transformations

A Reference adjustment

A Argument control

A Ellipsis

A Scop e marking

A Ordering adjustment

A Lexical adjustment

B Implementation environment the FUFSURGE package

B The FUF programming language

B Functional Descriptions

B Paths

B Functional Grammars

B The SURGE unication grammar of English

B The task of a syntactic grammar

B The clause subgrammar

B The nominal subgrammar

B Rest of the syntactic grammar

B SURGE extensions to complex sentences and quantitative nominals

B Starting p oint

B Extensions to the clause subgrammar

B Extensions to the nominal subgrammar

B Extensions to the rest of the grammar

B Summary

B Limitations of SURGE

B Transitivity

B Adverbials

B Mo o d

B Voice

B Nominals

B Coverage of SURGE

B Test inputs for the extensions at the nominal rank

B Test inputs for the extensions at the complex sentence rank

C Example runs of STREAK

C Original example runs with input and draft representation

C Example run Chaining dierent revision rules to generate a complex sentence iv

C Example run Choice of revision rule constrained by the surface form of the draft

C Additional example runs

C Example run Generating draft paraphrases

C Example run Using revision to generate paraphrases

C Example run Applying the same revision rules in dierent contexts

D Partially automating corpus analysis the CREP software to ol

D A brief overview of CREP syntax

D Using CREP for lexical knowledge acquisition

D Enco ding realization patterns as CREP expressions

D Porting the signature of a revision to ol a detailed example v

List of Figures

A humanwritten newswire summary

The rep ort of Fig stripp ed o its historical information

The facts in the lead sentence of the rep ort in Fig

Paragraph made of simple sentences paraphrasing the complex lead sentence of Fig

Variety of syntactic paraphrases

Traditional sentence generation mo del

Opp ortunistic sentence generation mo del

Paired corpus sentences revealing revision op erations

Standard b oxscore and corresp onding natural language rep ort

Domain ontology

Toplevel structures of basic sentences

Complex sentence with toplevel hyp otactic structure

Concept clusters

Realization pattern examples

Dierential analysis of realization patterns

Revision op eration hierarchy toplevels

Adjoin schema

Adjoin revision op eration hierarchy

Adjunctization schema

Adjunctization revision op eration hierarchy

Syntactic diversity of historical information

Streak information in a corpus rep ort

Internal representation layers

An example DSS dss

sss left and sss right two dierent p ersp ectives on dss

Content units of dss explicitly realized in sss

Content units of dss explicitly realized in sss and sss

sss a third p ersp ective on dss including a rhetorical event

dgs left and dgs right two dierent lexicalizations of sss vi

dgs an example lexicalization of sss

dgs an example lexicalization of sss

An example of threelayered draft representation

A revisionbased generation architecture overall picture

A revisionbased generation architecture draft time

A revisionbased generation architecture revision time

A very simple conceptual network

Representing a at conceptual net as an FD

Example of fact in input format in streak

Example of input FD for basic draft

Example of conceptual network containing an historical record fact

FD representation of the subnet of Fig

FD representing a streak extension

FD representing a streak record up date

FD enco ding of sss

FD enco ding of sss

FD enco ding of sss

Example FD enco ding a paratactically structured SSS

FD representing the draft phrase the defeated the Toronto Raptors

Partial SSS after toplevel unication

Example of phrase planning rules

Enco ding verb choice in streak

Enco ding the nominalization revision to ol in streak

Testing game result material clauses and losing streak extensions

streak generating a simple draft and incrementally revising it into a complex one

streak using dierent revision rules dep ending on the surface form of the draft

Test corpus partitions for conceptual and clustering coverage

Test corpus partitions for paraphrasing coverage

Signature of Adjoin of Classifier w example phrases

Example of nonoverlapping set of target clusters for a given to ol

crep expressions approximating the signature of Adjoin of Relative Clause to Top NP

Denition le for the subexpressions of Fig

Revision op eration hierarchy toplevels

Adjoin revision op eration hierarchy

Conjoin revision op eration hierarchy

Absorb revision op eration hierarchy

Adjunctization revision op eration hierarchy vii

Recast and Nominalize revision op eration hierarchy

Demotion and Promotion revision op eration hierarchy

An example rep ort generated by Ana from Kukich

Example fact and message in ana from Kukich

Example phrasal lexicon entry in ana from Kukich

An example rep ort generated by fog from Polguere

An example rep ort generated by gossip from Carcagno and Iordanska ja

Conceptual Communicative vs Deep Semantic representations the MeaningText Theory

from Carcagno and Iordanska ja

Example text generated by Yh initial and nal drafts from Gabriel

Adjoining in TAGs

Substitution of minimal trees in lexicalized TAGs

Mapping rules in synchronous TAGs

Example inputoutput of a systemic TAG generator

Canonic and noncanonic list representations of the same FD

Unique graph corresp onding to the lists of Fig

A Revision op eration hierarchy toplevels

A Adjoin schema

A Adjoin revision op eration hierarchy

A Absorb schema

A Absorb revision op eration hierarchy

A Conjoin schema

A Conjoin revision op eration hierarchy

A App end schema

A Recast schema

A Recast revision op eration hierarchy

A Adjunctization schema

A Adjunctization revision op eration hierarchy

A Nominalization schema

A App end and Nominalization revision op eration hierarchy

A Demotion schema

A Demotion and Promotion revision op eration hierarchy

A Promotion schema

B Space of FDs

B FD as equation set

B Simple backtracking example viii

B Constituent recursion

B Relative vs absolute paths

B The FD of gures and viewed as a graph

B Unidirectional vs bidirectional unication

B Using control to enco de negation

B Mo dular denition of an FG

B Insertfd pasting a smaller FD into a larger FD

B Relo cate cutting a smaller FD from a larger FD

B Alternative list representations of the same FD

B Enriched sp ecication for the sentence They are winning

B Simple pro cess hierarchy in surge

B Comp osite pro cesses in surge

B Circumstantials in surge

B Mo o d in surge

B NominalInternal grammatical functions and llers in surge

B Example input descriptions of paratactic complexes in surge

B Nuclearity of clause constituents

B Thematic to syntactic role mapping for adverbials in surge

B Predicate Adjuncts in surge

B Sentence Adjuncts in surge

B Disjuncts in surge

B Extensions to the mo o d system

B Example input description with controlled sub ject

B Extensions to the nominal grammar

B Input description of a complex nominal in surge

C Applying a dierent rule at each revision increment

C Form ags used in the example runs

C Input containing the four xed facts for Draft in Run

C List of oating fact lists in Run with rst subelement

C Second oating fact in Run

C Third and fourth oating facts in Run

C Fifth oating fact in Run

C Sixth seventh and eigth oating facts in Run

C DSS layer of nal draft in Run Continued next page

C DSS layer of nal draft in Run Continued from previous page and continued next page

C DSS layer of nal draft in Run Continued from previous page and continued next page

C DSS layer of nal draft in Run Continued from previous page

C SSS layer of nal draft in Run Continued next page ix

C SSS layer of nal draft in Run Continued from previous page

C DGS layer of nal draft in Run Continued next page

C DGS layer of nal draft in Run Continued from previous page

C streak using dierent revision rules dep ending on the surface form of the draft

C Preconditions to using nominalization for losing streak extensions

C DSS layer of b oth drafts in Run

C SSS layer of the rst initial draft in Run

C DGS layer of the rst initial draft in Run

C SSS layer of the second initial draft in Run

C DGS layer of the second initial draft in Run

C streak generating draft paraphrases

C streak generating paraphrases by using a variety of revision rules

C Complex sentence generation where the same revision rule is used at dierent draft levels

D CREP search result for reb ounding statistic verbs

D Realization pattern examples repro duced from Section

D CREP expression E for realization pattern R

D crep subexpression denitions for realization pattern R

D Original signature of Adjoin of CoEvent Clause to Clause in sp orts domain

D Signature of Adjoin of CoEvent Clause to Clause p orted to the nancial domain x

Acknowledgements

1

NEW YORK CPI Jacques Robin wrote a career high pages published pap ers and programmed

over K of fuf and lisp co de to pursue his p ersonal turnaround and lead the French Generation to a

quadruple victory over the PhD Dissertations that allows the based team to threep eat

under the tenure of Coach Kathy McKeown

Robins record p erformance was made p ossible by the outstanding commitment to defense of John Kender

Karen Kukich Steve Feiner and Diane Litman the latter two rep eating their awless p erformance of the last

series in which they played a key role in the Frenchs triumph over the Thesis Prop osals Kukich also tremen

dously help ed at the oensive end with her contagious enthusiasm and her judiciously PLANned Dunks which

came at crucial moments triggering thunderous applause that further demoralized the opp osition

The prevalent wisdom among forecasters during the preseason was that this year the Generation would

struggle even to barely survive the playo qualiers After losing Franky Mr Eciency Smadja the

alltime leader in the pap ers p er minutes played category to an early retirement two seasons ago team

captain Michael The Guru Elhadad just stunned the world by accepting a deep salary cut to sign as a

freeagent with the Negev MetaBarons the very summer after he b ecame the rst player ever to rank in the

top in distinct statistical categories research insights philosophical foundations linguistic erudition

lisp wizardry emacs macro hacking p edagogical excellence consulting exp ertise business acumen familial

nurturing and religious commitment

Coach McKeowns rep eated early warning that her team could threep eat even without Mike was initially

dismissed by many as little more than her latest rhetorical schema to keep her group condent and under

pressure Once more she proved all her critics wrong b ecoming the rst coach to ever mount a successful

title defense without the Most Valuable Phd from the previous year for the second straight time Though

her famous strategical approach copied all over the league has always b een highly rated it is only this year

that she nally gets all due credit for her scouting prowess

Who else could have masterminded the acquisition of Eric Data Siegel who is rapidly evolving as a key

player for the Tetra while his tness measure under the constant pro dding of team aerobic conditioner Tony

Weida sets an example for the rest of the Generation Or dared to b et on Vasileios Hatzi Vassiloglou

whom she snatched from the Carnegie Mellons for a second round pick when he was little more than an ex

Greek League scoring champion with a smoking problem and predicted that staying for longer late night

practices than anyone else it would take him only a few months to b ecome so statistically signicant as to

inspire nothing but fear in the whole Eastern Conference

But b eyond shrewd recruiting it is p erhaps the making of a lateblo oming and unlikely allstar that will

remain as Coach McKeowns most remarkable legacy this year The turning p oint came when the Generation

were riding a four week losing streak and she decided to insert Robin in the starting lineup This b old move

surprised everyb o dy not the least Robin himself It was really the last thing I exp ected he explains I

was worn out by the grueling road trip that we had just completed which included back to back games

against the Qualifying Examinations the Teaching Assistantships the Teaching Requirements and the Area

Pap ers and I was really discouraged by the little research minutes I was seeing o the b ench So when I

entered Coach McKeowns oce that day I just told her that I really felt that I did not have what it takes

to ever b ecome an impact player and thus preferred to quit Leaving her oce a starter after such defeatist

discourse really felt like an hallucination In retrosp ect though her improbable reaction is a p erfect example

of her unique ability to pragmatically make the most of any situation

Head Coach McKeown was not the only p erson Robin wanted to thank in his p ost title defense game

interview I would also like to thank Assistant Coaches Becky Passoneau and Judith Klavans for their

tactical advice teammate Darrin Plaster Cramp Duford for assisting on so many of my p oints and weight

trainer Mel Francis for keeping me strong Together with Israel and Tamar BenShaul Doree Seligmann

Tom ODonnell Paul Michelman Pino Italiano and Pascale Fung they also help ed remind me that even

under the harrowing pressure of the professional level where results are everything one should still try to

have some fun along the way Thanks also to my agents Danilo and Patricia Florissi and Sandra Aluisio

1

Cyb erPress International xi

for getting me the b est p ossible trade to Brazil On the motivational side I must cite the ultimate warrior

Bulen t Pho enix Yener for leading by example with his survival camp mentality and almost winning me

over to the idea that one should never ever give up even in the face of overwhelming and seemingly random

cruel adversity In that resp ect I am also indebted to Pierre Moroni for providing the initial condence

b o ost Finally on the p ersonal side I like would to keep my very most heartfelt thanks to Thierno Mol Jim

Cardoverde Terremotto and Dynamo for their invaluable complicity and ab ove all to yene feker Nunu Challa

and my parents Jean and Jacqueline for their patient unabated unconditional and all around supp ort xii

A la memoire cherie de Lulu xiii

Chapter

Intro duction

Starting p oints

The need for summary rep ort generation

In recent years the volume of information available online has grown exp onentially and several largescale

eorts such as the information sup erhighway and digital library initiatives are currently under way to

further accelerate this growth In order to put this abundance of electronic data to go o d use and avoid

information inundation the development of automatic summarization facilities is critical

Three elds of computer science can contribute to the development of such facilities information retrieval

multimedia presentations and natural language pro cessing Concerning the use of natural language there is

an imp ortant distinction b etween two forms of online information textual form whose content is unrestricted

and tabular form whose content is only quantitative Summarizing free text generally requires unrestricted

natural language interpretation which has remained to o elusive a research goal for immediate application In

contrast automatically pro ducing a natural language paragraph summarizing a large amount of quantitative

data unexploitable in its original exhaustive and tabular form is p ossible to day In fact in a recent round

table on technology transfer Zo ck et al several participants singled out this particular task as the

most promising industrial application of natural language generation Meteorological measurements sto ck

market indexes nancial audit data computer surveillance trails lab or and census statistics sp orts statistics

and hospital patient histories are but a few examples illustrating the p ervasiveness of online data that is

b est exploited when summarized by a short natural language text Indeed one of the only two language

1

generation systems currently in daily use pro duces bilingual weather forecasts for the Canadian weather

2

service

In this thesis I show that generating summaries raises a numb er of dicult issues that are quite distinct

from the generic task of language generation To address these sp ecic issues I prop ose a new generation

mo del where a rst pass builds a draft containing only the essential new facts to rep ort and a second pass

incrementally revises this draft to opportunistical ly add as many background facts as can t within the space

limit This mo del requires a new typ e of linguistic knowledge revision op erations sp ecifying the various

ways a draft can b e transformed in order to concisely accommo date a new piece of information I present

b oth an indepth corpus analysis of humanwritten sp orts summaries that resulted in an extensive set of

3

such revision op erations and the implementation of the system streak which relies on these op erations

to incrementally generate complex sentences summarizing basketball games I also quantitatively evaluate

the advantages of this new revisionbased generation mo del over the onepass mo del of previous generators

1

cf fog Bourb eau et al see Section for details

2

The other system working as a realworld application is plandoc Kukich et al McKeown et al which

pro duces automated do cumentation for managers ab out the choices of telephone network planning engineers at Bellcore

3

Surface Text Reviser Expressing Additional Knowledge

in terms of samedomain robustness and crossdomain p ortability This evaluation used test corp ora in b oth

the sp orts and nancial domains where test data was distinct from the acquisition data

Summarization issues

4

By insp ecting a corpus of sp orts rep orts like the one in Fig I identied seven challenging issues that

summarization raises for language generation systems Most of these issues have b een largely ignored by

previous generators which did not work within space limits These issues include oating concepts

historical background conciseness sentence complexity paraphrasing granularity and

implicit content The rep orts of the corpus I analyzed summarize basketball games In each of these rep orts

the lead sentence itself summarizes the rest of the rep ort Mencher Fensch

Floating concepts

While some concepts consistently app ear in xed lo cations across rep orts eg the nal score of a ballgame

is always conveyed in the second half of the lead sentence others oat app earing p otentially anywhere

5

in the rep ort structure Consider for example the two instances of the streak concept in Fig One

instance b oldfaced on line concerning the Nuggets is conveyed in the lead But the other concerning

the Kings is the very last fact conveyed in the rep ort Floating concepts thus app ear to b e opportunistical ly

realized where the form of the surrounding text allows Consider the Kings losing streak in Fig Instead

of the closing sentence it could have alternatively b een attached to either the lead sentence or the second

sentence since b oth of them also contain a reference underlined to the Kings The particular choice of the

closing sentence seems to b e motivated by stylistic surface form factors the lead sentence is already complex

enough and the reference to the Kings in the second sentence is to o deeply emb edded to b e the ob ject of

an app osition Similar considerations probably explain the choice of a separate sentence the third in the

rep ort as opp osed to attachment to the lead for the remaining facts concerning the Nuggets

The exibility with which oating facts can b e conveyed is an asset for a summarization application

among the various forms and lo cations p ossible for their inclusion in the rep ort the most concise one can b e

chosen However for a generator this exibility represents a diculty it requires searching a much larger

space of expressive options The basic idea underlying the draft and revision approach prop osed in this

thesis is to use the inherent rigidity of the xed facts as constraints to reduce the search space of options for

the expression of the oating ones

Historical background

Although optional in any given sentence oating concepts cannot b e ignored in the corpus I analyzed they

account for over of total lead sentence content But what makes oating concepts even more crucial

than this p ervasiveness is that they cover entire content typ es One such typ e is historical information

ie past facts related to the new rep orted ones which explain their relevance The imp ortance of such

background facts is illustrated in Fig where they are b oldfaced Omitting them would result in a much

imp overished rep ort as shown in Fig where the same rep ort has b een stripp ed from these historical

facts In the corpus I analyzed of the lead sentences contained some historical fact Yet previous

summary rep ort generation systems were not capable of including such background information in their

rep orts due to its oating nature Conveying such facts requires a generator to access an historical database

to supplement the table of new statistics it receives as input But it allows the generator to contextualize

its input data in addition to summarizing it Readers do not merely want to know what happ ened but also

why it is interesting

4

Taken from the UPI newswire

5

ie information ab out series of similar outcome

Conciseness

A generator can p erform two dierent typ es of summarization conceptual summarization by selecting the

essential facts and linguistic summarization by expressing the selected facts in compact surface form Consid

ering the historical background dramatically increases the numb er of candidate facts to include in the rep ort

and thus emphasizes the need for linguistic summarization When limited space is available background

facts cannot b e conveyed by separate sentences Instead as in Fig they can b e concisely expressed by

small phrases sometimes down to a single word eg rookie in front of the last sentence woven into the

expression of the new facts which provide the rep ort structure backb one This is always p ossible since an

historical fact is relevant only if it is related to some new fact to rep ort Previous rep ort generators fo cused

either on p erforming conceptual summarization andor on a form of linguistic summarization limited to

clause combining and anaphora They therefore failed to exploit the full p otential for conciseness laying

inside each clause

Sentence complexity

6

A ma jor strategy to achieve conciseness is to use complex sentences This is why sentences in newswire

rep orts tend to b e very long esp ecially lead sentences which themselves summarize the rest of the rep ort by

packing together all the crucial facts For example in Fig the lead sentence alone conveys ten facts An

intuitive logical form representing these facts is given in Fig

The compactness of complex sentences is illustrated in Fig It contains a multisentence paragraph

that paraphrases the lead sentence of the rep ort of Fig In this paragraph each of the ten facts packed

inside this lead sentence is conveyed by an indep endent simple sentence This rep ort is an example of

what a system p erforming only conceptual summarization and a form of linguistic summarization limited to

anaphora can generate In numb er of words it is twice the length vs of the synonymous complex

sentence Complex sentences are concise b ecause by grouping together several facts they can factor out

their common content units For example there are ve o ccurrences of the unit michaeladams in the

ten facts of Fig Grouping these ten facts in the complex lead of Fig allow collapsing these ve

o ccurrences in a single referring NP

7

The table b elow contrasts the complexity of two previous summary rep ort generators gossip Carcagno

and Iordanska ja and ana Kukich with the complexity of the lead sentences in the corpus I

analyzed Its rows indicate the numb er of facts represented in logical form as in Fig parse tree depth

8

and numb er of words of a sentence For gossip and ana these numb ers are based on the few example

9

rep orts provided in publications ab out these systems

Previous systems Humanwritten

gossip ana lead sentences

max max min max

Factual density

Syntactic depth

Lexical length

This table p oints out a complexity gap b etween sentences generated by these systems and the ones

observed in humanwritten newswire summaries One reason why these two systems could pro duce go o d

summaries in their resp ective domains with simpler sentences is that they excluded references to historical

facts from their target sublanguage Other systems that generate rep orts that are not sp ecically summaries

can avoid generating very complex sentences since they are under no pressure to b e concise Two such

systems Danlos generator Danlos and pauline Hovy generate sentences of a complexity

6

As paradoxical as it may rst sound

7

cf Section for details on these two systems

8

Counting b oth op en and closed class words

9

Though by no mean a rigorous systematic comparison it nonetheless gives an estimate that is fair enough for the p oint I

make here

Sacramento Ca Michael Adams scored a careerhigh p oints Wednesday night including seven p oint

baskets to help the shorthanded end a vegame losing streak with a victory

over the

Adams who was drafted and then discarded by the Kings four seasons ago made of eld

goals including seven of p oint attempts and hit three of four free throws to break his previous

career high of p oints

The Nuggets who had only eight players available for the game improved to on the road and

overall

Ro okie guard playing his second game after missing with back spasms scored a

losers of four in a row seasonhigh p oints for the Kings

Figure A humanwritten newswire summary

Sacramento Ca Michael Adams scored p oints Wednesday night including seven p oint baskets to

help the shorthanded Denver Nuggets to a victory over the Sacramento Kings

Adams made of eld goals including seven of p oint attempts and hit three of four free throws

The Nuggets had only eight players available for the game Guard Travis Mays scored p oints for the

Kings

Figure The rep ort of Fig stripp ed o its historical information

F locationsacramentoCA

F scoringmichaeladamspt

F newhighFscoringmichaeladamsXptlifetimemichaeladams

F scoringmichaeladamspointer

F missplayersdenvernuggets

F teammichaeladamssacramentokings

F beatdenvernuggetssacramentokings

F streakendFdenvernuggetsloss

F score

F timeWednight

Figure The facts in the lead sentence of the rep ort in Fig

Sacramento Ca Michael Adams scored p oints This scoring p erformance was the b est of his entire

career It included seven p oint baskets Michael Adams plays for the Denver Nuggets This team was

missing many players for the game But with the p erformance of Michael Adams Denver managed to defeat

the Sacramento Kings The Nuggets had lost ve consecutive times b efore this game The nal score was

The game was played Wednesday night

Figure Paragraph made of simple sentences paraphrasing the complex lead sentence of Fig

Coordinated clause

David Robinson scored p oints Friday night lifting the to a victory over

Denver and handing the Nuggets their seventh straight loss

Qualifying nonfinite clause in toplevel nominal

David Robinson scored p oints Friday night lifting the San Antonio Spurs to a victory over

Denver sending the Nuggets to their seventh straight setback

Qualifying relative clause in toplevel nominal

David Robinson scored p oints Friday night lifting the San Antonio Spurs to a victory over

the Nuggets that extended Denvers losing streak to seven games

Qualifying relative clause in embedded nominal

David Robinson scored p oints Friday night lifting the San Antonio Spurs to a victory over

the Denver Nuggets who lost for the seventh consecutive time

Apposition to toplevel nominal

David Robinson scored p oints Friday night lifting the San Antonio Spurs to a victory over

Denver the Nuggets seventh straight defeat

Apposition to embedded nominal

David Robinson scored p oints Friday night lifting the San Antonio Spurs to a victory over

the Denver Nuggets losers of seven in a row

Figure Variety of syntactic paraphrases

comparable to those of ana which constitutes an upp erb ound for previous work However the comp ound

need to b e concise and to convey historical background requires generating more complex sentences But

planning the content of a single sentence of such complexity b ecomes at least as dicult as planning the

content of an entire paragraph made of simple sentences This is illustrated by the paragraph in Fig

which conveys the same facts as the lead sentence in Fig

Though complex sentences allow concise expression of multiple facts and are thus very informative

b eyond a certain p oint they can also b ecome unreadable A summary rep ort generator thus needs to carefully

monitor sentence complexity to stay within the readability threshold However such thresholds can only

b e dened in terms of surface form factors such as numb er of words or depth of emb edding Planning a

maximally complex sentence such as those observed in newswire summaries can only b e done under surface

form constraints The standard content planning techniques devised for paragraphs such as textual schemas

McKeown and Rhetorical Structure Theory Hovy cannot b e used for the complex sentences of

summaries precisely b ecause they op erate purely at the conceptual level

Paraphrasing p ower

Conveying oating facts concisely requires attaching them opp ortunistically where the surrounding text

allows These facts thus need to b e expressible by a wide variety of syntactic constructs each suitable to a

particular textual context Even in a xed context surface form variation is also needed one of the surest

ways for a generation system to b etray its articial nature is to always output the same linguistic form when

given the same typ e of information in input

Paraphrasing p ower is illustrated for example in the rep ort of Fig where the same streak concept

is expressed by a VP in the lead sentence b oldfaced on line and an NP in the closing one the very

last constituent of the rep ort In the corpus I analyzed over dierent syntactic constructs were used to

express this concept A sample of those constructs is shown in Fig where the same streak concept

highlighted in b old is conveyed by a dierent construct in each of six synonymous sentences

Note that these constructs dier not only in terms of the syntactic category of the constituent conveying

the streak eg clause in vs NP in and the syntactic device linking this constituent to the rest

of the sentence eg co ordination in vs emb edding in but also in terms of the level at which

this device is used inside the sentence structure In each example the sentence constituent onto which the

streak fact is attached is highlighted in italics eg while in and a relative clause and an app ositive

nominal are resp ectively used to mo dify the NP expressing the whole game result in and these same

two devices are used to mo dify the NP referring only to the losing team which is emb edded deep er in the

syntactic structure

Granularity

A crucial feature of a generator is the granularity at which it translates its conceptual input into a linguistic

output The coarse end of the granularity range includes generators that rely on a phrasal lexicon where

entries are entire phrases each simultaneously expressing several content units The sentences generated

by such systems are thus macro co ded from two to four phrasal entries At the other end of this range are

10

generators that rely on a wordbased lexicon where entries consists of individual words The sentences

generated by such systems are thus micro co ded from words conveying only one or two content units

gossip and ana cf table of Section illustrate these two extremes While b oth generate sum

maries the former is representative of the class of micro co ded generators that also includes fog Bourb eau

et al lfs Iordanska ja et al kamp App elt kalipsos Nogier fn Reiter

epicure Dale igen Rubino comet McKeown et al avdisorI I Elhadad b and

plandoc Kukich et al while the latter is representative of the class of macro co ded generators that

also includes Danlos generator Danlos phred Jacobs pauline Hovy semtex Ro esner

and weiver Inui et al

The fact that ana generates more complex sentences than gossip generalizes to the resp ective classes of

generators to which they b elong Thus existing generators either micro co de simpler sentences or macro co de

more complex sentences Generating the kind of sentences of the corpus I analyzed however requires not

only generating even more complex sentences but also micro co ding them The main reason for this necessity

is that micro co ding is more comp ositional and hence makes paraphrasing p ower easy to scale up with a few

additional single word entries To attain a similar paraphrasing p ower with macro co ding a combinatorially

explosive numb er of phrasal entries need to b e added to the lexicon As in the case of cho osing where and

how to convey oating facts micro co ding a very complex sentence requires exploring a much vaster search

space This is another motivation for the draft and revision mo del I advo cate in this thesis It decomp oses

this overall search into a initial draft phase followed by a set of incremental revision steps

Implicit content

Another go o d strategy for achieving conciseness is to convey content implicitly By implicit content I mean

information recoverable though not sp ecically expressed by any word or linguistic constituent Such content

can b e recovered either from the very structure of the sentence ie by how the explicitly expressed content

units are organized inside of it andor by domain reasoning Consider again the lead sentence of Fig

and the corresp onding facts in Fig The fact F that Adams plays for Denver is not explicitly conveyed

6

in this lead It is nonetheless immediately recoverable from the use of the verb to help linking the main

statistic of the game Adams p erformance to its result Denvers win and the domain knowledge that

11

a player scoring p oints can only help his own team to win the game The expression of F in this

lead sentence is thus implicit scattered into the meaning of severals words each of which indep endently

conveys in itself another content unit In contrast in the paragraph of Fig F is explicitly conveyed by

10

Or short multiword collo cations containing only a single op enclass word for example to make up with where the verb

to make is the only op enclass item would b e a single entry while to make a mistake would need to b e built from the two

entries to make and mistake

11

Had the player scored only one p oint these two facts could not have b een linked by to help while or as expressing

a mere coo ccurrence would b e needed instead

a separate sentence Such implicit content is more frequent in complex sentences packing in many related

facts than in simple ones

This issue inuenced the design of the generator streak esp ecially in terms of the internal representation

of the draft Generating implicit content however involves many issues of its own and this thesis only

12

scratches the surface

Asp ects of the research

The research presented in this thesis addresses the challenging language generation issues describ ed in the

previous section It has four main asp ects

A mo delling and design asp ect an opp ortunistic generation approach

A linguistic knowledge acquisition asp ect a corpus analysis resulting in a set of revision rules to

incrementally generate complex sentences from simple ones

An implementation asp ect the streak generation system

A quantitative evaluation asp ect measures of robustness and p ortability

I survey these asp ects in turn in the following subsections

An opp ortunistic generation mo del

The rst asp ect is the design of a new language generation mo del How the overall generation pro cess can

b e decomp osed into mo dules is still a matter of op en debate among researchers in the eld However the

design of most generators built for practical applications essentially varies on a common theme cf Reiter

Temp orarily ignoring the multisentential asp ect of some generators to fo cus on sentence generation

I revisit in detail the design issues intro duced here within the larger framework of full text generation in

Section this common theme is outlined in Fig

In this traditional sentence generation mo del a comp onent interfacing the generator with the underlying

application program pro duces a sp ecication of the content to convey In the case of summary rep ort

generation from quantitative data this interface is a fact generator that retrieves interesting data from

tables of numb ers and reformats them as conceptual structures suitable for language generation For other

applications this interface may query a database the trace of an exp ert system the interlingua representation

of a text to translate etc

The resulting content sp ecication is passed to a lexicalization comp onent This comp onent maps the

content sp ecication into a linguistic sp ecication of the sentence expressing this content It cho oses the

13

basic syntactic structure of the sentence as well as its op enclass lexical items This linguistic sp ecication

is then passed to a syntactic grammar that enforces syntactic rules such as agreement cho oses closedclass

words inect op enclass words and linearize the syntactic structure into a natural language string

The lexicalizer generally builds the syntactic structure in topdown recursive fashion following the algo

rithm b elow

Cho ose a concept to head the sentence structure

12

In particular I do not discuss its relation to adjacent topics such as usermo deling Paris conversational implicatures

Reiter or argumentation Elhadad b

13

Lexical items are traditionally divided into a op enclass lexical items also called cognates such as nouns verbs adjectives

and adverbs and b closedclass lexical items also called function words such as articles pronouns and conjunctions Op en

classes are large and seemingly ever growing while closedclasses are small and stable Distinguishing elements in an op enclass

requires semantics while in a closedclass it can b e done on syntactic grounds only In that sense prep ositions which are few

but cannot b e distinguished on purely syntactic grounds cf Herskovits are neither really an op en nor a closed class of

words but a little bit of b oth Application Program Interface

Full Content Specification

Lexicalizer

Linguistic Specification

Syntactic Grammar

Natural Language Utterance

Figure Traditional sentence generation mo del

Application Program Interface

Obligatory Content Specification Supplementary Content Specification

Phrase Planner

Lexicalizer

Draft Linguistic Specification Reviser

Syntactic Grammar

Natural Language Utterance

Figure Opp ortunistic sentence generation mo del

Cho ose a verb lexicalizing this head concept

Cho ose which other concepts to map onto each argument of this verb

Cho ose a sentential adjunct for each remaining concept

Recursively cho ose a the syntactic category b head word and c internal structure of each verb

argument and sentential adjunct

The numb er of recursions needed to generate a sentence using this algorithm dep ends on b oth the granu

larity of lexicon phrasal or wordbased and the complexity of b oth the content and linguistic sp ecications

To illustrate this algorithm with an example consider generating with a wordbased lexicon the sen

tence

Michael Adams scored points Wednesday night

from the content sp ecication gamescoringmichaeladamspttimeWednight

For this example step involves cho osing scoring as sentence head as opp osed to say time which would

result in the alternative output The game in which Michael Adams scored points was played Wednesday

night Step involves cho osing the verb to score instead of say to have or to nish with Step

involves cho osing michaeladams as the agent of to score and pt as its patient Step involves

mapping Wednight onto a time adjunct To illustrate step consider for instance the recursive treatment

of the patient argument a involves cho osing an NP b cho osing as the head of this NP and c

cho osing to map as a cardinal premo dier

When generating a simple sentence that conveys only xed facts and for which only a few paraphrases

exist as in the example ab ove each step in this algorithm involves a few candidates and constraints and

is fairly selfcontained And without a space limitation the more problematic oating facts can always b e

conveyed in a subsequent separate sentence For example the sentence ab ove could b e followed by This

scoring performance was the best of his entire career

This is not the case for generating the complex lead sentences of summaries that concisely pack essential

xed facts and related background facts and for which there are many alternative paraphrases For such

sentences the steps in the algorithm ab ove would involve far to o many candidates and constraints to remain

manageable and also b ecome overly interdep endent An example of such complex sentences where back

ground facts are highlighted in b old generated by the streak prototyp e is given b elow

TX matched his season record with p oints and came

o the b ench to add Friday night as the Pho enix Suns handed the their franchise

worst th defeat in a row at home

I prop ose to generate such complex sentences in two passes In the rst pass a simpler sentence contain

ing only the xed facts is generated This simpler sentence serves as draft material for the second pass in

which oating facts are considered in turn by order of imp ortance These oating facts are then opportunis

tical ly woven into the draft through incremental revisions Where how and even whether a oating fact is

incorp orated is constrained by the surface form of the draft

This new generation mo del shown in Fig diers from the traditional mo del of Fig by three

essential prop erties

It decomp oses the content to convey into an obligatory part the xed facts and a supplementary part

the oating facts handling the former prescriptively and the latter opp ortunistically

It decomp oses building the linguistic sp ecication into several incremental steps each one incorp orating

a supplementary content unit

It decomp oses building the initial linguistic draft into two distinct pro cesses phrase planning where

content units are organized inside the sentence structure and lexicalization where individual words

are chosen for each element in that structure

The rst prop erty means that the generator itself instead of the underlying application program has

the nal say ab out what supplementary content to actually include in the sentence allowing linguistic

factors to inuence content determination This is essential to pack as many facts as p ossible in complex

sentences while ensuring that they remain readable The second prop erty multiplies mo dularity by having

the generator taking the steps in the algorithm ab ove at each recursion and for each increment Mo dularity

is also enhanced by the third prop erty that separates the content organization and the linguistic realization

asp ects of sentence generation that the traditional algorithm integrates The phrase planner is in charge of

steps and c content organization leaving steps a and b for the lexicalizer linguistic realization

This separation is motivated by the two factors b elow

The organization of content can b e lexical ly driven only for linguistic constituents at and b elow the

simple clause rank At the complex sentence rank just as at the paragraph rank in the absence

of a head verb whose argument structure predenes the mapping from content units onto linguistic

constituents this organization can only b e rhetorical ly driven

In humanwritten summaries even the most basic sentences containing only xed obligatory facts are

multiclause sentences with several adjuncts They are thus already at the level of complexity requiring

internal rhetorical organization

14

Note that there is no feedback from the lexicalizer to the phrase planner Cases where rhetorical and

lexical constraints interact are b est handled by the reviser

Using this mo del the complex example sentence given ab ove is generated in the six following steps the

supplementary content unit added at each increment is highlighted in b old

Dallas TX Charles Barkley registered p oints Friday night as the Pho enix Suns routed the Dallas

Mavericks

Dallas TX Charles Barkley registered p oints Friday night as the Pho enix Suns handed the Dallas

Mavericks their th defeat in a row at home

Dallas TX Charles Barkley registered p oints Friday night as the Pho enix Suns handed the Dallas

Mavericks their franchise worst th defeat in a row at home

Dallas TX Charles Barkley matched his season record with p oints Friday night as the Pho enix

Suns handed the Dallas Mavericks their franchise worst th defeat in a row at home

Dallas TX Charles Barkley matched his season record with p oints and Danny Ainge added

Friday night as the Pho enix Suns handed the Dallas Mavericks their franchise worst th defeat in

a row at home

Dallas TX Charles Barkley matched his season record with p oints and Danny Ainge came o

the b ench to add Friday night as the Pho enix Suns handed the Dallas Mavericks their franchise

worst th defeat in a row at home

From an AI standp oint this mo del can b e viewed as a heuristic for dealing with the growth of the space

of p ossible linguistic options to search during the generation pro cess in summarization applications This

growth is caused by the comp ound eects of the necessity to handle oating background facts to rep ort

new facts in their context to use complex sentences for their conciseness and to use a wordbased lexicon

for its scalability

But this revisionbased opp ortunistic approach to generation is not only preferable from a system en

gineering standp oint An exp eriment by Pavard describ ed in Section suggests that it is also a

more cognitively plausible mo del for the generation of complex written sentences

A corpus analysis to acquire revision rules

In order to handle supplementary content opp ortunistically the new generation mo del describ ed ab ove re

quires the acquisition of a new typ e of linguistic knowledge structure revision op erations sp ecifying the

14

Much publicized in the research literature App elt Danlos Meteer Rubino yet of marginal

signicance in most practical applications as p ointed out by Reiter

Complex sentence S containing two oating facts

2

TX Buck Johnson scored a season high p oints Thursday night and the Houston Ro ckets

routed the Orlando Magic for their sixth straight win

Sentence S with one less fact than S in the second clause

1 2

Minneap olis MN Po oh Richardson scored a career high p oints Saturday night and the Minnesota

Timb erwolves b eat the

a

Sentence S with one less fact than S and S in the rst clause

0 1

1

Chicago IL scored p oints Saturday night and the defeated the Cleveland

Cavaliers

a

Sentence S with one more fact than S but an only partial ly overlapping syntactic structure

0

1

Hartford CT scored p oints Friday and the handed the their

sixth straight home loss

Figure Paired corpus sentences revealing revision op erations

structural transformations that a given draft must undergo to incorp orate a given new piece of content In

order to abstract these transformations as well as the semantic syntactic and lexical constraints that deter

mine their application I analyzed down to the individual word a corpus of lead sentences from basketball

game summaries compiled over a season of play This corpus analysis and the negrained linguistic data

that resulted from it constitute the second asp ect of the research presented in this thesis

The main idea driving this corpus analysis was to pair sentences that dier semantically by a single

oating fact and identify the minimal syntactic transformation b etween them The example corpus sentences

in Fig illustrate how I carried out this analysis In the sublanguage of basketball summaries the xed

15

facts shared by all sentences are the most signicant statistic by a winning team player the game result

nal scoreline lo cation and date I started from the most complex sentences in the corpus ie those

containing several oating facts in addition to the xed facts Sentence S is an example of such a sentence

2

It contains two background facts in b oldface One app ears in the rst clause and explains what makes the

player statistic conveyed by that clause signicant it is a record The other app ears in the second clause

and it relates the game result conveyed by that clause to the winners previous results they together form

a streak I then searched the corpus for sentences that lacked one of these two oating facts but otherwise

followed a syntactic structure as close as p ossible to that of S S is an example of such sentence It diers

2 1

from S only in that it do es not include a streak and lacks the trailing forprep ositional phrase Sentence

2

pair S S thus reveals that streaks can b e opp ortunistically added to the rep ort by attaching such

2 1

forPPs I then pro ceeded to search for sentences with one less oating fact than S S is an example of

1 0

such sentence It diers from S only in that it do es not convey a record and lacks a premo dier in front of

1

the statistic NP Sentence pair S S thus reveals that records can b e slipp ed into a rep ort by adding

1 0

such NP mo diers

I rep eated this pairwise analysis for each distinct sentence structure It resulted in a set of revision

operations sp ecifying precise semantic and syntactic constraints on where a particular typ e of oating

fact can b e added in a draft and what linguistic constructs can b e used for the addition I classied

this set of revision op erations in a hierarchy The top of this hierarchy distinguishes b etween monotonic

and nonmonotonic revisions The revisions abstracted from the two pairs discussed ab ove are monotonic

the structure of the more complex sentence fully includes that of the simpler one Such revisions allow the

oating fact to b e added without disturbing the rest of the sentence However the structure of some complex

sentences do es not fully overlap with any of the simpler sentence structures conveying one less fact This

a a

is the case for the example of sentence S in Fig S contains one less fact than S and has a similar

0

1 1

toplevel structure and rst clause In S however the game result is expressed verbally by to defeat while

0

15

ie who won and who lost

in S it is expressed nominally by a loss The revision abstracted from this pair is thus nonmonotonic in

1

order to accommo date the additional streak fact as a nominal mo dier in b old the expression of the game

result has b een preallably nominalized This typ e of nonmonotonic revision is one of the most interesting

results of the corpus analysis I carried out for this thesis It empirically conrms the intuition that concisely

accommo dating a new piece of content into a draft sometimes require changing the expression of the original

content in the draft

In addition to revision rules phrase planning and lexicalization rules were also acquired during the

corpus analysis All three were compiled using a corpus analysis metho dology presented in Section This

metho dology is applicable to any domain It allows basing the denition of the range of expressive options

that need to b e implemented in the generator on systematic empirical data

An implemented revisionbased generator

The third asp ect of the research presented in this thesis is the development of the prototyp e system streak

implementing the new draft and revision mo del of language generation This implementation demonstrates

the op erationality of b oth the new generation mo del and the new typ e of linguistic knowledge acquired

during the corpus analysis

For this implementation I built on a preexisting software environment dedicated to the development of

language generation systems the fufsurge package Elhadad a Elhadad b fuf Functional

16

Unication Formalism is a programming language based on functional unication Kay Both the

input and the output of a fuf program are features structures called Functional Descriptions FDs The

program itself called a Functional Grammar FG is also a feature structure but one which contains

disjunctions and control annotations The output FD results from the unication of this FG with the input

FD The disjunctions in the FG make unication nondeterministic

Since it implements the mo del of Fig streak consists of the four comp onents shown in that gure

Each one of them consists of an interpreter and a declarative knowledge source for a total of eight mo dules

The fufsurge package provided three out of four interpreters and most of one the knowledge sources

The interpreters for the phrase planner lexicalizer and syntactic grammar rely on the topdown recursive

unication mechanism builtin in fuf This mechanism is inherently monotonic Since the reviser needs to

p erform the nonmonotonic revision op erations identied during the corpus analysis I develop ed a dierent

interpreter for the reviser It relies on new cut and paste op erations implemented using systemlevel fuf

17

functions

fuf comes as a package with surge a syntactic grammar of English usable as a frontend p ortable across

generation applications The version of surge available when I started the implementation of streak had

a widecoverage at the simple clause and determiner ranks but not at the complex sentence and nominal

ranks Drawing b oth on a variety of noncomputational descriptive linguistic works such as Quirk et al

and on the syntactic data compiled during the corpus analysis I extended surge at these two ranks

At the complex sentence rank the coverage of the resulting surge version of the grammar go es way

b eyond the sp ecic sp orts sublanguage needed for streak

These two extensions to the fufsurge package the revision rule interpreter and surge were

essentially preparatory work paving the way for the core implementation of streak This core consisted

of enco ding the linguistic data compiled during the corpus analysis as three declarative knowledge sources

the phrase planning rule base the lexicalization rule base and the revision rule base each represented as an

FG

Both the example of incremental complex sentence generation given in Section and the example

of syntactic paraphrasing given in Fig are actual runs of streak Note the nonmonotonic revisions

from steps to and to where the verbs in italics to rout and to register are resp ectively deleted

16

fuf is implemented in Common Lisp

17

The most complex of these functions have b een implemented sp ecially for the development of streak by fufs creator

MElhadad to whom I am therefore deeply indebted

without loss of content to make ro om for additional background facts

A quantitative evaluation of the opp ortunistic mo del and revision rules

The fourth and last asp ect of the research presented in this thesis is a quantitative evaluation It estimates

the samedomain robustness and crossdomain p ortability of b oth the new generation mo del and the new

linguistic knowledge structures on which this mo del is based The initial corpus from which the linguistic

knowledge structures were acquired covered a season of basketball rep orting To assess their robustness I

used two subsequent years of basketball rep orting as test corp ora To assess their p ortability across domains

I used a corpus of sto ck market rep orts as a test

I evaluated two asp ects of robustness coverage and extensibility I rst dened a set of parameters

assessing the coverage of the target sublanguage for the application domain of streak that could b e attained

by analyzing a one year sample of that sublanguage Dierent parameters were used for measuring the impact

on such coverage limit of using

The knowledge structures needed by the onepass macro co ded mo del of previous systems generating

multiclause sentences

The knowledge structures needed by the draft and revision micro co ded mo del prop osed in this thesis

I also orthogonally dened dierent parameters for dierent typ es of coverage conceptual rhetorical and

paraphrasing p ower in addition to the overall realization coverage I then dened a set of similar parameters

but this time for measuring the extensibility of the resp ective approaches The coverage parameters answered

the question with the knowledge structures acquired by analyzing a year sample of the sublanguage how

many sentences from a dierent year sample could b e generated In contrast the extensibility parameters

answered the question how many new knowledge structures are needed by a generator that was based on

one year sample in order to also fully cover a dierent year sample The results of this evaluation given

in Section shows that the new generation mo del prop osed in this thesis dramatically pushes back the

coverage limit of the sublanguage studied It also though less sp ectacularly improves extensibility

Having established the samedomain robustness of a generator using the revision rules acquired from the

original sp orts domain I then evaluated the p ortability of these rules to another domain I thus carried out

on a corpus of sto ck market rep orts the same pairwise analysis that I used to abstract these rules on the

a

original corpus For example consider again the sentence pair S S of Fig repro duced b elow

0

1

a

S the Utah Jazz handed the Boston Celtics their sixth straight home loss

1

S the Chicago Bulls defeated the

0

from which the revision rule Nominalization with Ordinal Adjoin was abstracted

b b

The following phrase pair S S from the test corpus shows that the same rule can b e used for

1 0

incrementally generating sto ck market rep orts as well

b

S the Tokyo sto ck market p osted its third consecutive decline

1

b

S the Tokyo sto ck market plunged

0

a b

As in the case S S ab ove the result verbally expressed in S by to plunge is nominalized as

0

1 1

its decline in order to b e subsequently premo died by the ordinal third consecutive which conveys the

additional streak information

The revision rules were classied as a hierarchy The p ortability evaluation shows that of the

branches in this hierarchy are p ortable Given the conceptual distance b etween the sp orts and nancial

domains these results suggest that b oth the generation mo del and linguistic data presented in this thesis

could b e successfully reused to develop summarization systems in multiple quantitative domains such as

those listed at the b eginning of this intro duction

Contributions

The research presented in this thesis makes eight contributions to the eld of natural language generation

Conveying the historical context in rep ort generation

Planning and realizing complex sentences

Making the generation pro cess more exible by allowing decisions to b e made opp ortunistically and

under a wider range of constraints

Making the generation pro cess more comp ositional by building sentences incrementally and from indi

vidual words

Abstracting revision op erations for incremental generation

Extending SURGE to complex sentences and quantitative nominals

Quantitatively comparing generation mo dels

Quantitatively evaluating the p ortability of knowledge structures used for generation

I briey elab orate on each of them in the rest of this section

Conveying the historical context in rep ort generation streak is the rst summary generator to

provide the historical background of the new facts it rep orts It thus not only summarizes a particular event

but also contextualizes it This constitutes a signicant step towards bridging the gap b etween computer

generated rep orts and their humangenerated counterparts Conveying historical information allows for a

much broader coverage A generator ignoring such information could at b est cover of the sentences from

the corpus of humanwritten summaries I analyzed Taking into account the historical context also allows

for a smarter choice of the new facts to rep ort since the relevance of a new fact is largely dep endent on its

historical signicance

Planning and realizing complex sentences streak is the rst generator that delib erately attempts

to pack as many facts as p ossible within a single sentence As a result streak generates sentences up to the

maximum complexity level observed in humanwritten summaries words long level deep conveying

facts These sentences are signicantly more complex than those generated by any previous system The

development of streak required addressing for the rst time at the sentence rank the full blown content

planning issues previously investigated only at the paragraph rank

Making the generation pro cess opp ortunistic and more exible streak is the rst system to han

dle obligatory content prescriptively and supplementary content opp ortunistically and under surface form

constraints This makes the generation pro cess far more exible In particular it allows using linguistic fac

tors to monitor content planning a requirement when concisely expressing the content of a whole paragraph

in a single sentence without letting that sentence grow so complex as to b e unreadable

Making the generation pro cess more comp ositional and scalable streak generates sentences more

comp ositionally than any previous generators It is comp ositional in three orthogonal ways by building

sentences from individual words by separating the organization of content units inside the sentence from the

lexicalization of the individual units and by building sentences incrementally starting from a basic draft into

which additional content units are gradually incorp orated This extreme comp ositionality makes streak

more easily scalable and p ortable

Abstracting revision op erations for incremental generation The hierarchy of revision op erations

presented in this thesis constitutes a new typ e of linguistic knowledge It is the rst set of linguistic resources

to b e simultaneously based on an extensive corpus analysis and sp ecically geared towards incremental

language pro cessing Their op erationality is demonstrated by their implementation in the revision rule base

of streak Their relevance to other quantitative domains is shown by their high degree of p ortability from

the domain where they were acquired to a new domain

Extending SURGE to complex sentences and quantitative nominals surge is easy to use thanks

to the uniform bidirectional and declarative formalism of FUGs on which it is based The extensions I made

to surge at the nominal and complex sentence ranks endow it with wide coverage at all four sentential

linguistic ranks determiner nominal simple clause and complex sentence This extended coverage combined

with its easy of use makes surge the b est p ortable syntactic pro cessing frontend for the development of

generation applications available to day

Quantitatively comparing generation mo dels The robustness evaluation I carried out is the rst

attempt to quantify how much of a given sublanguage can b e captured by various knowledge structures

used for language generation and acquired from samples of this sublanguage It is also the rst quantitative

comparison of dierent generation mo dels It establishes the sup eriority of the new micro co ded revision

based generation mo del over the traditional onepass macro co ded generation mo del of previous systems

for the generation of multiclause sentences There is hardly any previous work in quantitative evaluation

metho ds for generation

Quantitatively evaluating the p ortability of knowledge structures used for generation The

p ortability evaluation of the revision op erations describ ed in this thesis is the rst attempt to quantitatively

assess the domainindep endence of knowledge structures used for generation and acquired from texts in a

single domain It establishes the relevance of b oth these revision op erations and the opp ortunistic generation

mo del that relies on them to other quantitative domains

Guide to the rest of the thesis

In the next chapter I describ e the original corpus analysis that resulted in the hierarchy of revision op erations

needed by the new generation mo del prop osed in this thesis In the subsequent chapter I present in detail

a complete system architecture for text generation based on this mo del In Chapter I turn to the imple

mentation of this mo del in the system streak In Chapter I present the quantitative evaluations of the

samedomain robustness and crossdomain p ortability of this revisionbased generation mo del In Chapter

I compare the research presented in the previous chapters to related work in the elds of summary rep ort

generation revisionbased generation incremental generation and evaluation in generation In Chapter I

conclude by revisiting the contribution this thesis makes to each of these elds and discussing its limitations

and future directions within the new research framework that it denes In App endix A I present in full

detail the revision rule hierarchy abstracted from the corpus analysis of Chapter In App endix B I describ e

the fufsurge package underlying the implementation of streak including the extensions to the syntactic

grammar In App endix C I provide a set of example runs of streak some of them together with the full

input and intermediate representations used by the system during the run Finally App endix D gives an

18

overview of crep a software to ol that I used for partially automating the various corp ora analysis that I

p erformed during the knowledge acquisition and evaluations stages of this research

18

And designed in collab oration with Darrin Duford who implemented it

Chapter

Corpus analysis of newswire rep orts

Motivation and metho dology

1

In this chapter I present a corpus analysis of newswire rep orts summarizing basketball games An example

rep ort from this corpus was given in Fig of Section Another example is given in Fig in Section

The corpus consisted of rep orts covering an entire professional basketball season Almost all of them

contained some historical information

The overall goal of this corpus analysis was threefold

Identify the set of linguistic resources available to convey sp ecically historical information

Provide criteria for a generation architecture able to concisely and exibly convey historical information

Provide data for the knowledge sources of a prototyp e system based on such an architecture

The core of this analysis involved distinguishing the dierent typ es of information expressed in the corpus

and for each of these typ es the linguistic forms used for its expression Such a detailed semantic analysis can

2

only b e carried out empirically by hand It was decomp osed in ve successive steps each with its particular

goal

Focusing on a restricted category of corpus sentences This restricted category dened b oth a sub

corpus for systematic and indepth analysis and a realistic target output for the prototyp e system

streak

Dening the ontology of the subcorpus This involved identifying the dierent typ es of entities men

3

tioned in the corpus as well as the dierent typ es of facts expressed ab out these entities

Identifying the semantic constraints on concept grouping inside the corpus sentence constituents These

constraints can b e viewed as schemata for phrase level content organization In streak they are

implemented in part in the phrase planning rules and in part in the revision rules

Identifying the various syntactic forms used in the corpus to express each concept combination This

resulted in a set of mappings from semantic to syntactic structures These mappings which I call

realization patterns capture the paraphrasing p ower of the domain sublanguage In streak they are

implemented in part in the lexicalization rules and in part in the revision rules

1

These rep orts were taken from the UPI newswire

2

I use corpus analysis as a metho dology for semantic and syntactic knowledge acquisition My work thus p ertains to the

know ledgebased tradition of NLP research and not to the statistical tradition of NLP research that is at times glossed as

corpusbased NLP in its most recent revival Statistical NLP attempts to build systems without any knowledge but which

instead rely on correlation b etween their input and o ccurrence probabilities computed on huge textual corp ora

3

Where there is no risk of ambiguity I will refer to the subcorpus simply as the corpus

Identifying a set of revision tools to attach oating facts to basic sentence structures as wel l as the

semantic syntactic and lexical constraints on their applicability This was carried out by a pairwise

comparison of realization patterns diering by only one concept Each revision to ol corresp onds to

a class of structural transformations to map the simpler realization patterns into the more complex

patterns with one more concept These revision to ols are implemented in the revision rules of streak

These ve steps are presented in order in the following subsections Together they dene a corpusbased

knowledge acquisition metho dology that allows identifying the range of expressive options that need to b e

implemented in a generator based on systematic empirical data This systematism in turn allows thorough

testing of the generator and quantitative evaluations of its output The dierential analysis of realization

patterns denes an approach to extract from text linguistic knowledge that is sp ecically geared towards

incremental pro cessing such as revision to ols

Fo cusing on a subcorpus

In order to fulll its goals the corpus analysis needed to b e simultaneously semantic syntactic very ne

4

grained and systematic Such an ambitious analysis could not have b een carried out on the total corpus

ie on each and every sentence of rep orts It was necessary to fo cus on a restricted category of corpus

sentence I dened this restricted category in two steps I rst made observations ab out the structure of

the corpus rep orts to cho ose a discursive fo cus I then distinguished b etween dierent typ es of information

conveyed in the corpus to cho ose a semantic fo cus

Rep ort structure and discursive fo cus

Journalism textb o oks dene a variety of standard rep ort structures The corpus rep orts are organized

following the socalled inverted pyramid structure with summary lead Fensch meaning that crucial

information is packed in the lead followed by facts of decreasing imp ortance towards the end of the rep ort

Summary typ e leads often consisting of a single sentence in the corpus rep orts are thus selfcontained mini

rep orts containing the basic facts This typ e of rep ort structure is not particular to sp orts rep orting but is

p ervasive in newswire articles Mencher It is preferred b ecause newswire rep orts are essentially used as

draft material to b e edited by client newspap ers under heavy timepressure and stringent space constraints

An inverted pyramid structure with a summary lead makes a rep ort instantly editable by cutting its tail

When space is really scarce only the lead remains

This particular structure of the corpus rep orts allowed me to narrow the discursive scop e of the analysis to

a single sentence p er rep ort the lead sentence This subcorpus of lead sentences was of manageable size for a

systematic manual analysis and at the same time allowed fo cusing on the two main linguistic phenomenon of

interest in this thesis summarization and historical contextualization In the literature ab out summarization

eg Borko an imp ortant distinction is made b etween indicative abstracts which provide metalevel

information on the text itself its goal structure style etc and informative abstracts which express the

content of the text in a less detailed and more compact form Summary leads are informative abstracts of the

rest of the rep ort Moreover of them contained historical information That such historical background

was part of the essential facts in a ma jority of rep orts conrms the imp ortance of historical information It

also allowed the detailed investigation on how its gets folded with new information into complex sentences

Content typ es and semantic fo cus

In order to dene a semantic fo cus for the negrained analysis I rst had to carry out a highlevel semantic

analysis of al l lead sentences in the corpus This highlevel analysis consisted of distinguishing dierent

typ es of information conveyed in these sentences Among these dierent typ es I singled out one particular

4

ie down to the individual word whenever necessary

typ e of new information boxscore facts and two particular typ es of historical information records and

streaks

A b oxscore is a table containing a standard set of facts for one game In daily newspap ers there is

one b oxscore accompanying each game rep ort It essentially contains data quantifying the various asp ects

scoring reb ounding passing etc of the total p erformance of each player and team for the entire game An

example b oxscore with the corresp onding rep ort is given in Fig In this rep ort negrained statistics not

available in the b oxscore such as p erformances over a short time interval during the game are emphasized by

an italic font historical information is as usual emphasized by a b oldface font Such ner grained statistics

come from complete game charts Anderson Boxscores are available online through newswire and

5

sp ort statistic computer services whereas game charts are not Therefore only b oxscores constitute a

realistic tabular input for a rep ort generation system This is the rst reason to cho ose b oxscore facts as

the semantic fo cus for new information There are three other reasons

They are the most common typ e of new information It was the sole typ e of new information in

6

of the corpus sentences

They are the typ e of new information most often asso ciated with historical information of the

corpus sentences in which b oxscore statistics was the sole typ e of new information also contained

historical information Only of the rst sentences containing new information other than b ox

score facts also contained historical information

They are not conceptually idiosyncratic to basketball or sp orts Other quantitative domains contain

statistics that are direct conceptual equivalent of b oxscore facts in the basketball domain For instance

the nal value of a nancial index in the sto ck market domain eg IRT property ended at

directly corresp onds an the end of the game statistic in the sp orts domain eg

nished with points

A record is the prop erty of a statistic to constitute a maximum or a minimum in a set of similar statistics

over some p erio d of time eg Michael Adams scored a careerhigh points A streak is the prop erty of

the game result to extend or interrupt a series of similar results for a given team eg the Denver Nuggets

ended a vegame losing streak with a victory over the Sacramento Kings

There are three reasons for cho osing records and streaks as the semantic fo cus for historical information

They can b e easily maintained from b oxscores they are also available online through sp orts statistic

computer services

They are the most common typ e of historical information together records and streaks were the sole

typ e of historical information in of the rst sentences

They are p ervasive in any quantitative domain as illustrated by the following examples of streaks

loss and from nance eg the Dow Jones Average of Industrials recorded its fourth straight

meteorology in New York the temperature remained below zero for the fth consecutive day

Fo cusing on b oxscore facts records and streaks denes a subcorpus of lead sentences for negrained

semantic and syntactic analysis The sp ecial prop erties of these three typ es of information listed ab ove insure

that this subcorpus

Constitutes a representative core of the whole corpus

Is ideal to study how historical background gets opp ortunistically woven into sentences conveying

related new facts

Covers the most realistically available online input data for the domain at hand

Should yield results and linguistic knowledge that generalize to other quantitative domains

5

To the b est of my knowledge

6

Note also in Fig that most of the information conveyed in the rep ort comes directly from the b oxscore

ORLANDO MAGIC

fg ptfg ft rb

Players mn ma ma ma ot bl st as to pf tp

Catledge

Reynolds

Rob erts

Vincent

Anderson

Kite

OSmith

Skiles

Scott

Je Turner

Willia ms

Team totals

Team

HOUSTON ROCKETS

fg ptfg ft rb

Players mn ma ma ma ot bl st as to pf tp

Johnson

Thorp e

Ola juwon

Maxwell

KSmith

Rollins

Floyd

Herrera

Jo Turner

Bullard

Team totals

Team

ORLANDO Fla UPI scored p oints Sunday night to pace the Houston Ro ckets to a victory

over Orlando giving the Magic their leaguehigh th straight loss

Hakeem Ola juwon contributed p oints reb ounds and eight blo cked shots b efore fouling out with remaining

in the game

The Magic led at halftime but Houston outscored Orlando in the rst eight minutes of the third quarter

to take the lead for good

Smith converted of shots from the eld and dished out nine assists to give the Ro ckets their sixth win in

eight meetings with Orlando

Scott Skiles provided spark of the b ench with p oints seven assists and four steals for the Magic which lost for

the th time in games

Otis Thorp e added p oints and reb ounds for the Ro ckets scored p oints despite for

sho oting from the eld

Orlando was outreb ounded and shot just p ercent from the eld The Magic have dropp ed their last

six home games

Font conventions in the rep ort italics information not coming from the boxscore b oldface historical information

Font conventions in the b oxscore b oldface statistic included in the rep ort

Abbreviations in the b oxscore mn MiNutesplayed fg FieldGoals ft FreeThrows rb ReBounds as ASsists st

STeals bs Blo cked Shots to TurnOver pf Personal Foul tp Total Points pt threePoinT m Made a

Attempted o Oensive t Total

Figure Standard b oxscore and corresp onding natural language rep ort Stat(WP) "Majerle scored 33 points"

Stat(LP) "44 points from Wilkins"

"Hardaway and Mullin Statistic Stat(WPG) combined for 67 points" "Phoenix shot 60.2% Stat(WT) from the field"

Box−Score Stat(game) Result(game) "Utah’s 99−88 win Fact over Denver"

"Reserve guard Quality(WP) Jay Humphries"

Quality Loc(game) "Phoenix, Az"

Quality(game) Time(game) "Friday night"

Length(game) "a triple−overtime victory" Streak Streak(WT) "its third straight win" Streak Extension Update Streak Streak(LT) "snapped their losing Interruption streak at 16 games"

Streak "their franchise−best Break Record(Streak) Record Sub−Onto 20 game win streak" Record Update Equal Statistic Record(Stat) "a career−best

Record Sub−Onto 33 points"

Figure Domain ontology

This subcorpus also denes the subdomain for the prototyp e system streak The ontology of this

subdomain is presented in the next section

Domain ontology

In this section I distinguish and exemplify the various typ es of information conveyed in the subcorpus dened

in the previous section The toplevels of the ontology dened by these subtyp es is shown in Fig

At the top level categories in this ontology are the three ma jor classes of facts that dened the semantic

scop e of this subcorpus one class of new facts b oxscore facts and two classes of historical facts records

and streaks ab out these b oxscore facts

Boxscore facts rst sp ecialize into statistics and qualities Then each category further sp ecializes in

terms of the domain entity they concern either an individual player a group of player an entire team

or the game as a whole Statistics ab out individual players sp ecialize into those concerning players of the

winning team statWP and those concerning players of the losing team statLP Each case is further

rened in terms of unit p oints reb ounds assists etc The only cases of statistics concerning either a group

of players statWPG or a whole team statWT were restricted to the winning team Corresp onding data

ab out the losing team was never deemed imp ortant enough in the app ear in the lead sentence Similarly

only qualities concerning players from the winning team qualityWP were conveyed in the lead The only

statistic ab out the whole game is its result sp ecifying the winner loser and nal score Though p erquarter

scores are available in standard b oxscores none of those app eared in the lead sentences Qualities ab out

the game include its lo cation time and length

Streak up dates sp ecialize along two orthogonal dimensions streak extensions vs streak interruptions

and streak of the winning team streakWT vs streak of the losing team streakWT Similarly record

up dates sp ecialize into breaking vs equaling typ es and orthogonally into records of game statistics and

records of streak lengths Record game statistics are further rened in terms of the various typ es of game

statistics from the subontology of game statistics dened ab ove Similarly record streaks are further rened

in terms of the subontology of streaks

The b ottom level of the ontological hierarchy where all lowlevel distinctions dierentiating various statis

tics are taken into account contains concepts

Corpus sentence structures

Toplevel structure concept clustering

The semantic structure of corpus sentences is dened by constraints on concept coo ccurrence ie what

concept combinations app ear inside particular sentence constituents Their surface structure is dened by

constraints on syntactic dep endency b etween the constituents realizing the various elements of each app earing

concept combination

I made two main observations on concept coo ccurrence inside whole corpus sentences The rst obser

vation is that the numb er of facts in these sentences varies from a minimum of four to a maximum of

The second observation is that each sentence contains one instance of each of the four following concepts

Game result resultgame

Game lo cation locgame

Game time timegame

Winning player statistic statWP

A game has only one result lo cation and time In contrast most corpus sentences contain several

instances of the concept winning player statistic However only one of these winning player statistics is

the fo cus of the sentence I call this particular statistic the main statistic statWP of the concept

combination

The conclusion of these two observations is that any concept combination app earing in a corpus sentence

is made of

The basic concept combination statWP resultgame locgame timegame

From zero to eight additional concepts from the domain ontology presented in Section

At the abstract concept combination level all corpus sentences are thus semantic sup ersets of the basic

four concept sentences

To discover the constraints on syntactic constituent dep endencies inside the corpus sentences I rst

lo oked at these basic sentences I observed that they all follow one of only two highlevel syntactic structures

These structures are illustrated on two example sentences in Fig In this gure sentence constituents

are represented by b oxes with their concept at the top and their text at the b ottom These b oxes are linked

by lines showing the dep endency relations b etween the constituents In b oth these sentences the structural

status of the syntactic constituents that resp ectively express the lo cation and the time of the game is identical

The lo cation is expressed by a constituent prexed to the rest of the sentence as a parenthetical while the

time is expressed by a oating adjunct that is attached to dierent constituents in dierent corpus sentences

What sets these two structures apart is the relation b etween the constituents that resp ectively express the

main statistic and the game result In the rst structure they are in parataxis while in the second they are

in hypotaxis with the main statistic as head In this pap er I use the notions of parataxis and hyp otaxis

dened in Halliday b ecause they are general relations b etween syntactic constituents o ccurring at al l

linguistic ranks sentence clause group Two constituents are in parataxis if they are b oth at the same

structural level Parataxis is thus a general symmetric relation covering b oth co ordination and app osition

In contrast two constituents are in hyp otaxis if it is p ossible to distinguish one as a main element and the

other as a dep endent element Hyp otaxis is thus a general asymmetric relation covering b oth headargument

dep endency and headadjunct dep endency I take clause sub ordination to b e a sp ecial case of hyp otaxis

where b oth the head and the dep endent are clauses

When I lo oked at the complex corpus sentences I observed that they have the same structural charac

teristics as the basic sentences In these complex sentences the lo cation is still expressed as a parenthetical

prex the time is still expressed by a oating adjunct there are still two toplevel constituents one in

cluding the main statistic and one including the game result and these two constituents are still either in

parataxis or in hyp otaxis with the constituent including the mainstatistic as head The only dierence from

the basic sentence structures is that additional concepts are now group ed with the main statistic andor

with the game result in these two toplevel constituents

An example of a hyp otactically structured complex sentence is shown in Fig This sentence contains

nine facts the four basic facts with their concept highlighted in b old three historical additional facts in

dashed b oxes and two nonhistorical additional facts Compare this sentence with the basic sentence at the

b ottom of Fig They b oth share the same highlevel structure hyp otaxis with the constituent including

the main statistic as head and the constituent including the game result as dep endent In the complex

sentence however the head constituent contains four additional facts and the dep endent constituent contains

one additional fact The corpus also contained paratactically structured complex sentences with additional

facts group ed with the main statistic andor with the game result

In short the toplevel structure of any corpus sentence is one of the two basic structures illustrated in

Fig with additional facts clustered either around the main statistic or around the game result Any

corpus sentence is thus made of two halves In the rest of this thesis I call the half containing the main

statistic the statistic cluster and the half containing the game result the result cluster A systematic analysis

of these two clusters revealed that whether a fact is part of the statistic cluster or the result cluster is directly

dep endent on its ontological class Among the classes of additional facts from the domain ontology shown

in Fig only one statWPG ie statistic of winning player group was part of the statistic cluster in

some corpus sentences and part of the result cluster in some others The instances of all the other classes

consistently app eared in the same cluster This concept clustering phenomenon is illustrated in Fig

It shows where additional instances of each concept gets attached These semantic constraints on concept

clustering can b e viewed as schemata McKeown for sentencerank planning

Figure also shows that the overall structure of any corpus sentence dep ends on only three factors

The toplevel relation b etween the two clusters either parataxis or hyp otaxis with the statistic cluster

as head

The internal structure of its statistic cluster

The internal structure of its result cluster

These corpus observations show that any complex sentence can indeed b e generated in two steps

pro duce a basic sentence realizing the obligatory content units incrementally revise it to incorp orate

supplementary content units Furthermore they indicate that supplementary content units can b e attached

within a cluster based on local constraints thus simplifying b oth generation and the rest of the corpus

analysis When I pursued the analysis inside each cluster I split the whole sentence corpus into two subsets

one containing statistic clusters and the other result clusters This cluster internal analysis is discussed in

the next section Parataxis : "New York − Kiki Vandeweghe scored 21 points Friday Night and the routed the 106−79."

& and

stat(WP) result(game)

Kiki Vandeweghe the New York Knicks routed scored 21 points the Philadelphia 76ers 106−79

time(game) loc(game) Friday night New York

Hypotaxis : "Charlotte, Va − scored 41 points to lead the New York Knicks to a 97−79 victory over the Tuesday night."

stat(WP)

Patrick Ewing scored 41 points

time(game) result(game) loc(game) Tuesday night to lead the New Yorks Knicks to

Charlotte, Va a 97−79 victory over the Hornets

Figure Toplevel structures of basic sentences

&

and

stat(WP1) &

Karl Malone and scored 28 points stat1(WP2) stat2(WP2) streak(WT)

leading the Utah Jazz to record(stat1(WP2)) record(stat2(WP2)) its 4th straight victory

John Stockton added handed out a a season−high 27 points league−high 23 assists result(game) loc(game) a 105−95 win over the Los Angeles time(game) Saturday

"Los Angeles − Karl Malone scored 28 points Saturday and added a season−high 27 points and

a league−high 23 assists leading the Utah Jazz to its 4th straight victory, a 105−85 win over the Los Angeles Clippers."

Figure Complex sentence with toplevel hyp otactic structure Hypotaxis or Parataxis

stat(WP) result(game)

loc(game) stat(WP) time(game) record(streak)

streak quality(WP) stat(WPG) length(game) record(stat(WP)) stat(WT)

stat(LP)

Statistic cluster Result cluster

Figure Concept clusters

Internal cluster structure realization patterns

The analysis of the toplevel corpus sentence structure presented in the previous section resulted in a set

of semantic constraints to guide sentence planning These semantic constraints dene the set of concept

combinations observed in sp ecic corpus sentence constituents that I called clusters In order to identify

linguistic constraints on sentence realization I then lo oked for each clustered concept combination at the

set of syntactic structures realizing it in the corpus I call a mapping from a concept combination onto a

syntactic structure a realization pattern

Four examples of realization patterns are given in Fig I represent realization patterns as arrays with

each column corresp onding to a syntactic constituent and each row corresp onding to some level of information

7

ab out this constituent The rst row indicates the semantic element that the constituent realizes the second

row its grammatical function the third row its structural status ie head argument adjunct or conjunct

8

and the fourth its syntactic category Below each realization pattern I give two example phrases from the

corpus illustrating the pattern

9

Realization patterns capture the structural paraphrasing p ower of the corpus sublanguage In Fig

patterns R and R capture two alternatives ways to convey a game result Patterns R and R capture

d1 d2 i1 i2

two alternative ways to convey a game result together with a streak up date that this result triggers These

patterns abstract from lexical material and syntactic details eg connectives mo o d to fo cus on represent

ing the dierent mappings from semantic structure to syntactic structure Consider for example the two

alternative patterns R and R given in Fig for the realization of the same concept combination C

i1 i2 i

7

Some rows are double A single line separates the two halves of a double row whereas a double line separates two dierent

rows

8

The particular grammatical functions structural relations and syntactic categories used in this thesis are those of surge

and are describ ed in detail in App endix B

9

But abstracts from the more pro ductive grammatical and lexical paraphrasing cf Section for a detailed discussion

of these dierent typ es of paraphrasing

C gameresultwinnerloserscorestreakwinnerasp ecttyp elength

i

C gameresultwinnerloserscore

d

R C

i i

winner gameresult loser score length streakasp ect typ e

agent pro cess aected score frequency

arg head arg adjunct adjunct

prop er verb prop er numb er PP

prep det ordinal adj noun

Chicago b eat Pho enix for its rd straight win

New York defeated Seattle for its th consecutive victory

R C surface decrement of R C

d d i i

winner gameresult loser score

agent pro cess aected score

arg head arg adjunct

prop er verb prop er numb er

Seattle defeated Sacramento

Detroit routed Boston

R C

i i

winner asp ect typ e streak length score gameresult loser

agent pro cess aectedlocated lo cation instrument

arg head arg adjunct adjunct

prop er verb NP PP PP

det participle noun prep det numb er noun PP

Utah extended its winning streak to games with a triumph over Denver

Boston stretching its winning spree to outings with a rout of Utah

R C surface decrement of R C

d d i i

winner score gameresult loser

agent pro cess range

arg head arg

prop er supp ortverb NP

det numb er noun PP

Chicago claimed a victory over New Jersey

Orlando recorded a win against Dallas

Figure Realization pattern examples

10

In these two patterns the semantic elements streak and aspect are mapp ed onto syntactic constituents

in drastically dierent ways In pattern R they are b oth expressed by an adjective eg straight or

i1

consecutive inside a purp ose adjunct PP In contrast in pattern R aspect is expressed by a head verb

i2

eg to extend or to stretch and streak by a head noun eg streak or spree of the ob ject NP

I identied the realization patterns of each clustered concept combinations nding a total of patterns

for combinations Each of these realization patterns can b e obtained by applying a set of information

adding revisions to one of the three realization patterns of the basic four concept combination In this thesis

instead of going through the list of the patterns resulting from these revisions I fo cus on the more concise

presentation of the revision op erations themselves Example of these op erations are given in the next section

Their exhaustive list is given in App endix A

Revision to ols

Dierential analysis of realization patterns identifying revision to ols

Realization patterns sp ecify syntactic structures for entire concept clusters For example consider again the

complex sentence from the corpus whose structure was analyzed in Fig

Karl Malone scored p oints and John Sto ckton added a seasonhigh p oints and handed out a league

high assists leading the Utah Jazz to its fourth straight victory a win over the Los Angeles

Clipp ers

This sentence contains two realization patterns

One pattern for its ve fact statistic cluster Karl Malone scored points and John Stockton added a

seasonhigh points and handed out a leaguehigh assists

Another pattern for its two fact result cluster leading the Utah Jazz to its fourth straight victory a

win over the Los Angeles Clippers

Such realization patterns are thus to o coarse grained to b e usable by a revisionbased generator exploiting

comp ositionality inside clusters The linguistic data needed by such a system is a set of subpatterns sp ecifying

the realization of a single additional concept in the context of a preexisting draft cluster structure For

example one such subpattern would sp ecify that in the context of a draft game result cluster like leading

the Utah Jazz to a win over the Los Angeles Clippers an additional streak concept can b e realized

by an app osition like its fourth straight victory

In order to identify these contextual subpatterns I carried out a dierential analysis of the cluster

realization patterns A pictorial description of this dierential analysis is given in Fig This analysis is

based on the notions of semantic decrement and surface decrement A cluster C is a semantic decrement

d

of cluster C if C contains all but one of C s concepts Each cluster has a set of realization patterns

i d i

asso ciated with it The surface decrement of a realization pattern of C is the realization pattern of C

i d

that is structurally closest Structural distance is dened as the numb er of matching cells in the arrays

representing their resp ective realization patterns In cases of tie the cells of some rows such as the one for

the constituent realizing the semantic head are given more weight

Figure shows a semantic decrement pairing C a single concept with C which contains two con

d i

cepts Both clusters have two realization patterns asso ciated with them as they each can b e realized by two

dierent syntactic structures These four syntactic structure patterns must b e compared to nd the surface

decrements Since R is entirely included in R and R is not R is the surface decrement of R Such a

d1 i1 d2 d1 i1

case of inclusion is the simplest typ e of relation b etween a pattern and its surface decrement However the

maximum overlap b etween a realization pattern and its decrement is sometimes only partial To identify the

10

The asp ect of a streak up date dierentiates streak extensions from streak interruptions It is an entirely dierent notion

from the grammatical asp ect of a clause Concept Clusters Space of Realization Patterns

Linguistic Realization Ci = {U1 ... Un−1,Un} Rp(Ci) Rq(Ci)

Semantic Surface Surface Decrement Decrement Decrement

Cd = {U1 ... Un−1} Rx(Cd) Ry(Cd)

Linguistic Realization

Figure Dierential analysis of realization patterns

surface decrement of R for example we need to compare it to R and R in turn In R only the rst

i2 d2 d1 d1

column matches a column in R In contrast in R in addition to the rst column columns to also

d2 d2

if only partially match some column in R In particular the semantic head gameresult is mapp ed

d2

onto a noun in b oth R eg victory and R eg triumph whereas in R it is mapp ed onto a

d2 i2 d1

verb eg to defeat Therefore it is R rather than R that is the surface decrement of R However

d2 d1 i2

R only partially overlaps R since its second column do es not corresp ond to any column in R

d2 i2 i2

By systematically listing the surface decrements and conversely the surface increments of each of the

realization patterns resulting from the previous step of the analysis I identied surface decrement

pairs I then considered each of these pairs as an instance of revision The simplest pattern in the pair is

the base pattern of the revision and the most complex pattern in the pair the revised pattern For example

in Fig patterns R and R play the role of base patterns whereas patterns R and R play the role

d1 d2 i1 i2

of revised patterns

For each revision instance there is a set of structural transformations to change the base pattern into

11

the revised pattern Each such set of structural transformations denes a revision to ol In the last step

of this corpus analysis I classied revision to ols in terms of their structural characteristics This last step is

presented briey in Section and in full detail in App endix A

Classifying revision op erations

Classifying the structural dierences b etween surface decrement pairs resulted in a hierarchy of revision op

erations The toplevel of this hierarchy shown in Fig distinguishes b etween monotonic revisions which

are abstracted from fully overlapping decrement pairs and involve only attachment of a new constituent and

nonmonotonic revisions which are abstracted from partially overlapping decrement pairs and also involve

displacement andor deletion of draft constituents Monotonic revisions consist of a single transformation

which preserves the base pattern and adds in a new constituent In nonmonotonic revisions an introductory

transformation breaks up the integrity of the base pattern in adding in new content Subsequent restruc

turing transformations are then necessary to restore grammaticality Monotonic revisions can b e viewed as

elab orations while nonmonotonic revisions require true revision In the next two sections I discuss and give

examples of these two dierent classes of revision to ols

11

The transformations involved in the application of a revision to ol are not related to the notion of transformation as dened

by transformational grammarians In transformational grammars a transformation relates surface forms that are grammatical

paraphrases of a common deep semantic form In contrast a transformation involved in the application of a revision to ol maps

a surface form S realizing a deep semantic form D onto a surface form S realizing a distinct deep semantic form D which

1 1 2 2

contains D plus an additional fact

1 Monotonic Revisions Non−monotonic Revisions

AdjoinAbsorb Conjoin Append RecastAdjunctization NominalizationDemotion Promotion

Figure Revision op eration hierarchy toplevels

Monotonic revision

I identied four main classes of monotonic revisions Adjoin Append Conjoin and Absorb They dier

from each other in terms of either the typ e of the base structure on which they can b e applied or the typ e

of revised structure they pro duce

For example Adjoin applies only to hyp otactic base patterns It consists of the intro duction of an

additional optional constituent A A is thus adjoined to the base constituent under the base head B The

c c h

12

revision schema of an adjoin is shown in Fig

Like most monotonic revision to ols Adjoin is versatile The variety of adjoin revision op erations is

shown in Fig The analyzed corp ora contained cases where the new constituent was added to a nomi

nal abbreviated NP in the revision hierarchy and others where it was added to a clause abbreviated S in

the revision hierarchy When adjoined to a nominal the new constituent could ll the following syntactic

functions partitive classier describ er and qualier For the qualier syntactic function the added con

stituent came in two syntactic forms nonnite clause and relative clause Finally a relative clause could

express a given typ e of additional information equally well when adjoined to dierent draft sub constituents

of the same syntactic category nominal but emb edded at dierent levels in the draft structure Consider

for example adding streak information to the draft phrase

to p ower the Golden State Warriors to a triumph over the Los Angeles Clipp ers

The same revision to ol Adjoin Relative Clause to Nominal can b e applied to the emb edded nominal

referring to the losing team underlined and yielding

to p ower the Golden State Warriors to a triumph over the Los Angeles Clipp ers who lost for

the ninth straight time

12

The pictorial conventions used in this gure and in Fig are the following constituents are circles or ovals structural

relationships are lines the lines corresp onding to role relations either argument or adjunct are lab eled The elements added

by the revision either constituents or relations are b oldfaced

Base Structure Revised Structure

Bh Bh

Adjunct

Bc1 . . . Bcn Bc1 . . . Bcn Ac

Figure Adjoin schema Adjoin

NP−Adjoin S−Adjoin

partitive classifier describer qualifier origin frequency result time co−event

relative−S Non−finite−S PP PP Non−finite−S Finite−S Non−finite−S

top bottom

Figure Adjoin revision op eration hierarchy

Alternatively to the toplevel nominal conveying the game result underlined as a whole and yielding

to p ower the Golden State Warriors to a triumph over Los Angeles that extended the Clip

p ers losing streak to nine games

The rst revision is thus called Adjoin of Relative Clause to BottomLevel Nominal and the second

Adjoin of Relative Clause to TopLevel Nominal

When adjoined to a clause the new constituent could ll the following syntactic functions frequency

result time and coevent with only a single syntactic category for the adjoined constituent in each case

Adjoin was used to add b oth typ es of historical information streaks and records as well as nonhistorical

information

The pair of phrases b elow illustrates how Adjoin can b e used to add a record information to a game

statistic nominal

Base phrase Armon Gilliam scored p oints

Revised phrase Armon Gilliam scored a franchise record p oints

The noun comp ound franchise record corresp onding to A in the schema is simply added as a pre

c

mo difying classier of the p oints nominal corresp onding to B in the schema This is a case of

c

Adjoin of Classifier

The surface decrement pair R R given Fig Section provides an example of Adjoin of

i1 d1

Frequency PP to Clause In this example the purp ose PP realizing an additional streak fact eg for

its third straight win is adjoined to the clausal base structure realizing a basic game result cluster eg

Chicago defeated Phoenix resulting in the revised cluster Chicago defeated Phoenix for its

third straight win

Note how in b oth these examples the linguistic expression of the base phrase content is not aected by

the revision The three other typ es of monotonic revisions Conjoin Append and Absorb are presented in

App endix A

Nonmonotonic revisions

I identied ve main classes of nonmonotonic revisions Nominalization Adjunctization Recast

Argument Demotion and Coordination promotion Each typ e is characterized by a dierent set of restruc

turing transformations which involve displacing base constituents altering the base argument structure or

changing the base lexical head These typ es of transformations also distinguish nonmonotonic revisions

as a whole from monotonic revisions Monotonic revisions conserve b oth the argument structure and the

lexical head of the base and do not involve moving base constituents around Base Structure Revised Structure

Vs Vf

Arg1 Argn Arg1' Argn' Adjunct

Bc1 . . . . . Bcn Bc1 . . . . . Ac Bcn

Figure Adjunctization schema

Nonmonotonic revision to ols tend to b e less versatile than monotonic ones and b e applicable only to

sp ecic typ es of base structure For example Adjunctization applies only to clausal base patterns headed

by a supp ort verb V Following Gross Gross I call support verb any verb that do es not realize

s

any semantic element App earing only b ecause each clause syntactically requires a verb in English its sole

function is to supp ort one of its meaningb earing arguments Supp ort verbs are opp osed to ful l verbs which

do convey a semantic element by themselves A supp ort verb can have an argument set of various length

The supp orted argument can ll various semantic roles but it distinguishes itself from the other arguments

in that it must b e realized by an NP whose head collo cates with the supp ort verb In the corpus sentence

examples of the realization pattern R given in Fig of Section these collo cations are verbob ject

d2

collo cations eg to record win The head verb eg to record do es not realize any semantic

element but only syntactically supp orts its range role the NP eg a win against Dal las whose

head noun egwin realizes the semantic head of the game result fact

The general Adjunctization schema is given in Fig The additional content is realized by a

combination of two new constituents a fullverb V ie a verb that b ears meaning on its own and its new

f

migrates to ob ject A Deprived of the head verb that supp orted it in the base clause original ob ject B

c c

n

an adjunct p osition in the revised clause It has thus b een adjunctized

The pair of phrases b elow illustrates how Adjunctization can b e used to add a streak interruption to a

game result clause

Source phrase the Denver Nuggets claimed a victory over the Dallas Mavericks

Target phrase the Denver Nuggets ended their three game losing streak with a victory

over the Dallas Mavericks

The streak information is intro duced by a new fullverb to end corresp onding to V in the schema

f

and a new ob ject nominal their three game losing streak corresp onding to A in the schema Deprived

c

of its supp ort verb to claim corresp onding to V in the schema the original ob ject nominal a

s

victory over the Dallas Mavericks corresp onding to B in the schema conveying the game result migrates

c

as an instrument adjunct Since the thematic role of the original ob ject in the source phrase was range this

is a case of Adjunctization of Range Argument into Instrument

The surface decrement pair R R of Fig in Section provides another example application

d2 i2

of this revision to ol this time to add a streak extension Note how in b oth examples the supp ort verb of the

base pattern is deleted by the revision

The other typ es of Adjunctization are shown in Fig Cases of Adjunctization are rst character

ized by the target adjunct role of the displaced constituent In the corp ora analyzed there were three such

target roles instrument opposition and destination Constituents moved to the destination role were

all originally lling the created role and all those moved to the opp osition role were originally lling the Adjunctization

>destination >opposition >instrument

created> affected> range> created> location>

Figure Adjunctization revision op eration hierarchy

affected role Those moved to the instrument role were coming from either created range or location

role p ositions

The four other typ es of nonmonotonic revisions Nominalization Recast Argument Demotion and

Coordination promotion are presented in App endix A

Side transformations

The previous subsection presented the core transformations involved in the corpus revisions

The intro ductory transformation whose goal is to attach the additional fact to the base pattern

The restructuring transformations whose goal is to maintain grammaticality

Some revisions however also involve side transformations that satisfy other goals such as avoiding rep

etitions avoiding ambiguities and enforcing collo cation constraints I identied ve main classes of side

transformations Reference Abridging El lipsis Argument Control Scope Marking Lexical Adjustment and

Ordering Adjustment

The most common of these transformations is reference abridging Its goal is to suppress rep etitions

intro duced by the attachment of the additional fact In the corpus initial references use a set of default

prop erties asso ciated with the class of the referred entity in the domain ontology For example initial

references to players use their rst and last name initial references to teams use their name and lo cation

etc However when a revision intro duces a second reference this set of prop erties is distributed b etween

13

these two references The base reference is thus abridged by the revision

The surface decrement pair R R b elow illustrates this phenomenon

d3 i3

R to direct the past the Washington Bullets

b3

R to direct the Los Angeles Lakers past Washington handing the Bullets their ninth straight

r 3

defeat

In the base pattern Rb b oth the name the Bullets and the lo cation Washington of the losing team

are used in the single reference to that team When this cluster is revised the added constituent contains a

second reference to this team In this second reference the team name alone is used As a side transformation

the rst reference is abridged using the team lo cation alone Without such a side transformation the team

name would b e rep eated just three words after b eing intro duced

0

to direct the Los Angeles Lakers past the Washington Bullets handing the Bullets their R

r 3

ninth straight defeat

13

Note that the use of this transformation is based on the assumption that the targeted audience of the rep ort knows ab out

every prop erty in the default set since otherwise it could not establish the coreference link in the revised pattern

The other typ es of side transformations are presented in App endix A

Classifying criteria

Table summarizes the essential characteristics of the revision to ols I identied in the corpus The columns

of this table provide ve essential criteria to classify these to ols

The typ e of base pattern structures on which they can b e applied whether a to ol is applicable only

to hyp otactic bases only to paratactic bases or equally well to b oth

The typ e of revised pattern structures resulting from their application either paratactic or hyp otactic

and in this latter case whether the revised pattern head is the base pattern head or the new constituent

intro duced by the revision

How base pattern constituents are displaced by the revision

How the argument structure of the base pattern is aected by the revision whether it is expanded

shrunk etc

How the lexical head of the base pattern is aected by the revision and in particular whether the

revision involves replacing a supp ort verb by a full verb or viceversa

Each distinct combination of these ve factors denes a main class of revision Each instance of revision

can b e further characterized by taking into account the following factors

14

The linguistic rank of the revision base pattern clause nominal etc

The syntactic category of the constituent added by the revision

The grammatical function lled by this added constituent in the revised cluster

The eight criteria ab ove classify revisions in terms of core transformations For revisions that also involve

side transformations the various typ es of side transformations constitute an additional classifying criterion

The resulting hierarchy whose toplevel was given in Fig and whose adjoin and Adjunctization

subtrees where resp ectively given in Fig and Fig is fully presented in App endix A

Constraints on revision to ols application

Table summarizes ve dierent constraints on the application of each of the ten basic revision to ols The

rst column sp ecies the typ es of side transformations that may accompany their application and the second

column the linguistic ranks of the base patterns onto which they are applied The nal three columns sp ecify

the typ es of facts they can add streak records andor b oxscore facts ie nonhistorical information

Table also contains o ccurrences of each of these constraints in the corpus of surface decrement pairs

For example concerning the monotonic revision to ol Adjoin presented in Section table indicates

that there were total o ccurrences of its usage in the corpus breaking down into o ccurrences at the

nominalrank and at the clauserank It also notes that among the clauserank o ccurrences eight were

used to convey a streak one to convey a record and the seven remaining to convey a nonhistorical statistic

Finally it indicates that o ccurrences of Adjoin involved a side transformation reference adjustments

and two ordering adjustments

Summary

The corpus analysis describ ed in this section has fullled ve functions

14

The constraints on this rank for each basic to ol typ e are given in table

base revised sub constituent argument head

structure structure displacement structure

adjoin hyp otaxis hyp otaxis none untouched untouched

head base const

app end parataxis parataxis none untouched untouched

conjoin any parataxis none demoted demoted

absorb any hyp otaxis none demoted demoted

head added const

recast hyp otaxis hyp otaxis argument untouched untouched

head base adjunct

argument hyp otaxis hyp otaxis toplevel arg shrunk untouched

demotion head base const emb edded const

nominalization hyp otaxis hyp otaxis none expanded fV sV

head added const

adjunctization hyp otaxis hyp otaxis argument changed changed

head added const adjunct

co ordination hyp otaxis w parataxis multiple demoted demoted

promotion paratactic arg

Table Comparative structural characterization of the revision to ols

side scop e streak records non

transformations ranks info info histo

adjoin reference adjust clause

ordering adjust nominal

app end ellipsis clause

nominal

conjoin reference adjust clause

scop e marking NP co ordination

lexical adjust NP app osition

ellipsis

absorb argument control clause

nominal

recast none clause

nominal

argument none clause

demotion

nominalization none clause

adjunctization reference adjust clause

co ordination none clause

promotion

Table Constraints on revision to ols usage

Identify the ontology of the basketball rep ort domain

Identify the sentencerank schemata used in the rep ort leads

Identify realization patterns determining which syntactic structures can b e used to express each concept

combination allowed by these schemata

Identify revision to ols to incrementally build complex realization patterns from basic ones

Identify constraints on the applicability of these revision to ols

The corpus analysis thus provided the data for all the knowledge sources of the prototyp e generator

streak It has also dened a target output which is useful to evaluate the success of streak Finally

it has conrmed the plausibility of the revision approach by showing the highly comp ositional and regular

structure of the complex sentences found in the corpus

Chapter

A new generation architecture

In the intro duction of this thesis I prop osed a new draft and revision approach to language generation

addressing the dicult issues raised by summarization applications Implementing this new approach requires

dening a new generation system architecture The ob ject of the present chapter is to motivate and present

this architecture in detail

Language generation is a very complex pro cess involving a large and heterogeneous set of tasks The

architecture of a generation system sp ecies

The decomp osition of the system into comp onents each resp onsible for a sp ecic set of subtasks

The knowledge sources accessible to each comp onent

The typ e of internal representations exchanged among comp onents

The thread of control b etween comp onents

The architecture presented in this chapter is for a complete rep ort summary generation system pro ducing

multisentential text from raw quantitative data As much as p ossible it is presented indep endently of any

sp ecic implementation One sp ecic implementation of the most original parts of this architecture the

system streak is presented in the next two chapters

No standard terminology has emerged for the various subtasks of text generation Dierent researchers

have used the same terms to mean dierent things in the literature This has obscured the issues To avoid

this pitfall this chapter starts by dening a terminology for discussing architectural issues that is then used

throughout the thesis I then motivate the new architecture I prop ose in two steps I start from a set

of observations resulting from the corpus analysis of humanwritten summaries presented in the previous

chapter From these observations I identify a set of abilities needed by a generator to pro duce similar texts

I then derive a set of design principles which provide a generator with these abilities Finally I present in

detail the prop osed architecture based on these principles explicitly stating where and how each generation

subtask dened gets carried out

Text generation subtasks

One problem in discussing architectural issues in generation is the absence of a precise and standard termi

nology the same terms have b een used to describ e dierent notions by dierent authors in the literature

In this subsection I briey dene a vo cabulary that I will then use throughout the pap er

Text generation is traditionally decomp osed in three subtasks

Content determination answering the question what to say

Content organization answering the question whenwhere to say what

Content realization answering the question how to say it

Together the rst two subtasks are also generally referred to as content planning deep generation or

strategical generation The third subtask has b een alternatively called language synthesis surface realization

surface generation or tactical generation Useful as they are these threefold or twofold decomp ositions are

to o coarse grained Each of these three tasks is in itself very complex To b e able to discuss how these

twothree subtasks are carried out and compare what they recover in dierent generation architectures it

is necessary to make additional distinctions that further decomp ose the text generation pro cess

Content determination can b e itself decomp osed into content production and content selection Although

in many cases an underlying application provides the generator with all the p otential content in other cases

the generators input is but one part of that content The rest of the content has to b e pro duced by the

generator itself This is what Hovy calls interpretation in generation Hovy the generator needs to

enrich its input with more content In the extreme case the input consists only of communicative goals and it

is the generators task to pro duce all content from the knowledge sources it can access Another case requiring

content pro duction by the generator is when its input format is totally inadequate for content planning The

generator must rst pro duce content of an acceptable format b efore b eing able to start planning This is

the case for the typ e of generation application discussed in this thesis The raw quantitative data input

to the generator in tabular form must rst b e converted to a symb olic form which captures conceptual

generalizations and on which domain reasoning can b e p erformed This task of content production subtly

diers from content selection which consists of deciding which part of the pro duced content is to b e included

in the generated text

What each generation subtask covers dep ends on the level of microcoding at which generation is p erformed

A macroco ded generator like ana Kukich pro duces sentences by assembling entire clauses and group

patterns stored as a whole in its phrasal lexicon In contrast a microco ded generator like epicure Dale

1

dynamically builds from a wordbased lexicon every sentence constituent down to the group rank The

level of micro co ding is dened by the minimal linguistic rank of the entries stored in the generators lexicon

It is a crucial architectural characteristic b ecause apart from content realization it also has rep ercussions on

content selection and content organization Both tasks signicantly dier in nature dep ending on the rank

of the linguistic unit b eing planned text paragraph sentence clause or group In particular macro co ded

generators are not concerned with content selection and organization at the clause and group ranks For these

systems everything b elow the sentence level is a realization matter In contrast for micro co ded generators

the distinction b etween content selection content organization and content realization is relevant even at

the clause and group ranks Thus while for Ana generating The industrial average instead of The Dow

Jones average of industrials is a realization decision for epicure generating the ripe banana instead

of the banana results from content selection and organization choices Generating the large shrimp and

not the big prawn would b e a genuine realization choice for epicure In the rest of the pap er I therefore

distinguish b etween discourse level content selection and organization at the text and paragraph ranks and

phrase level content selection and organization at the sentence clause and group ranks One of the most

innovative characteristics of the generation mo del prop osed in this thesis is that it allows microlevel content

selection and organization to b e p erformed under surface form constraints

Content realization can b e decomp osed into the four following subtasks

Lexicalization the expression of semantic content by choice of op enclass lexical items

Semantic grammaticalization the expression of semantic content by choice of grammatical category

eg clause vs NP and grammatical features eg tense for clauses deniteness for NPs

Morphosyntactic grammaticalization the enforcement of syntax by choice of closedclass lexical items

choice of op enclass lexical item inections sp ecication of constituent precedence constraints etc

Linearization the sp elling out of the inected lexical items following the precedence constraints

It is during the rst two of these subtasks that the mapping b etween content units and linguistic units

o ccurs everything preceding them is purely conceptual manipulation while everything following them purely

1

ie a lexicon whose entries are individual words

syntactic manipulation Because they bridge the gap b etween conceptual and syntactic pro cessing these two

tasks have b een considered part of content planning by some authors while part of content realization by

others I hold this second view

The overall text generation task can thus b e decomp osed as follows

Content determination

Content pro duction

Content selection

Discourse level content selection

Phrase level content selection

Content organization

Discourse level content organization

Phrase level content organization

Content realization

Lexicalization

Grammaticalization

Semantic grammaticalization

Morphosyntactic grammaticalization

Linearization

The present thesis fo cuses on the following generation subtasks phrase level content selection phrase

level content organization and all four content realization subtasks

Motivation for a new architecture

In the intro duction of this thesis I presented the dicult issues that summarization applications raise for

a language generator conciseness sentence complexity oating concepts historical background and para

phrasing p ower In this subsection I prop ose four principles for designing a generation system that handles

these dicult issues incremental draft and revision approach micro co ding from a wordbased lex

icon presence in the draft representation of a purely conceptual layer indep endent of linguistic form

and presence in the draft representation of a surface layer reecting realization choices These principles

are essentially motivated by a set of corpus observations on the way historical information is conveyed in

humangenerated rep orts From these observations I deduce a set of abilities that a generation system must

have in order to concisely and exibly convey historical information These abilities can b e provided by the

four design principles ab ove

From corpus observations to needed abilities

The corpus observations which motivate the need for a new generation architecture are the following

Historical information is combined with new information in sentences of a greater complexity than

those pro duced by existing generation systems

Historical information of the same typ e is conveyed by a wide range of dierent syntactic constructs

Historical information of the same typ e is conveyed at dierent linguistic ranks

Historical information of the same typ e is scattered in distant lo cations inside a rep ort

Variety of syntactic constructs to convey the same additional historical fact

Clausecomplex coordinative conjoin

David Robinson scored p oints Friday night lifting the San Antonio Spurs to a

victory over Denver and handing the Nuggets their seventh straight loss

Adjoin of nonfinite clause in toplevel nominal

David Robinson scored p oints Friday night lifting the San Antonio Spurs to a victory

over Denver sending the Nuggets to their seventh straight loss

Adjoin of relative clause in toplevel nominal

David Robinson scored p oints Friday night lifting the San Antonio Spurs to a

victory over Denver that extended the Nuggets losing streak to seven games

Adjoin of relative clause in embedded nominal

David Robinson scored p oints Friday night lifting the San Antonio Spurs to a victory

over the Denver Nuggets who lost for the seventh consecutive time

Toplevel nominal appositive conjoin

David Robinson scored p oints Friday night lifting the San Antonio Spurs to a victory

over Denver the Nuggets seventh straight defeat

Embedded nominal appositive conjoin

David Robinson scored p oints Friday night lifting the San Antonio Spurs to a victory

over the Denver Nuggets losers of seven in a row

Figure Syntactic diversity of historical information

Taking into account historical information triggers a combinatorial explosion of the numb er of relevant

facts to consider for inclusion in the rep ort

Assuming a draft and revision approach the applicability of revision to ols for adding an historical fact

on a given draft sentence is constrained by the surface form of that sentence

In the following paragraphs I review in turn each of these observations and their consequences in term

of desired abilities for a rep ort generator conveying historical information

The rst three of these observations have already b een discussed in the intro duction of this thesis The

rst observation the syntactic complexity asso ciated with historical information was summarized in the table

of Section comparing corpus sentences containing historical information with sentences pro duced by

existing generators in terms of numb er of words and facts This table showed that the ability to convey

historical information entails the ability to generate very complex sentences

The second observation the syntactic variety asso ciated with historical information is illustrated in

Fig rst presented in the intro duction on p and duplicated here This gure contains six corpus

paraphrases containing an historical streak fact emphasized by a b oldface font In each of these paraphrases

the same historical fact is attached to the same basic sentence by a dierent syntactic construct The ability to

exibly convey historical information thus entails the ability to generate a wide variety of syntactic constructs

below the sentence rank

The third and fourth observations are based on the notion of oating semantic element A fact to

convey in a rep ort is a oating semantic element if it can alternatively b e realized at various levels inside

the rep ort structure Floating semantic elements are opp osed to xed semantic elements which are always

realized at the same given level In Section I noted that in the corpus rep orts the main statistic and

the game result were systematically realized by the two toplevel clauses of the rst sentence These two

facts are thus examples of xed semantic elements There are two asp ects to the phenomenon of oating

semantic element an intrasentential asp ect and a intersentential asp ect Intrasentential oating refers

to the prop erty of a fact to b e alternatively realizable at dierent linguistic ranks inside a given rep ort

sentence Intersentential oating refers to the prop erty of a fact to b e alternatively realizable in dierent

and p erhaps distant rep ort sentences In the corpus historical information displayed b oth characteristics of

2

oating information

The six paraphrases of Fig illustrate intrasentential oating of an historical fact In these para

phrases the same historical fact is realized at dierent linguistic ranks clausecomplex in clause in

and nominal in and This example shows that the ability to exibly convey historical

information entails the ability to realize oating semantic elements at dierent linguistic ranks

The corpus rep ort of Fig shows how an historical fact can oat across the entire rep ort structure

Consider where the streak fact emphasized by a b oldface font ab out each team is conveyed Denvers

streak is attached to the game result in the rst sentence of the rep ort In contrast Sacramentos streak is

attached to the scoring statistic of one of its players in the last paragraph It is essential to note that the

distance separating the two streak facts in this rep ort cannot b e explained on semantic or syntactic grounds

The only semantic constraint on the attachment of a teams streak to a draft sentence is that this sentence

contains a reference to that team The rst two sentences of the rep ort contain a reference to Sacramento

emphasized by a smallcap font Semantically they are thus as valid a lo cation for Sacramentos streak as

the last paragraphs rst sentence Syntactically they are also valid lo cations since the relative clause which

realizes Sacramentos streak in the last paragraph could also have b een added as mo dier of either reference

as shown by sentences S and S b elow

1 2

S Sacramento Ca Michael Adams scored a careerhigh p oints Wednesday night including seven

1

p oint baskets to help the shorthanded Denver Nuggets end a vegame losing streak with a victory

over the Sacramento Kings who lost their fourth straight

S Adams who was drafted and then discarded by the Kings who lost their fourth straight four

2

seasons ago made of eld goals including seven of p oint attempts and hit three of four free

throws to break his previous career high of p oints

The preference of the rst sentence of the last paragraph over S can b e explained on discursive grounds

1

Recall from Section that b ecause the corpus rep orts follow an inverted pyramid structure with summary

lead only crucial facts go in the rst sentence Sacramentos streak information was probably not imp ortant

enough to have b een incorp orated in the rst sentence As for the preference of the rst sentence of the last

paragraph over S it can b e explained on stylistic grounds The emb edding of the relative clause realizing

2

the streak inside another relative clause makes S stylistically awkward The example rep ort of Fig

2

thus shows that cho osing b etween alternative rep ort structure lo cations for realizing oating historical facts

requires taking into account not only semantic and syntactic factors but also discursive and stylistic ones

The ability to exibly convey historical information therefore entails the ability to realize oating semantic

elements under a combination of semantic syntactic stylistic and discursive constraints

The fth observation concerning historical information in rep ort generation is that it triggers a combina

torial explosion of the numb er of relevant facts This explosion is due to the fact that the historical context

multiplies the dimensions along which to evaluate the signicance of each input statistic Consider for ex

ample a players scoring statistic in a basketball game Ignoring the historical context there are basically

only two ways in which this statistic can b e signicant when compared to the scoring statistics of all the

other players in the game gamehigh and when compared to the scoring statistics of his teammates only

teamhigh However such a statistic can b e historical ly signicant in a combinatorially explosive numb er

of ways when compared to that players scoring average or high or low over the season or over his entire

career or since he joined his current team etc when compared to the highest scoring p erformance of any

2

The fact that in the corpus historical semantic elements of given typ e are co erced to app ear in one cluster of the lead

sentence and not the other do es not invalidate the oating nature of these elements The clusters internal complexity allows

these elements to app ear at dierent linguistic ranks inside a given cluster

Sacramento Ca Michael Adams scored a careerhigh p oints Wednesday night including seven p oint

baskets to help the shorthanded Denver Nuggets end a vegame losing streak with a victory

over the Sacramento Kings

Adams who was drafted and then discarded by the Kings four seasons ago made of eld goals

including seven of p oint attempts and hit three of four free throws to break his previous career high of

p oints Adams also dished out a gamehigh assists and had ve steals

Ro okie Chris Jackson added p oints and pump ed in and grabb ed reb ounds

for Denver which outscored Sacramento during the nal of the fourth quarter

The Nuggets who had only eight players available for the game improved to on the road and

overall

Ro okie guard Travis Mays playing his second game after missing with back spasms scored a season

high p oints for the Kings who lost their fourth straight Lionel Simmons added p oints and

reb ounds

Figure Streak information in a corpus rep ort

player on his team or on any team in a given division conference etc this season or over the last games

or ever while playing at home or on the road against a particular opp onent or against any opp onent

etc Because of the sheer numb er of historical facts any such fact is likely to b e of similar relevance as

many others Therefore content pro duction results in a much larger set of candidate facts with many more

shades of relative imp ortance This makes content selection much harder General rules based on purely

encyclop edic grounds eg if a player scores more than p oints or if he scores more than anyb o dy in his

team then include his scoring in the rep ort no longer suce to decide which facts to select Other factors

need to b e considered One such factor is whether the realization patterns of the candidate facts can b e

combined in cohesive and stylistically felicitous surface forms The ability to convey historical information

thus entails the ability to perform nal content selection under surface form constraints For example to

decide not to include a complementary fact b ecause incorp orating it would require either generating to o

complex a sentence or to o long a summary

The sixth and last corpus observation concerning historical information is that which typ e of revision

can b e applied to a rep ort draft to add a given historical fact is constrained by the surface form of that

draft This observation has already b een made in Section The rst column of table in Section

indicates that the applicability of most revision to ols is restricted to base clusters with sp ecic syntactic

structures Consider again the basic corpus sentence example given in the intro duction

David Robinson scored p oints Friday night lifting the San Antonio Spurs to a victory over

the Denver Nuggets

The dierent revisions of this base sentence to add the same additional streak fact shown in Fig all

3

result from using two revision to ols conjoin and adjoin Other revision to ols cannot b e used with due

4

to its surface form For example adjunctization which requires the game cluster of the base sentence to b e

headed by a supp ort verb cannot b e used with which is headed by the full verb to lift Adjunctization

can however b e used to add the same streak information on the base sentence synonymous with

yielding synonymous with the sentences of Fig

David Robinson scored p oints Friday night and the San Antonio Spurs rolled to a victory

over the Denver Nuggets

David Robinson scored p oints Friday night and the San Antonio Spurs extended Denvers losing

streak to seven games with a victory over the Denver Nuggets

This example shows that in some cases the base sentence syntactic structure can b e the only discrimi

3

The variety comes from dierent application of these to ols ie at dierent linguistic ranks and in dierent syntactic forms

in each paraphrase

4

Presented in Section

natory factor justifying the choice of one revision to ol over another

The conclusion drawn from the corpus observations listed in the b eginning of this subsection is that the

ability to convey historical information entails the following abilities

Generating very complex sentences

Generating a wide variety of syntactic constructs b elow the sentence rank

Performing nal content selection under surface form constraints

Realizing oating semantic elements at dierent linguistic ranks

Realizing oating semantic elements under a combination of semantic syntactic discursive and stylistic

constraints

From needed abilities to design principles

What architecture design principles can we deduce from the agenda of providing a rep ort generator with the

ve abilities needed for exibly convey historical information identied in the previous section

In the intro duction of this thesis I have already explained that generating a wide variety of syntactic

constructs b elow the sentence rank requires micro co ding sentences down to individual words and collo cations

made of a very few words From an engineering p ersp ective I have also suggested that the need to pro duce

very complex sentences through micro co ding calls for a twopass draft and revision generation mo del

The cognitive research literature also supp orts the hyp othesis that to generate very complex sentences

conveying many facts revision is needed Pavard describ es an exp eriment providing psychological

evidence supp orting the revision mo del for the generation of very complex sentences In this exp eriment

human sub jects were asked to write a single sentence paraphrasing a text of three sentences which together

conveyed eight facts in words They were thus asked to generate a sentence of complexity similar to the

newswire rep ort leads analyzed in Section of this pap er The sub jects were divided into two groups In

one group the sub jects p erformed the exercise using a microphone ie a medium that precludes revision

while in the other they p erformed the exercise using a text editor ie a medium that facilitates revision

For each group two measurements where made

Average p ercentage of semantic elements omitted in the singlesentence paraphrases among those

realized by head and argument constituents in the original text

Average p ercentage of semantic elements omitted in the singlesentence paraphrases among those

realized by adjunct constituents in the original text

The results of this exp eriment are summarized in the table b elow

media omitted heads and arguments omitted adjuncts

microphone

text editor

These results strongly suggest that even humans have trouble generating sentences of such complexity

in one shot They also show that the trouble do es not so much arise for xed semantic elements which

essentially corresp ond to the elements realized by the head and argument constituents in the original text

but sp ecically for oating semantic elements which essentially corresp ond to the elements realized by

adjunct constituents in the original text They thus supp ort the cognitive plausibility of the generation

mo del prop osed here where oating elements are added by revising a draft built around xed elements

I have so far showed that the need for b oth syntactic variety b elow the sentence rank and syntactic

complexity suggests two principles for a new rep ort generation architecture incremental informationadding

revision and micro co ding from a wordbased lexicon Within this framework the three remaining abilities

identied in the previous subsection ie constraining nal content selection by surface form factors realizing

oating semantic elements at dierent linguistic ranks and under a combination of semantic syntactic

stylistic and discursive constraints supp ort two additional principles concerning the draft representation on

which to p erform revision that it include a surface layer reecting realization choices and that it

include a purely conceptual layer indep endent of linguistic form

In the draft and revision generation mo del I prop ose nal content selection o ccurs during revision In

particular it is at that stage that the nal decision to include a particular historical fact in the rep ort is

made Within this framework the ability to constrain nal content selection on surface form factors requires

the use of incremental revision to p erform not only phrase planning but also lexicalization and semantic

grammaticalization The draft representation must therefore include a layer sp ecifying the realization ie

the op enclass lexical items and syntactic form of each fact This need is also supp orted by the corpus

observation that revision to ol applicability is constrained by the surface form of the draft

The need to handle intersentential oating of semantic elements provides an additional justication for

b oth

The draft and revision approach sketched in the intro duction of this thesis and elab orated in the next

section

The representation of realization choices in the draft

At revision time a generator has access to the draft of the entire rep ort If surface form is represented in

the draft the generator can then cho ose where to realize a oating semantic element under a combination of

semantic syntactic stylistic and discursive constraints In particular it can cho ose b etween two candidate

lo cations for realizing a oating semantic element by comparing the resp ective stylistic impacts of attaching

the oating element to either of them even if these lo cations are distant in the text structure This would

b e imp ossible for a onepass architecture where realization is p erformed linearly on a sentencep erse ntenc e

basis With such an architecture when the current sentence is one of the p otential lo cations for realizing a

oating semantic element there is no way to measure cho osing this sentence versus other p otential lo cations

that may lie ahead in the text structure but for which no surface form is yet available

The need to handle intrasentential oating of semantic elements requires that the draft representation

also include a purely conceptual layer that totally abstracts from linguistic form The rst constraint on the

attachment of a oating semantic element onto a draft sentence constituent is semantic the attachment must

b e warranted by some semantic relationship holding b etween the oating element and some base semantic

element realized by the draft constituent The same base element can b e realized at dierent linguistic ranks

in dierent sentences For example the base semantic element realized at the clause rank in phrase X b elow

is realized at the nominal rank in phrase Y

X John Stockton scored points

Y John Stocktons points

Despite their syntactic dierences the same revision can b e applied on b oth these phrases to incorp o

rate a oating semantic element of typ e extrema as shown by X and Y b elow

X John Stockton scored a seasonhigh points

Y John Stocktons seasonhigh points

The revision comp onent can determine that this revision is applicable to b oth these phrases only if the

draft representation captures the meaning identity b etween X and Y This identity can b e captured only

by a purely conceptual representation that totally abstracts from linguistic form A linguistically motivated

semantic representation such as penmans Upp erMo del Bateman et al would view X as a mate

rial action and Y as a p ossessed ob ject two radically dierent semantic categories It would thus fail to

capture their identity of meaning This contrast b etween conceptual and linguistic semantic representations

is further discussed in Section

In this subsection I have shown that providing a rep ort generator with the ve abilities it needs to exibly

convey historical information suggest that its architectural design must reect the four following principles

Incremental informationadding revision

Micro co ding from a wordbased lexicon

Presence in the draft representation of a purely conceptual layer indep endent of linguistic form

Presence in the draft representation of a surface layer reecting realization choices

In the next subsection I present in detail a rep ort generation architecture based on these four principles

A revisionbased architecture for incremental generation

In this subsection I prop ose a new generation architecture based on the four principles dened in the

previous section This new architecture is a general design for any rep ort generator handling historical

5

information in a quantitative domain and it encompasses the full range of rep ort generation subtasks This

thesis fo cuses on three of these subtasks phrase level content selection phrase level content organization

and content realization These three subtasks have b een implemented in the summary generation system

streak Chapter where I present this implementation contains the full details ab out these three tasks

Issues related to generation subtasks whose implementation lay b eyond the scop e of the research presented

in this thesis namely content pro duction and discourse level planning and organization are discussed only

at a high level and in general terms

I start by describing the various levels of representation of an utterance in the architecture I then

describ e the various pro cessing comp onents of the architecture and how they interact to build a rst draft of

the rep ort and then revise it Exactly where each of the generation subtasks dened in Section is carried

out in this architecture is further discussed in the subsequent section

Internal utterance representation

In the architecture I prop ose an utterance is represented at three dierent levels of abstraction

the Deep Semantic Sp ecication DSS

the Surface Semantic Sp ecication SSS

the Deep Grammatical Sp ecication DGS

The architecture thus b elongs to the straticational tradition of computational linguistics cf Dale

and Polguere for other generation systems using a straticational utterance representation scheme

with multiple representation layers each capturing a sp ecic set of regularities The general straticational

scheme I prop ose is sketched in Fig It is based on two assumptions concerning the generation system

The rst is the presence of an interface linking the generator to the underlying application program This

interface p erforms content pro duction In the case of summary rep ort generation from quantitative data

the interface is a fact generator that retrieves interesting data from tables of numb ers and reformats them

as conceptual structures suitable for text generation For other applications this interface may query a

database the trace of an exp ert system an interlingua representation of a text to translate etc The second

assumption is that the generator do es not p erform lowlevel syntactic pro cessing on its own but instead relies

on a standalone p ortable syntactic grammar of the target natural language such as surge Elhadad b

nigel Mann and Matthiessen or mumble Meteer et al for English The three layers dene a

pip eline of internal representations that bridge the gap from the application program interface that p erforms

domainsp ecic languageindep endent conceptual pro cessing to the syntactic grammar comp onent that

p erforms domainindep endent languagesp ecic linguistic pro cessing I describ e and exemplify each layer

in turn in the following subsections

5

These subtasks were dened in Section Application Program Interface

Conceptual Deep Semantic Specification (DSS) Flat Conceptual Network Language−Independent Semantics Syntagmatic Paraphrasing

Surface Semantic Specification (SSS) Structured Semantic Tree Domain−Independent

Paradigmatic Paraphrasing & Grammatical Paraphrasing

Linguistic Deep Grammatical Specification (DGS)

Lexicalized Syntactic Tree Syntax

Syntactic Grammar

Figure Internal representation layers

The Deep Semantic Sp ecication

The Deep Semantic Sp ecication is a at conceptual network It is a partial description of an event or an

ob ject of the underlying application domain like the kind that can b e found in knowledge bases using a

relational representation formalism The no des of the network are concepts and the arcs are role relations

among them An example DSS is given in Fig It represents the result of a basketball game b etween

two teams the Orlando Magic and the Toronto Raptors sp ecifying that the game was won b oth by and in

Orlando In such a DSS network a concept is represented by a rectangle with its name prexed by c and

role relation is represented by an oval with its name prexed by r

This example network illustrates the two key prop erties of a DSS network

It explicitly contains redundant information that can b e deduced by domain reasoning from a minimal

description For example in dss b oth the rwinner and rloser relations are present even though

one could b e deduced from the other by the common sense knowledge that team sp orts involve two

teams and that if one wins a game then the other necessarily loses it Similarly the relation rbeat

b etween b oth teams is also redundant with rwinner and rloser

It is at without any notion of head constituent up or down Any no de or arc in the network can

p otentially serve as head for a linguistic expression of the facts represented by the network

These two prop erties insure that the DSS is totally uncommitted to any particular linguistic realization c−game

r−winner r−host r−visitor r−loser

c−team r−beat c−team

r−home r−franchise r−home r−franchise

Orlando Magic Toronto Raptors

Figure An example DSS dss

This makes the DSS an interface that cleanly separates conceptual domain reasoning from linguistic pro cess

ing the generator comp onents upstream from the DSS are only concerned with domain reasoning and need

not worry ab out linguistic considerations and conversely the generator comp onents downstream of the DSS

are only concerned with linguistic pro cessing and need not p erform any domain reasoning

Each redundant relation in the DSS captures a dierent perspective on the overall event represented by the

DSS Dierent paraphrases in the domain sublanguage describ e the same event from dierent p ersp ectives

Consider for example the following paraphrases describing the content represented by dss

A game in which the Orlando Magic defeated the Toronto Raptors at home describ es the event from

the p ersp ective of the rbeat relation which is lexicalized by the verb to defeat

The Orlando Magic won their home game against Toronto Raptors describ es the same event from

the p ersp ective of the rwinner relation which is lexicalized by the verb to win

The Toronto Raptors lost the road game at the hand of the Orlando Magic also describ es the same

event but from the p ersp ective of the rloser relation which is lexicalized by the verb to lose

Note that in each paraphrase a dierent concept is put in fo cus by app earing rst in the sentence This

illustrates that such paraphrasing p ower is not only needed for the sake of variety always using the same

linguistic form immediately b etrays the articial nature of a generated text but also to satisfy discourse

constraints such as fo cus shift rules McKeown insuring that the sentence coherently inserts itself in

the overall generated text If only rwinner was present in the DSS either it would b e imp ossible to

put either the game itself or the losing team in fo cus as in sentences and ab ove or the comp onents

downstream would themselves need to infer rbeat and rloser from rwinner With this last option the

DSS would then fail to circumscrib e domain reasoning to the application program interface

This interface is thus resp onsible for providing in the DSS all the asp ects of an entity that are describ ed

in any one of the domain sublanguage expressions referring to this typ e of entity The linguistic comp onents

downstream of the DSS then decide which asp ects to include explicitly in a given expression of the DSS

content and which to leave implicit This issue of implicit vs explicit content realization is further discussed

for example DSSs in the next section It is in itself a vast issue that has b een the ob ject of several dissertations

In this thesis I consider it only from the p ersp ective of generating paraphrases in the context of written

rep ort pro duction for restricted quantitative domains where the readership is assumed uniform Therefore

I do not discuss its relation to issues such as usermo deling Paris conversational implicatures Reiter

or argumentation Elhadad b For application domains where these issues are crucial the DSS

could not b e assumed to include all the relevant asp ects of an entity to describ e Instead it would need

to consist only of a default description sucient for the most common situations For the other situations

the linguistic comp onent would need the ability to request more information to the application program

interface whenever such information is needed to cho ose b etween a particular set of linguistic options This

issue of enriching the DSS on demand during linguistic realization and the implementation of a facility for

such pro cess within the generation framework of functional unication is discussed in detail in Elhadad and

Robin During the implementation of the generator streak I discovered that for the particular sp orts

game summarization application of this system this facility was not really needed and thus did not make

use of it Nevertheless I remain convinced that in other domains such a facility could b e essential

The approach to mo del the three paraphrases ab ove taken in this thesis is to consider them as primarily

resulting from dierent conceptual perspective choices They do constrain lexical choices but are nonetheless

kept separate Persp ective is chosen while mapping the DSS into an SSS Only when mapping this SSS into an

DGS do es a sp ecic word get picked among those whose argument structure is compatible with the chosen

p ersp ective This approach also separates p ersp ective choice from the domain reasoning necessary to sp ecify

the range of p ersp ectives this is done while building the DSS This separation makes the implementation of

each task much easier and also increases the p otential for p ortability of each knowledge source An alternative

way to lo ok at the three paraphrases ab ove is to consider them as primarily resulting from dierent lexical

choices in the case at hand cho osing among the verbs to defeat to win and to lose In terms of

generator design such a view obviates the need for redundant information in the DSS The three dierent

words would then b e considered alternative lexicalizations of a unique concept for example rloser In

order to generate sentence ab ove this approach would require the lexicalizer to simultaneously

Reason that the host of a game whose visitor is also the loser must b e the winner of that game

Cho ose rloser as the head concept of the sentence but at the same time fo cus on the winning team

Cho ose to win as opp osed to to crush to rout to whip and the hundred or so verbs which

allow to realize rloser while fo cusing on the winning team

This alternative view thus has the disadvantage of burdening the lexicalizer comp onent with domain

reasoning and syntagmatic choices in addition to the paradigmatic choices which are its sole resp onsibility

in the architecture prop osed here These paradigmatic choices can b e subtle For example verbs such as to

crush or to whip can b e used only for a restricted set of nal scores Dealing with this issue separately

from p ersp ective and syntagmatic choices seems the b est option

The atness of the DSS is just as imp ortant as its p otential redundancy Note for example the contrast

b etween sentences and in the ab ove paraphrase The rst is an NP whose head constituent realizes

the cgame concept of dss while the second is a clause whose head constituent realized the rwinner

relation of dss In many generators the linguistic comp onents are presented with treestructured input in

which the semantic element that will constitute the head of the linguistic expression is already implicitly

chosen Typically this input consists of a thematic or case role structure describing an action Because

these thematic roles are linguistic abstractions there are constraints on which syntactic category can ll

them With such an input the linguistic comp onents therefore have very restricted freedom for cho osing

the p osition and category of each semantic element in the syntactic structure They generally map the top

level action into a clause following the input thematic structure and recursively map each role ller to an

NP If the DSS was already structured as a thematic role structure either each typ e of entity would

always b e describ ed by the same thematic structure and syntactic category thus considerably limiting the

paraphrasing p ower of the generator or the choice of thematic role structure would fall back to the

application program interface With this last option the DSS would then fail to shield this interface from

linguistic considerations

In short the DSS must b e a at partially redundant conceptual network whose function is to abstract

from linguistic form by capturing the common recoverable meaning shared the all synonymous phrases in a

given sublanguage

The Surface Semantic Sp ecication

The Surface Semantic Sp ecication is a semantic tree It still purely semantic in the sense that it do es

not sp ecify any particular syntactic category or lexical item for the utterance to generate However its

tree structure captures the organization of content inside a given class of linguistic structure It sp ecies

how semantic elements are to b e group ed together inside linguistic constituents and which element should

head the group In addition to sp ecifying constituency and dep endency an SSS tree also diers from a

DSS network in that it contains only those semantic elements to realize explicitly in the linguistic structure

Finally in contrast to the DSS that is domainsp ecic and language indep endent the SSS is linguistically

6

motivated and domainindep endent

The arcs of an SSS tree are general linguistically motivated rhetorical relations and thematic roles There

are two dierent typ es of no des in an SSS tree

Encyclop edic no des realizing a concept or a relation of the corresp onding DSS viewed as a memb er of

a particular ontological class

Rhetorical no des which do not realize any particular element of the corresp onding DSS but which in

stead function as an aggregation medium to group several such elements within a linguistic constituent

The top level ontological classes used as p ersp ective through which to present a semantic element are

event individual set place time quality and quantity They act as preselectors of the syntactic

category that will ultimately b e used to express the semantic element By default when an element is viewed

as an event it will b e expressed by a clause when viewed as a individual or a set it will b e expressed by an

NP when viewed by a place or time by a PP when viewed as a quality as an adjective or adverb etc

There are three typ es of rhetorical aggregates

Hypotactic complexes with a head and dep endents all realizing some semantic element in the corre

sp onding DSS but with these elements not coming from a neatly delimited subnetwork in the DSS

Complex clauses with adverbial adjuncts are all represented by hyp otactic complex at the SSS layer

Paratactic complexes without a head but in which all elements share the same structural status

Conjunctions and app ositions are represented by paratactic complexes at the SSS layer

Rhetorical event structures with a head which do es not in itself realize any element in the corresp onding

DSS but only aggregates sub constituents who do Clauses headed by supp ort verbs are all represented

by such rhetorical events at the SSS layer

Figures and show three examples of SSS trees Each tree corresp onds to a dierent class

of syntagmatic paraphrases of the content represented by dss In these gures encyclop edic no des are

prexed by E and rhetorical ones by R as in the DSS atomic elements have no prex

sss is a linguistic structure plan where the rbeat concept is chosen as the head with an event p er

sp ective It views the situation describ ed in dss as the wining team b eing an agent whose action aects

the losing team and o ccurs in a lo cation where the winning team is host It also indicates that the rhome

prop erty of each team must used to refer it The elements of dss that are explicitly realized in sss are

indicated by a b oldface font in Fig sss represents synonym phrases such as

Orlando defeating Toronto at home

Orlando triumphed over Toronto in its building

In contrast in sss it is the concept rwinner that is chosen as the head and with an individual p er

sp ective It views the same situation describ ed in dss as an asset of the winning team obtained at the

exp ense of the losing team The lo cation where this ob ject was obtained is itself seen as an individual sss

also indicates that this time the name of each team should b e used to refer to them The elements of dss

that are explicitly realized in sss are indicated by a b oldface font in Fig sss represents synonym

phrases such as

The Magics homecourt win against the Raptors

The home victory of the Magic against the Raptors

6

At least for the part of a sublanguage that can b e considered domain indep endent In very sp ecialized technical sub

languages a nonnegligible prop ortion of the encountered syntactic constructs may b e idiosyncratic to the domain E−beat E−winner as event as individual

agent affected location possessor opposition location

E−team E−team E−host E−team E−team E−host as individual as individual as place as individual as set as individual

name name name name

Orlando Toronto Magic Raptor

Figure sss left and sss right two dierent p ersp ectives on dss

c−game

r−winner r−host r−visitor r−loser

c−team r−beat c−team

r−home r−franchise r−home r−franchise

Orlando Magic Toronto Raptors

Figure Content units of dss explicitly realized in sss

c−game

r−winner r−host r−visitor r−loser

c−team r−beat c−team

r−home r−franchise r−home r−franchise

Orlando Magic Toronto Raptor

Figure Content units of dss explicitly realized in sss and sss

Finally in sss none of the element from dss is chosen as head Instead the head is a rhetorical no de

allowing the combining within an event structure of parts of the dierent p ersp ectives resp ectively repre

sented by sss and sss Like sss sss views the winning team as an agent but it otherwise explicitly

realizes the same elements of dss as sss and with the same p ersp ective It represents synonym phrases

as

The Magic claimed a home triumph over the Raptors

The Magic p osted a homecourt victory over the Raptors

The contrast b etween these three example SSSs illustrates the onetomany and nonisomorphic nature of

the mapping from conceptual structure to linguistic structure even indep endently of lexical considerations

The Deep Grammatical Sp ecication

The Deep Grammatical Sp ecication is a lexicalized syntactic tree Each no de corresp onds to a syntactic

constituent of the sentence to generate and sp ecies its category lexical head and syntactic features each

category is asso ciated with a set of relevant features eg deniteness for NPs mo o d for clauses It is

deep in the sense that the arcs linking a clause to its sub constituents are still thematic roles like agent

or affected instead of surface syntactic roles like subject or object and only op enclass words like

verbs nouns and adjectives are sp ecied Since the DGS is the input to a p ortable standalone syntactic

grammar comp onent it thus assumes that b oth the mapping from thematic roles to surface syntactic roles

and the choice of closedclass words like articles and pronouns is carried out by this syntactic grammar from

7

the syntactic features of the DGS The syntactic grammar also inects op enclass words while enforcing

agreement determines precedence constraints among constituents and words and nally outputs the stream

of words satisfying these constraints

The reason for the DGS to b e structured in terms of thematic roles at the clause level is that the two most

widely used p ortable syntactic grammars in generation research to day surge and nigel are b oth based on

the systemic linguistic framework Halliday Fawcett which include a semantic analysis of the

clause in terms of a set of thematic roles The advantage of such high level input to a syntactic grammar is

to maximize the numb er of generation subtasks p erformed by this reusable comp onent Its drawbacks are

discussed in detail in Section B of App endix B One is an asymmetry in the resp ective representations of

internal structure of clauses and nominals Systemic linguistic has not yet come up with a set of thematic

roles for analyzing nominals similar to the set of roles it has develop ed to analyze clauses The extreme

semantic versatility of nominals makes their analysis in terms of thematic roles a daunting if intriguing task

In the current systemic framework nominals are thus analyzed in terms of roles like describer classifier

and qualifier cf Section B of App endix B for their precise denition which are more sup ercial that

thematic roles and corresp ond more to the syntactic roles subject object complement etc of the clause

The tree structure of the DGS is very similar to that of the corresp onding SSS The main dierence

o ccurs when an arc of the SSS is realized lexically in the sentence to generate In this case it corresp onds

to a node of the DGS instead of an arc This dierence is in part ro oted in the asymmetry of the DGS layer

for clauses and nominals just evoked An example of such nonisomorphism b etween DGS and SSS is given

on the left side of Fig showing dgs the DGS lexicalizing sss by the phrase

The Magics triumph over the Raptors

The opposition arc in sss is realized lexically by the prep osition over in dgs intro ducing an additional

constituent

Like the mapping from DSS to SSS the mapping from SSS to DGS is one to many in addition to b eing

isomorphic For example the right side of Fig shows dgs an alternative lexicalization of sss by the

phrase

The Magics homecourt win against the Raptors

7

Prep ositions are the exception Even though a closed class they cannot b e chosen on purely syntactic grounds as shown by

Herskovits I thus assume that the syntactic grammar can only provide a default choice which need to b e overwritten in

the DGS of utterances for which this default is not appropriate on semantic grounds R−event1

agent range

E−team E−winner as individual as individual

name opposition location

Magic E−team E−host as set as individual

name

Raptor

Figure sss a third p ersp ective on dss including a rhetorical event

cat = NP cat = NP lex head = ‘‘triumph’’ lex head = ‘‘win’’

possessor classifier qualifier possessor classifier qualifier

cat = NP cat = noun cat = PP cat = NP cat = noun cat = PP lex head = ‘‘Magic’’ lex = ‘‘home’’ lex head = ‘‘over’’ lex head = ‘‘Magic’’ lex = ‘‘homecourt’’ lex head = ‘‘against’’

complement complement

cat = NP cat = NP lex head = ‘‘Raptor’’ lex head = ‘‘Raptor’’

number = plural number = plural

Figure dgs left and dgs right two dierent lexicalizations of sss

cat = clause lex head = ‘‘defeat’’

agent affected location

cat = proper cat = proper cat = PP lex head = ‘‘Orlando’’ lex head = ‘‘Toronto’’ lex head = ‘‘at’’

complement

cat = noun

lex = ‘‘home’’

Figure dgs an example lexicalization of sss cat = clause lex head = ‘‘post’’

agent range

cat = NP cat = NP lex head = ‘‘Magic’’ lex head = ‘‘victory’’

classifier qualifier

cat = noun cat = PP lex = ‘‘home’’ lex head = ‘‘over’’

complement

cat = NP lex head = ‘‘Raptor’’

number = plural

Figure dgs an example lexicalization of sss

In general a DGS represents more than one phrase For example dgs in Fig represents the lexi

calization of sss by the phrases

Orlando defeating Toronto at home and

At home Orlando defeated Toronto in addition to

Orlando defeated Toronto at home

For the sake of completeness Fig gives the DGS lexicalizing sss by the phrase

The Magic p osted a home victory over the Raptors

Draft representation

All three representation layers are needed in the draft representation Architectural principles and

dened in Section concerned the draft representation One required the presence of a purely conceptual

layer indep endent of linguistic form This is the DSS The other required the presence of a surface layer

reecting realization choices This is the DGS Finally as the intermediary layer that bridges the gap b etween

these two endlayers the SSS is also needed

The draft representation is thus a triple DSSSSSDGS Each of the last two elements in this triple

is a complex structure with several constituents linked by various dep endency relations Since the mapping

b etween any two of these layers is b oth onetomany and nonisomorphic the draft representation must also

include explicit markers of the corresp ondence b etween the constituents of a given layer and the constituents

of the next layer Revising a syntactic structure without explicit knowledge of what semantic element each

of its constituent realizes could have undesirable sideeects such as deletion of part of the original semantic

message intro duction of ambiguities etc An example of a threelayered draft representation is given in

Fig The SSS arcs mark the corresp ondence b etween DGS and SSS constituents while the DSS arcs

mark the corresp ondence b etween SSS constituents and DSS elements This particular threelayer structure

represents phrases like The Magics home triumph over the Raptors

For the sake of simplicity only the most imp ortant features of the draft representation are shown in this cat = NP lex head = ‘‘triumph’’

possessor classifier qualifier

cat = NP cat = noun cat = PP lex head = ‘‘Magic’’ lex = ‘‘home’’ lex head = ‘‘over’’

SSS SSS SSS complement

cat = NP lex head = ‘‘Raptor’’ number = plural

SSS

E−winner as individual

DSS possessor location opposition

E−team E−host E−team as individual as individual as set

name name

DSS Magic DSS Raptor DSS

c−game

r−winner r−host r−visitor r−loser

c−team r−beat c−team

r−home r−franchise r−home r−franchise

Orlando Magic Toronto Raptors

Figure An example of threelayered draft representation

gure In the practice of implementation additional features are needed to exactly sp ecify the phrase ab ove

as the current draft These additional features are presented in Section in the chapter describing the

implementation of the streak system

Advantages of the threelayer scheme

In a generation system utterances must b e represented by at least two layers a semantic layer used for

planning purp oses and a syntactic layer used for syntactic purp oses In the representation scheme prop osed

8

here there are two semantic layers the DSS used for text planning purp oses and the SSS which bridges the

gap b etween the DSS and the DGS The need for a representation that bridges the gap b etween conceptual

and syntactic pro cessing has b een advo cated in detail by Meteer For alternative schemes using a

single semantic representation there are two p ossibilities either the single semantic layer is like a DSS or it

is like an SSS

If it is like an SSS the domain knowledge of the generator must b e enco ded in terms of linguistically

motivated categories This is the approach advo cated by penmans Upp erMo del Bateman et al

where the facts and entities of the domain knowledge base are viewed as instances of domainindep endent

categories dened in terms of linguistic ranks and thematic roles The drawback of this approach is that

it makes problematic the representation of oating facts at a variety of linguistic ranks and by a variety of

thematic role structures

Consider again the two paraphrases b elow

X Orlando defeated Toronto at home

Y The Magics home triumph over the Raptors

In the Upp erMo del the meaning of these two sentences would have to b e represented as two distinct

facts instances of two dierent domainindep endent categories material action and p ossessed ob ject With

such an approach the only way to account for paraphrases like the two ab ove is to articially intro duce

in the domain knowledge base fact duplicates corresp onding to the dierent realization p ersp ectives In

contrast with the three layer scheme prop osed here the common meaning of such sentences is captured by a

single DSS which is in then mapp ed onto dierent SSSs each representing a dierent realization p ersp ective

X and Y thus share the same DSS dss in Fig but have dierent SSSs sss in Fig for X and sss

in for Y in the same gure The key advantage of having two semantic representations is that each one

assumes a single task the DSS abstracts from linguistic form whereas the SSS abstracts from the domain

Paraphrases that cut across linguistic ranks like the two ab ove highlight the inherent conicts b etween these

two tasks which the Upp erMo del approach attempts to reconcile within a single representation

Note that the very idea of an Upp erMo del is not questioned by this observation Having a set of

general concepts under which to attach the particular concepts of a given domain is extremely helpful for

developing a conceptual domain hierarchy The problem arises with overusing linguistic criteria and in

particular linguistic rank and thematic roles for dening these general concepts The temptation for such

overuse is great b ecause there are not many nonlinguistic criteria to fall back on for dening an Upp er

Mo del see Lenat and Guha for an attempt to build a nonlinguistically motivated general ontology

However semantic linguistic categories are b est viewed as constituting a surface p ersp ective mo del rather

than an upp er mo del for the domain

Without a DSS a system would have diculty handling paraphrases that cut across linguistic ranks

Systems with a single semantic representation that is like a DSS avoid this problem However such systems

must p erform a direct mapping from the DSS onto the DGS This mapping is a complex task for the three

following reasons

It involves several generation subtasks phrase level content planning lexicalization and semantic

grammaticalization

It is one to many

It is nonisomorphic

8

Sentence planning o ccurs during the mapping from the DSS to the SSS

The entire paraphrasing p ower of a sublanguage is captured by the mapping from a single DSS onto a

whole set of synonymous DGSs There are however three distinct asp ects to that paraphrasing p ower

Syntagmatic paraphrasing which involves cho osing b etween synonymous word combinations that are

asso ciated with dierent syntactic categories andor argument structures eg Orlando defeated

Toronto vs Orlandos triumph over Toronto

Paradigmatic paraphrasing which involves cho osing b etween synonymous word combinations for a

xed syntactic category and argument structure eg Orlando defeated Toronto vs Orlando beat

Toronto

Grammatical paraphrasing which involves cho osing b etween synonymous surface forms with the same

op enclass words but dierent syntactic prop erties eg Orlando defeated Toronto vs Orlando

defeating Toronto

In addition the mapping from DSS to DGS is in general nonisomorphic as already mentioned in sec

tions and This nonisomorphism is well illustrated in Fig containing all three repre

sentation layers for the same phrase Nonisomorphism b etween semantic and syntactic structure has b een

noted in many domains cf Talmy Talmy Zo ck and has multiple asp ects several

semantic elements can b e conated into a single syntactic element some syntactic elements may b e present

for purely grammatical reasons without corresp onding to any of the semantic elements the syntactic element

realizing the semantic head can b e deeply emb edded in the syntactic structure while the syntactic head may

realize a semantic element emb edded in the semantic structure etc

The main advantage of having an SSS layer is that it breaks down the overall complexity of mapping

a DSS onto a DGS into two stages each involving a smaller more manageable numb er of decisions In

particular

Dierent generation subtasks are handled at dierent stages phrase level planning is handled dur

ing DSSSSS mapping while lexicalization and semantic grammaticalization are handled during

SSSDGS mapping

Dierent asp ects of paraphrasing are captured by dierent stages syntagmatic paraphrasing is cap

tured by the DSSSSS mapping while paradigmatic and grammatical paraphrasing are captured by

the SSSDGS mapping

Dierent sources of nonisomorphism are handled at dierent stages

Another use for a double semantic representation is to distinguish b etween implicit and explicit realization

of content as already discussed in Section Dale adopted a double semantic representation in

epicure sp ecically for such purp ose

Pro cessing comp onents and overall control

As shown in Fig there are six comp onents in the new generation architecture I prop ose the fact

generator the discourse planner the phrase planner the lexicalizer the reviser and the syntactic grammar

Generation pro ceeds in two passes During the rst pass an initial draft is built containing only the xed facts

During the second pass oating facts including historical facts are incrementally incorp orated by tting

them opp ortunistically within the current draft The pro cessing comp onents and internal representations

involved in the draft construction are shown in gure Those involved in the draft revision and the nal

9

natural language output are shown in gure

The rst pass starts up on reception of a set of new statistics to rep ort in tabular form This table

10

of numb ers is read by the fact generator which creates a record for each of them Dep ending on the

9

Note how several pro cessing comp onents and internal representations play a role in b oth passes

10

Or part of them in a domain where the irrelevance of some facts can b e established without considering the domains

historical background New Statistics

Fact Generator Historical Statistics Database

Discourse Planner

Phrase Planner Lexicalizer

Fixed Facts

First Flat Conceptual Nets First Semantic Trees First Skeletal Lexicalized Syntactic Trees IDSS ISSS IDGS

First Draft

Stack of ADSS Floating Reviser Facts

Final Draft

Final Flat Conceptual Nets Final Semantic Trees Final Skeletal Lexicalized Syntactic Trees ODSS OSSS ODGS

Syntactic Grammar STREAK Implementation

Natural Language Summary

Figure A revisionbased generation architecture overall picture

implementation this record could b e a lisp or C structure or as in the case of ana an ops working

memory element The fact generator then queries a database of historical statistics to retrieve for each

new statistic a set of related historical statistics In the sp orts domain access to such historical databases

through a mo dem is oered by several companies that sp ecialize in such services This historical data is

used to assess the relevance of the new data and viceversa This allows the fact generator to select a set of

candidate facts b oth new and historical to p otentially include in the natural language summary It then

takes the record representing each fact and creates a DSS for each of them This second reformatting task

involves deducing the redundant information needed in the DSS for the generator to b e able to p erform

crossranking paraphrasing The need for such additional information for language generation a p ervasive

problem in generation cf McKeown and Swartout was explained in the previous section For

example in dss in Fig of the previous section the relations rloser and rbeat can b e deduced from

records for the relations rwinner rhost and rvisitor The fact generator outputs a list of a set of facts New Statistics

Fact Generator Historical Statistics Database

Discourse Planner

Phrase Planner Lexicalizer

Fixed Facts

First Flat Conceptual Nets First Semantic Trees First Skeletal Lexicalized Syntactic Trees IDSS ISSS IDGS

First Draft

Stack of Floating Facts Syntactic Grammar

Natural Language Summary

Figure A revisionbased generation architecture draft time

Each set of facts contains a new fact and p ossibly some related historical facts Each fact is represented

by a DSS augmented by a relevance grade The fact generator is the only comp onent in this architecture

that is particular to quantitative data summarization For other classes of applications another typ e of

underlying application program interface would b e needed to pro duce the corresp onding list of set of DSSs

The fact generator is out of the scop e of this thesis and has therefore not b een implemented in the prototyp e

system streak For such an implementation it would probably b e b est to further decomp ose it into several

subcomp onents one which reads the input table and pro duces fact records one which interfaces with the

historical database one which gives a relevance grade to each new or historical fact and one which pro duces

a DSS for each preselected fact

The output of the fact generator is passed to the discourse planner The discourse planner traverses a

textual schema McKeown enco ding for each sentential slot in the rep ort which typ e of xed facts

it must contain Recall from Section that xed facts are those present in every rep ort and in the same

sentence over the humanwritten mo del rep ort corpus They are contrasted with oating facts which are

present only in some rep orts and in dierent sentences over of the humanwritten mo del rep ort corpus In

the basketball domain all historical facts are oating The most imp ortant new facts are xed but the

secondary ones are also oating For each sentential slot the discourse planner searches for the xed facts

that the schema indicates for that slot in the list of set of facts it received in input For example in the

basketball domain such a schema would instruct the discourse planner to lo ok for four xed facts for the

lead sentence slot the individual statistic with the higher relevance grade the result of the game its lo cation Fixed Facts First Flat Conceptual Nets First Semantic Trees First Skeletal Lexicalized Syntactic Trees IDSS ISSS IDGS

First Draft Phrase Planner Stack of ADSS Floating Reviser Facts Lexicalizer Final Draft

Final Flat Conceptual Nets Final Semantic Trees Final Skeletal Lexicalized Syntactic Trees ODSS OSSS ODGS

Syntactic Grammar

Natural Language Summary

Figure A revisionbased generation architecture revision time

and its date The desired numb er of sentences of the summary to generate is given as an input parameter

constraining the textual schema traversal When the discourse planner reaches the end of the instantiated

schema the facts in its input list that did not t in any xed slot are put into a stack of oating facts and

ordered by decreasing relevance grade The output of discourse planner is thus twofold

A list of Initial DSSs IDSS representing the xed facts for each sentential slot These facts must b e

included in the rst draft

A stack of Additional DSSs ADSS representing the oating facts for the whole rep ort These facts

are to b e incorp orated opp ortunistically in the nal draft and are passed to the reviser

Only the rst part of the discourse planner output is used at draft time For each at conceptual network

in this IDSS the phrase planner is called and maps this network onto a corresp onding semantic tree which

represents highlevel sentence structure In the pro cess it cho oses

Which element of the conceptual network will b e explicitly realized by a linguistic element in the

corresp onding sentence and which to leave implicit

The dep endency relations among the elements chosen for explicit realization

The result is a list of Initial SSSs ISSS The lexicalizer is then called on each semantic tree in this ISSS

and maps it onto a corresp onding lexicalized syntactic tree In the pro cess it cho oses for each linguistic

constituent

Its op enclass words

Its syntactic category

The syntactic features that dier from the set of defaults provided by the syntactic grammar for that

category

The result is a list of Initial DGSs IDGS An input switch parameter can b e used to set the generator

to an incremental mo de in which each intermediate draft is displayed If this switch is on the syntactic

grammar is called on each element in IDGS and an initial draft gets generated Each sentence in this initial

textual draft is internally represented by a DSSSSSDGS triplet The list of these triplets together with

the ADSS stack constitutes the output of the drafting pass and the input to the reviser

For each draft sentence the reviser p ops an ADSS from the stack of oating facts and attempts to

incorp orate it to the sentence using the revision op erations presented in Section These op erations are

generic and only sp ecify where ie on which sub constituent and how ie using what typ e of syntactic

attachment to add the new constituent realizing the ADSS inside the draft sentence For determining the

internal content organization wording and syntactic form of this new constituent the reviser do es not rely on

revision rules It instead calls the phrase planner and the lexicalizer to carry out those tasks in the context

of the revision as indicated by the arrows linking the reviser to the phrase planner and the lexical cho oser

in Fig

Domain sp ecic stylistic and space constraints monitor the revised draft After each revision increment

the DGS of the revised draft is examined to determine whether it has reached the maximum complexity

observed for this sentence slot in the corpus of humanwritten mo del summaries For example in the bas

ketball domain such constraints indicate that the lead sentence should not exceed words in length and

levels of constituent emb edding in syntactic depth When such thresholds are reached the reviser pro

ceeds to the next draft sentence The ADSSs that were p opp ed from the stack during the revision of the

previous sentence but which could not t there due to either lack of a matching revision rule or lack of space

are then pushed back onto the oating fact stack When the generator is called in incremental mo de the

syntactic grammar is called on each revised DGS and the entire revised draft is displayed anew Otherwise

the syntactic grammar is called only once for each rep ort sentence after the entire revision pro cess has b een

completed and the intermediate drafts are not displayed

Generation subtasks revisited

In this subsection I explain how the generation subtasks dened in Section are distributed among the

various pro cessing comp onents of the architecture prop osed in Section and briey discuss the general

applicability of the new architecture

In the new language architecture presented in the previous section

Content production is carried out by the fact generator

Content selection is carried out in two rounds The rst round is p erformed under encyclop edic domain

constraints by the fact generator The second round is p erformed under syntactic lexical and space

constraints by the reviser

Discourse level content organization is carried out by the discourse planner

Phrase level content organization is carried out by the phrase planner It thus o ccurs in two rounds a

rst round when the phrase planner is called on the initial IDSS and a second round when it is called

from the reviser

Lexicalization is carried out by the lexicalizer and similarly o ccurs in two rounds one at drafttime

and one at revisiontime

Semantic grammaticalization is carried out in part by the phrase planner and in part by the lexicalizer

It this also o ccurs in two rounds one at drafttime and one at revisiontime

Morphosyntactic grammaticalization and linearization are b oth carried out by the syntactic grammar

The new generation architecture presented in the previous section was primarily motivated with one class

of language generation application in mind summarizing quantitative data However only one comp onent in

this architecture is sp ecic to this class of applications the fact generator For other classes of applications

another typ e of underlying application program interface would b e needed for content pro duction ie for

generating the DSSs input to the rest of the system The rest of the architecture is general and could b e

used for any text generation system It allows for more exibility and comp ositionality in the generation

pro cess but it is also more complex than most previously prop osed architectures What are the general

circumstances in which these improvements to the generation pro cess are really needed I b elieve any one

of the following circumstances demands in itself either the added exibility or the added comp ositionality

provided by this new architecture

The application domain sublanguage contains b oth very complex sentences and a large numb er of

paraphrasing forms for them

Stylistic surface form constraints eg space limitation need to b e tightly monitored

11

A large prop ortion of the application domain concepts are oating and of fairly even relevance

All the concepts from at least one imp ortant content class are of oating nature

Summary

In this section I have presented a new system architecture with the ability to generate very complex sentences

comp ositionally and to convey historical information in the exible and concise way it app ears in human

written summaries With this architecture generation pro ceeds in two passes The rst pass builds an initial

rep ort draft organizing and realizing xed semantic elements The second pass incrementally revises this

draft to opp ortunistically incorp orate oating semantic elements in particular historical facts Internally

the draft is represented at three levels of abstraction The most abstract level is purely conceptual At the

next level domain concepts are mapp ed to general semantic linguistic categories These abstract linguistic

resources are then mapp ed to sp ecic lexical items and syntactic forms at the next level The natural

language rep ort is then pro duced by passing this last level to a p ortable syntactic grammar

11

ie are mentioned in only a subset of the mo del text corpus and in dierent sentential slots in across this corpus

Chapter

Implementation prototyp e

the STREAK generator

Goals and scop e of the implementation

In this chapter I present the implementation of the language generator streak Surface Text Reviser Ex

pressing Additional Knowledge This implementation has two main goals demonstrating the op erationality

of the revision to ols extracted from the corpus analysis presented in Chapter and demonstrating the prac

ticality of the new generation system architecture prop osed in Chapter streak is a research prototyp e

intended as a testb ed for the new draft and revision approach to language generation advo cated in this

thesis not a nished pro duct intended for everyday use in a realworld application In order to keep the

implementation eort to a manageable size streak fo cuses on

The core of the domain sublanguage ie the lead sentences that were the ob ject of the systematic

indepth analysis presented in Chapter

The most original comp onents of the complete text generation architecture prop osed in Section ie

those involved in the revision pass of the overall generation pro cess

Recall from gures and that two comp onents in that architecture are used in b oth the initial

draft pass and the subsequent revision pass the phrase planner and the lexicalizer They are called b oth at

draft time to realize the xed facts and again at revision time to realize the oating facts In addition to the

complete revision pass the surface realization part of the draft pass is thus also implemented in streak

Input to streak is handco ded and constitutes of

A initial DSS IDSS representing all the xed facts of the lead sentence to generate

A list of ADSSs by order of decreasing imp ortance with each element in the list representing a oating

facts to attempt to opp ortunistically incorp orate to the lead sentence

streak generates either one or a set of synonymous complex sentences concisely expressing all the facts

in the IDSS plus as many facts from the ADSS list that could b e t in without exceeding the maximum

word length or syntactic depth observed in the corpus lead sentences

streak is implemented using the fufsurge package Elhadad a Elhadad b for developing

language generation application This package is presented in detail in App endix B It consists of

fuf Elhadad b Elhadad a a sp ecialpurp ose programming language for text generation

based on functional unication

surge a widecoverage grammar of English implemented in fuf and usable as a p ortable frontend

for syntactic pro cessing

fuf is the formalism part of the package a language in which to enco de the various knowledge sources

needed by a generator surge is the data part of the package one already enco ded knowledge source

usable by any generator Using the fufsurge package implementing a generation system thus consists of

decomp osing nonsyntactic pro cessing into subpro cesses and enco ding in fuf the knowledge sources for each

of these subpro cesses In the case of streak nonsyntactic pro cessing is decomp osed in phrase planning

lexicalization and revision streak thus relies on three nonsyntactic knowledge sources the phrase planning

rule base the lexicalization rule base and the revision rule base These three knowledge sources are enco ded

in fuf

Both fuf and surge had to b e extended during the development of streak to meet some of its sp ecial

needs First the nonmonotonicity of streak required the implementation of new fuf op erators to cut

and paste functional descriptions aside from unifying them These extensions to fuf were implemented by

1

Elhadad Second while surge already covered a wide variety of syntactic forms for simple clauses it only

covered a very few for complex sentences which aggregate several such clauses It also did not cover the

sp ecialized nominals of quantitative domains I implemented sizeable extensions to surge to attain wide

coverage for complex sentences and quantitative nominals This set of extensions go es far b eyond what was

needed for the sole needs of streak and constitutes in itself a signicant though not central contribution of

this thesis It was not simply an implementation matter but required to crossexamine the descriptive work of

several noncomputational linguists and integrate their resp ective analysis within the unifying computational

framework of surge For quantitative nominals some constructs I observed in newswire rep ort corp ora were

not mentioned in the linguistic literature and I had to come up with my own analysis The version of surge

resulting from these extensions has since b een used for other generation applications in addition to streak

automated do cumentation for the activity of telephone network planning engineers Kukich et al

verbal descriptions of visual scenes Ab ella and generation of business letters

The two sets of extensions to the fufsurge package are describ ed in detail App endix B They were

essentially preparatory work paving the way for development of streak as a prototyp e for a particular

application The present chapter describ es this development itself To that purp ose I rst come back to

each layer of internal draft representation abstractly dened in Section and show how it is enco ded in

2

detail as a fuf description I then present in turn the fuf implementation of each pro cessing comp onent

prop er to the streak generator ie the phrase planner the lexicalizer and the reviser Finally I comment

in detail on the trace of two runs of streak on selected examples more selected examples runs are discussed

in App endix C

The internal representation languages

In streak a natural language utterance is sp ecied at three dierent layers of representations

The Deep Semantic Sp ecication DSS which is a at conceptual network

The Surface Semantic Sp ecication SSS which is a structured semantic tree

The Deep Grammatical Sp ecication DGS which is a skeletal lexicosyntactic tree

These three layers were describ ed at the abstract design level in Section The DGS is describ ed at

the detailed implementation level during the presentation of surge in sections B and B of App endix B

The present section describ es the DSS and SSS at the detailed implementation level

1

In fact as wide as any other available syntactic pro cessing frontends for language generation

2

Since fuf is a declarative language with little syntax the examples of the next section should b e understandable in an

intuitive way However for reader interested in grasping them in the details of the implementation a selfcontained intro duction

to fuf can b e found in Section B of App endix B It contains all the fuf syntax and sp ecial features app earing in the examples

of this thesis c−team r−beat c−team

r−home r−franchise r−home r−franchise

Orlando Magic Toronto Raptors

Figure A very simple conceptual network

The DSS language

As explained in Section at the abstract design level the DSS is a at conceptual network However

at the implementation level streak relies on a single data structure the Functional Description FD

presented in detail in App endix B for enco ding each of the three abstract representation layers In the

following sections I rst describ e the general issue of enco ding a at conceptual network as an FD I then

describ e the sp ecic set of features used to enco de the subnets representing the xed and then oating

including historical facts in streaks particular application domain

Representing a at conceptual network as an FD

Each element in a conceptual subnet is represented as an FD with the following attributes

deepsemcat DEEP SEMantic CATegory whose value is entity for a concept no de and relation for

a role arc

concept for concept no des and role for role arcs

token identifying a particular instance of a given concept or role

attrs ATTRibuteS for a concept no de sp ecifying the atomic roles ie those that are not recursively

lled by a concept of the concept

args ARGumentS for a role arc p ointing to the atomic values or concepts that the arc relates

A conceptual subnet is itself represented as an FD with two attributes

ents ENTitieS containing the FD representation of each concept no de in the subnet

rels RELationS containing the FD representation of each role arc in the subnet

For example the subnet of Fig where concepts are in square b oxes and roles in oval b oxes translates

into the FD at the top of Fig The subnet and corresp onding FD are dual representations of the content

of paraphrases such as

The Orlando Magic defeated the Toronto Raptors

The Orlando Magic who triumphed over the Toronto Raptors

The Toronto Raptors who were defeated by the Orlando Magic

Note the dierence in the subnet b etween on the one hand fulledged concepts such as cteam which

are prexed by c and linked to several roles and on the other hand atomic values such as Orlando

that are linked to only a single role This dierence is paralleled among roles Fulledged roles like

rbeat relate two fulledged concepts while atomic roles like rhome relate a fulledged concept with

an atomic value As shown in FD these dierences are reected in the FD representation of a network

FD FD representation of the flat conceptual subnet of Fig p

ents winner deepsemcat entity

concept team

token magicvsraptorswinner

attrs home Orlando franchise Magic

loser deepsemcat entity

concept team

token magicvsraptorsloser

attrs home Toronto franchise Raptors

rels result deepsemcat relation

concept beat

token magicvsraptorsresult

args winner ents winner

loser ents loser

FD Structured ie unflat FD representation of the same content

deepsemcat relation

concept beat

token magicvsraptorsresult

args winner deepsemcat entity

concept team

token magicvsraptorswinner

attrs home Orlando franchise Magic

loser deepsemcat entity

concept team

token magicvsraptorsloser

attrs home Toronto franchise Raptors

Figure Representing a at conceptual net as an FD

atomic values are LISP atoms or strings whereas fulledged concepts are structured FDs app earing under

the ents attribute Similarly atomic roles are subattributes app earing under the attrs attribute inside

FDs representing fulledged concepts and they are lled with atomic values whereas fulledged roles

are structured FDs app earing under the rels attribute with args subattributes p ointing to fulledged

concepts

The b ottom of Fig contains FD an alternative FD representation for the content of FD at the

top of the same gure In this alternative representation the toplevel features ents and rels have b een

suppressed and the path values of the args subattributes replaced by emb edded FDs Although more

concise than FD this representation no longer represents a at network but instead a structured tree with

the particular bias of giving sp ecial head status to the cbeat element of the network From such tree

structure only the rst of the three paraphrases ab ove that convey the content of the at network could

b e generated since in the two others it is one of the two cteam elements of the network that heads the

linguistic structure

The idea of representing at networks as FDs by listing concepts and roles under dierent toplevel

features and indicating the link b etween them through paths instead of emb edding is due to Elhadad

b

FD STREAK input representation for the fact represented in isolation by FD

deepsemcat entity

concept game

token magicvsraptor

attrs results ents winner deepsemcat entity

concept team

token magicvsraptorswinner

attrs home Orlando franchise Magic

loser deepsemcat entity

concept team

token magicvsraptorsloser

attrs home Toronto franchise Raptors

rels result deepsemcat relation

concept beat

token magicvsraptorsresult

args winner ents winner

loser ents loser

Figure Example of fact in input format in streak

Representing xed facts in the STREAK domain

Having describ ed the domainindep endent scheme for representing a DSS network as an FD I now turn to

the sp ecic set of features used for describing the sp orts domain of streak In this domain each rep ort

generated describ es a given game Every fact the rep ort contains thus directly or indirectly concerns that

game Direct facts come from a standard b oxscore summing up the statistics of the game and indirect facts

come from a database ab out the corresp onding sp orts league and include historical statistics compiled over

time and across multiple games

streak takes advantage of b oth this common theme in each rep ort and of the ability of fuf to work with

partial information to represent any content unit as a partial description of the rep orted game Consider

for example the subnet of Fig It provides the result of the rep orted game FD at the top of Fig

represents this fact in isolation The actual input to streak for this fact is shown in Fig It views this

fact not in isolation but instead as one piece of information ab out the game to rep ort This view allows the

incremental addition of content units which characterizes the revisionbased generation mo del defended in

this thesis to reduce at the DSS layer to a succession of simple unication op erations

The skeletal frame of a game description is derived from a series of observations I made ab out the

organization of content inside the corpus of lead sentences that served as target output for streak These

observations rst presented in Section are the following

They are two dierent typ es of facts xed facts that are obligatory in any lead sentence and oating

facts that only o ccasionally app ear in the lead sentence

There are four xed facts the main statistic of a player from the winning team the game result

including the nal scoreline the lo cation of the game and its date

All corpus lead sentences whether the basic ones containing only these four xed facts or the more

complex ones that contain oating facts as well consist of two main syntactic constituents one con

taining the main statistic and one containing the game result

The lo cation is always conveyed separately as a header while the date indierently app ears emb edded

in either of these two main constituents

The oating facts cluster around either the main statistic or the game result in a way predetermined

by their semantic class eg additional statistics of a player of the winning team systematically cluster

around the main statistic while result streaks systematically clustered around the game result

Let us now see how these observations are reected in the skeletal frame of a game description The

input FD for a basic draft sentence containing only xed facts is shown in Fig The game lo cation

abbreviated addr for address and its date that together dene the general circumstances of the game and

do not inuence the sentence structure are group ed under the attribute setting The rest of the game

description consists of two subnets one for the main statistic cluster under the attribute stats and one for

the game result cluster under the attribute results

Compare FD of Fig with

FD where each fulledged role is explicitly represented by an FD under the rels attribute which

args subattributes are lled with paths p ointing to the concepts the role relates

FD where each such role is represented by an attribute whose subattributes are directly lled with

the FDs describing the concepts to which the role relates

reveals that the FD representation of a draft input in streak is really an hybrid of the pure p ointer

approach of FD and the pure emb edding approach of FD

The emb edding approach is more concise but prevents the generation of phrases where a role that is

not explicitly represented as an FD is explicitly conveyed as a linguistic constituent In the draft input of

Fig the address date score host and visitor roles are not explicitly represented b ecause the

target sublanguage of streak observed in the corpus did not contain phrases explicitly mentioning these

roles such as

the game opp osing the Magic to the Raptors was held in Orlando

the date of the game was Saturday

the score of the game was

Orlando was hosting this game against the Raptors

Instead only the address date score host and visitor concepts were explicitly mentioned with

the role linking them to the game implicitly expressed by the sentence structure as in

Orlando FL Shaquille ONeal scored p oints Friday night lifting the Orlando Magic to a

victory over the Toronto Raptors

The approach for the input content representation in streak is thus to represent explicitly as FDs

only those roles that are explicitly realized by a linguistic constituent in at least one phrase of the target

sublanguage In the sublanguage of the basic lead sentences summarizing basketball games this is the case

for the winner loser and result relations as demonstrated by the following example paraphrases

Orlandos victory over Toronto

Torontos loss at the hand of Orlando

Orlando triumphed over Toronto

For a basic main statistic cluster the role stat representing this statistic is the only one needed It is

in general explicitly realized as in ONeal scored p oints

but can also b e left implicit as in ONeals p oints

Representing oating and historical facts in the STREAK domain

The target sublanguage of streak contains ve dierent typ es of oating facts

Orlando FL Shaquille ONeal scored points Friday night

lifting the Orlando Magic to a victory over the Toronto Raptors

FD

deepsemcat entity

concept game token toratorl

attrs setting ents Orlando FL

addr deepsemcat entity

concept address token orlfl

attrs city Orlando state FL

Friday night

date deepsemcat entity

concept date token frinite

attrs dayname Friday daypart night

Shaquille ONeal scored points

stats ents statca deepsemcat entity

concept player token oneal

attrs firstname Shaquille lastname ONeal

stat deepsemcat entity

concept gamestat token onealptvsden

attrs value unit pt

rels stat deepsemcat relation

role gamestatrel token onealscoringvsden

args carrier ents statca stat ents stat

lifting the Orlando Magic to a victory over the Toronto Raptors

results ents host deepsemcat entity

concept team token magic

attrs home Orlando franchise Magic

visitor deepsemcat entity

concept team token raptors

attrs home Toronto franchise Raptor

score deepsemcat entity

concept score token toratorlscore

attrs win lose

rels winner deepsemcat relation

role winner token toratorlwinner

args game toplevel winner ents host

loser deepsemcat relation

role loser token toratorlloser

args game toplevel loser ents visitor

result deepsemcat relation

role beat token toratorlresult

args winner ents host loser ents visitor

Figure Example of input FD for basic draft

Additional game statistics rst order nonhistorical fact

Nonstatistic background prop erties rst order nonhistorical fact

Game statistic records second order historical fact

Game result streaks second order historical fact

Streak records third order historical fact

I review the FD representation of each of these typ es in turn in the following subsections

Additional statistics Each additional statistic is distinguished by an integer inversely pro

p ortional to their relevance The subnet representing any statistic either the main statistic or an additional

one consists of a single statN relation with two arguments one p ointing to the statN concept representing

the value of the statistic and the other p ointing to the statNca STATistic numb er Ns CArrier concepts

representing its carrier ie the player or team whose p erformance is summedup by the statistic The FD

under stats in Fig is an example of such subnet

Background prop erties Similar in structure to game statistics background prop erties also

consists of a single role whose arguments p oints to two concepts one describing the prop erty and the other

its carrier For example the reserve status of the player Jay Humphries is represented by the following FD

deepsemcat entity

concept game

token utaatbos

attrs stats ents statcastatus deepsemcat entity

concept reserve

token humphriesreserve

statca deepsemcat entity

concept player

token humphries

attrs firstname Jay lastname Humphries

rels statcastatus deepsemcat relation

role playerstatus

token humphriesstatus

args player ents statca

status ents statcastatus

Note how the prop erty carrier is identied by way of the statistic that he p erformed This reects the

fact that in the target sublanguage such background fact is always opp ortunistically woven to the expression

of a statistic as in

Jay Humphries came o the b ench to score p oints or

Jay Humphries scored p oints o the b ench or

Reserve Jay Humpries scored p oints

instead of a separate sentence such as Jay Humphries is a reserve player

Statistic records A record up date relates a new statistic to an historical statistic An his

torical statistic is an abstract set of statistics As opp osed to an isolated statistic which consists of a single

relation an historical statistic is a complex conceptual subnet relating multiple concepts

duration the time span over which the set of statistics has b een compiled

genelt the GENeric ELemenT of the set an abstract sp ecication of the particular typ e of statistics

that the set contains

geneltca the CArrier of the GENeric ELemenT common to all statistics in the set New Statistic carrier c−stat−rel stat

c−stat

unit value

39

c−player

first−name last−name

> Record Update Karl Malone c−point

carrier stat c−stat unit

c−stat−rel c−season c−integer

gen−elt duration max−val

Historical

Statistic c−histo−stat

Figure Example of conceptual network containing an historical record fact

maxval or minval the MAXimum or MINimum value for the particular of typ e statistic sp ecied

by genelt and geneltca and over the p erio d of time sp ecied by duration

This internal complexity of an historical statistic makes a statistic record up date a second order fact

as opp osed to rst order facts that relate only single no des in the conceptual net such as the additional

statistics or background prop erties presented ab ove An example subnet representing an historical statistic

in is given in Fig inside the b ottom square b ox It abstractly represents the set of Karl Malones scoring

statistics over the current season

The top subnet in the dotted b ox of the gure contains the new statistic onto which streak attaches the

related historical one This attachment is rendered p ossible by the fact that b oth facts share some elements

in the case at hand the carrier of the statistic and its unit in the intersecting p ortion of b oth b oxes

The record up date is shown by the arc containing the sign which express that the new statistic value

surpasses the maximum value of the historical statistic The FD enco ding of this statistic record up date

subnet is given in Fig

FD representing the record update fact conveyed by the uppercased constituent in the paraphrases

Karl Malone scored A SEASON HIGH points

Karl Malones SEASON BEST points

Karl Malone BROKE HIS SEASON RECORD WITH points

deepsemcat entity

concept game

token utaatlac

attrs stats ents histostat deepsemcat entity

concept histostat

token kmaloneptatlacrefset

histostatduration deepsemcat entity

concept season

token kmaloneptatlacrefsetduration

histostatgenelt deepsemcat entity

concept gamestat

attrs unit pt

histostatgeneltca deepsemcat entity

concept player

token kmalone

attrs firstname Karl lastname Malone

histostatextr deepsemcat entity

concept integer

token kmaloneptatlacrefsetextr

rels histostatduration deepsemcat relation

role duration

token kmaloneptatlarefsetdurationrel

args set ents histostat

duration histostatduration

histostatgenelt deepsemcat relation

role genelt

token kmaloneptatlacrefsetgenelt

args set ents histostat

genelt ents histostatgenelt

histostatgeneltca deepsemcat relation

role gamestatrel

token kmaloneptatlacrefsetgeneltca

args carrier ents histostatgeneltca

stat ents histostatgenelt

histostatextr deepsemcat relation

role maxval

token kmaloneptatlacrefsetextrrel

args extrval ents histostatextr

set ents histostat

histostatupdate

deepsemcat relation

role

token kmalonebreakptseasonhighatlac

args statval

histostatextr ents histostatextr

Figure FD representation of the subnet of Fig

Result streaks Similarly to historical statistics that are abstract sets of statistics result

streaks are abstract sets of games A streak up date is a second order fact relating a new game result with

a streak sp ecifying whether the new result extends or interrupts the streak The streak itself is a complex

conceptual subnet relating multiple concepts

genelt the generic element of the set an abstract sp ecication of the particular typ e of games that

the streak is made of

geneltwinner or geneltloser the winner or loser common to each game in the streak

genelthost or geneltvisitor the host or visitor common to each game in the streak re

sp ectively for home and road streaks for overall streaks this concept is not needed

card the cardinal of the set ie the duration of the streak in numb er of games

An example FD representing a streak extension content unit is shown in Fig

Streak records A historical streak is an abstract set of streaks and hence an abstract set

of an abstract set of games A record streak up date is thus a third order fact that relates a new game

result with an historical streak sp ecifying whether the new result extend one element of the historical streak

making it the longest element of that set An historical streak is the most complex entity of the streak

subdomain and involves the following relations

duration the time span over which the set of streaks have b een compiled

genelt the generic element of the set an abstract sp ecication of the particular typ e of streaks that

the set contains

geneltgenelt an abstract sp ecication of the common typ e of games that each streak in the set

is made of

geneltgeneltwinner or geneltgeneltloser the winner or loser common to each game

of each streak that the set contains

geneltgenelthost or geneltgeneltvisitor the host or visitor common to each game

of each streak that the set contains for sets of home or road streaks for sets of overall streaks this

relation is not needed

maxcard the maximum cardinal for any streak in the set ie the length in numb er of games of

the longest streak in the historical streak

3

An example FD representing a streak record up date content unit is shown in Fig In this example

the duration of the historical streak is indirectly dened in terms of the lifetime of the common game host

of each element in the historical streak

The SSS language

As explained in Section and illustrated by the detailed example given in the previous section a DSS

represents b oth the explicit and the recoverable content of an utterance and it do es so at a very ne grain

Each element in a DSS network emb o dies no more that the meaning of a single word or even a single semantic

feature of a word Moreover since it simultaneously represents multiple p ersp ectives on some concepts only

a few of these ne grained elements end up explicitly realized by a lexical item in any given utterance

generated from the DSS Finally a DSS is at with no notion of constituency or dep endency

It the phrase planner that decides which elements of the DSS network to explicitly realize how to

aggregate these elements into linguistic constituents and what are the dep endency relations among them

These decisions result in the SSS semantic tree There are two distinct typ es of elements in such a tree

3

To make it t on a single page token features have b een removed from this FD

FD representing the content of paraphrases such as

the Utah Jazz handed the Boston Celtics their sixth straight home defeat

the Utah Jazz extended the Boston Celtics losing streak to six games

deepsemcat entity

concept game

token utaatbos

attrs results ents streak deepsemcat entity

concept streak

token bosstreakvsuta

attrs card

streakgenelt deepsemcat entity concept game

streakgenelthost deepsemcat entity

concept team

token celts

attrs home Boston franchise Celtic

rels streakgenelt deepsemcat relation

role genelt

token bosstreakvsutagenelt

args set ents streak

genelt ents streakgenelt

streakgeneltloser deepsemcat relation

role loser

token bosstreakvsutaloser

args game ents streakgenelt

loser ents host

streakgenelthost deepsemcat relation

role host

token bosstreakvsutahost

args game ents streakgenelt

host ents host

streakext deepsemcat relation

role streakextension

token bosstreakvsutaext

args extension toplevel

streak ents streak

Figure FD representing a streak extension

FD representing the record streak update fact conveyed by the uppercased constituent in the paraphrases

the Boston Celtics FRANCHISE RECORD six games home losing streak

the WORST EVER sixth consecutive defeat at home for the Boston Celtics

deepsemcat entity

concept game

attrs

results

ents histostreak deepsemcat entity concept histostreak

histostreakgenelt deepsemcat entity concept streak

histostreakgeneltgenelt deepsemcat entity concept game

histostreakextrcard deepsemcat entity concept integer

histostreakgeneltgenelthost deepsemcat entity

concept team

attrs home Boston franchise Celtic

hostlifetime deepsemcat entity concept duration

rels hostlifetime deepsemcat relation

role lifetime

args entity ents host

lifetime ents hostlifetime

histostreakduration deepsemcat relation

role duration

args entity ents histostreak

duration ents hostlifetime

histostreakgenelt deepsemcat relation

role genelt

args set ents histostreak

genelt ents histostreakgenelt

histostreakgeneltgenelt deepsemcat relation

role genelt

args set ents histostreakgenelt

genelt ents histostreakgeneltgenelt

histostreakgeneltgeneltloser deepsemcat relation

role loser

args game ents histostreakgeneltgenelt

team ents host

histostreakgeneltgenelthost deepsemcat relation

role host

args game ents histostreakgeneltgenelt

team ents host

histostreakextrcard deepsemcat relation

role maxcard

args card ents histostreakextrcard

set ents histostreak

histostreakupdate deepsemcat relation

role

args streaklen

histostatextr ents histostreakextrcard

Figure FD representing a streak record up date

Encyclopedic subtrees identied by the feature surfsemcat encyclo that corresp ond to a whole

chunk of the DSS network

Rhetorical subtrees identied by the feature surfsemcat rhetor that are aggregates of several

scattered chunks of the DSS network linked by rhetorical relations inside the SSS

Encyclop edic subtrees are structured by the fact generator whereas rhetorical subtrees are structured by

the phrase planner I describ e these two dierent typ es of SSS subtrees in turn in the following subsections

As opp osed to a at network a structured tree can b e trivially represented by an FD As explained in

Section B of App endix B any FD corresp onds to a directed graph Since paths are the only features

that intro duce the p ossibility of several arcs reaching the same no de a pathfree FD reduces to a tree with

each of its attributes corresp onding to an arc and each of its subFDs to a dep endent subtree

Encyclop edic semantic subtrees

In addition to surfsemcat an encyclop edic subtree contains two more obligatory attributes

concept the concept or role in the DSS subnet that the subtree realizes at the SSS layer

onto sp ecifying the ONTOlogical p ersp ective on the concept and whose value can b e either indiv

for INDIVidual event quality quantity place or time

Dierent optional attributes are available for subtrees with dierent onto features

For events the args for ARGumentS attributes dene the thematic structure of the event these

attributes closely corresp ond to the DGS attributes dening the thematic structures of pro cesses in

surge cf Section B of App endix B

For qualities quantities places and times the concept sp ecic restrictors attributes circumscrib e a

sp ecialization of the general concept indicated by the concept attribute

For individuals similar restrictors attributes for referring to the individual using a common NP

or the names attributes for referring to the individual using a prop er NP

Finally any encyclop edic subtree may also contain a root attribute which p oints to either the concept

or one of the optional attributes of the subtree and denes which semantic element is to head the linguistic

constituent corresp onding to this subtree

Rhetorical semantic subtrees

As explained in Section there are three distinct typ es of rhetorical semantic subtrees paratactic

complexes hyp otactic complexes and rhetorical events These three typ es are distinguished by the struct

for STRUCTure attribute with resp ective values paratax hypotax and event

Paratactic complexes represent conjunctions and app ositions and have two other attributes

elts for ELemenTS containing each element of the complex

rel for RELation sp ecifying the semantic relation underlying the grouping of these element inside the

paratactic complex and whose value in the current streak subdomain can b e temporalinclusion

teammate coagent coref for COREFerent or cooccur for COOCCURrent

Hyp otactic complexes represent complex clauses or nominal and have two other toplevel attributes

root whose value is an emb edded description for the head of the complex

rels for RELationS whose subattributes are the rhetorical relations linking the dep endent con

stituents of the complex to its head the description of each dep endent constituent is emb edded under

one these subattributes

The list of rhetorical relations currently used in the streak subdomain is cardinal cardof for

CARDinal OF ie the converse of cardinal compar for COMPARison score cooccur duration

standard ordinal opposition result possessor time location agentof frequency

instrument and type Most of them corresp ond either directly or as converse relation to DGS attributes

used in surge in the description of complex nominals clause participants clause predicate mo diers and

clause circumstancials cf Sections B and B of App endix B

Rhetorical events represent simple clauses whose head verb do es not realize any sp ecic element of the

DSS network and also have two other toplevel attributes

root a general event class eg transfp oss for TRANSFer of POSSession for a comp osite mate

rialp ossessive pro cess

args for ARGumentS containing the event participants cast in the argument structure corresp onding

to that event class indicated in the ro ot

These two attributes are the same as those for encyclop edic events Finally all three typ es of rhetorical

subtrees can also contain a focus attribute containing a path indicating which one among the other attributes

should app ear up front in the sentence to generate

Example FDs enco ding SSSs

In Section I gave the design level SSS for three phrases conveying the same content from dierent

p ersp ectives using dierent linguistic structures I now give the FD enco ding of each of these three SSSs

These FDs illustrate how syntagmatic paraphrasing is represented at the detailed implementation level

They also provide a variety of examples for b oth the encyclop edic and rhetorical subtrees discussed in the

previous section

The FD enco ding of sss shown on the left side of Fig p is given in Fig that of sss shown

on the right side of Fig p is given in Fig and that of sss shown in Fig p is given in

Fig Since these three examples are either hyp otactically structured or event structured an example

FD enco ding a paratactically structured SSS is given in Fig

The draft representation

As explained in Section the representation of the draft in streak incorp orates all three layers of

abstraction the DSS the SSS and the DGS In addition it also contains an explicit representation of the

corresp ondence rst b etween the constituents of DGS and the constituent of the SSS and then b etween the

constituents of the SSS and the elements of the DSS

At the implementation level the draft is represented by a threelayer FD with the following conventions

The toplevel of the FD is the DGS of the draft

The toplevel constituent of the DGS contains a sp ecial attribute sss whose value is the SSS for the

whole draft

In turn the toplevel constituent of this SSS contains a sp ecial attribute dss whose value is the DSS

for the whole draft

Each sub constituent C in the DGS also contains an sss attribute As opp osed to the toplevel sss

dg s

attribute whose value is an FD the value of such an emb edded sss attribute is a path p ointing to the

sub constituent C in that FD that C realizes at the DGS layer

sss dg s

Similarly each sub constituent C in the SSS contains an dss attribute p ointing to the DSS element

sss

that C realizes at the SSS layer

sss

FD representing the SSS for paraphrases such as

Orlando defeated Toronto at home

Orlando triumphed over Toronto at home

surfsemcat rhetor

struct hypotax

root surfsemcat encyclo

onto event

concept beat

args agent surfsemcat encyclo

onto indiv

concept team

names home Orlando

root names

affected surfsemcat encyclo

onto indiv

concept team

names home Toronto

root names

focus args agent

rels location surfsemcat encyclo onto place concept host

focus root

Figure FD enco ding of sss

FD representing the SSS for paraphrases such as

The Magics home victory over the Raptors

The homecourt win of the Magic against the Raptors

surfsemcat rhetor

struct hypotax

root surfsemcat encyclo onto indiv concept winner

rels possessor surfsemcat encyclo onto indiv concept team names franchise Magic

location surfsemcat encyclo onto indiv concept host

opposition surfsemcat encyclo onto indiv concept team names franchise Raptor

Figure FD enco ding of sss

FD representing the SSS for paraphrases such as

The Magic claimed a home win over the Raptors

The Magic posted a homecourt victory against the Raptors

surfsemcat rhetor

struct event

root activity

args agent surfsemcat encyclo

onto indiv

concept team

names franchise Magic

root names

range surfsemcat rhetor

struct hypotax

root surfsemcat encyclo onto indiv concept winner

rels location surfsemcat encyclo onto indiv concept host

opposition surfsemcat encyclo

onto indiv

concept team

names franchise Raptor

root names

focus args agent

Figure FD enco ding of sss

In Fig of Section I gave a design level graph representation of the draft for the phrase the

Orlando Magic defeated the Toronto Raptors In Fig I give the corresp onding implementation level

FD representation of the same draft The sss and dss features are emphasized in upp ercase

The pro cessing comp onents

Aside from the p ortable syntactic frontend surge streak prop er is made of three comp onents the

phrase planner the lexicalizer and the reviser The design and task of these three mo dules were describ ed

in Section All three mo dules were implemented declaratively in fuf The rst two mo dules use the

topdown recursive unication mechanism of fuf as a rule interpreter The set of phrase planning rules

and the set of lexicalization rules are factorized into two hierarchies each represented as a Fuf Grammar

FG In such a hierarchy several rules sharing a common part are group ed in a fuf conjunction in which

the toplevel features enco de the common part and are followed by a fuf disjunction enco ding the distinct

parts prop er to each rule The reviser also consists of an interpreter and a declarative rule base However

since the application of revision rule is in general a nonmonotonic op eration the revision rule interpreter is

far more complex than the phrase planning and lexicalization rule interpreters and do es not entirely rely on

unication It also relies on an array of lowlevel FD manipulation functions provided by the fuf package

I present the implementation of the three comp onents in detail in the following subsections

The phrase planner

The task of the phrase planner is to map an input DSS onto an output SSS As explained in Section

this task involves

Picking which element in the DSS network to realize explicitly in a word of the sentence to generate

and thus which to leave implicit recoverable from either the sentence structure or knowledge ab out

FD representing the SSS for paraphrases such as

Barkley scored points and Majerle added

Barkley struck for points and Majerle fired in

surfsemcat rhetor

struct paratax

elts surfsemcat encyclo

onto event

concept gamestatrel

args agent surfsemcat encyclo

onto indiv

concept player

names lastname Barkley

root names

created surfsemcat encyclo

onto quantity

concept gamestat

restrictors value unit pts

root restrictors

surfsemcat encyclo

onto event

concept gamestatrel

args agent surfsemcat encyclo

onto indiv

concept player

names lastname Majerle

root names

created surfsemcat encyclo

onto quantity

concept gamestat

restrictors value

root restrictors

Figure Example FD enco ding a paratactically structured SSS

the domain

Cho osing an ontological p ersp ective from which to present each picked element

Grouping the picked element into constituents

Cho osing rhetorical relations linking the picked elements inside each constituent

The input to the phrase planner consists of an FD representing a DSS This FD is emb edded under a dss

attribute and the result is then unied with the FG representing the phrase planning rule base The output

is an enriched FD containing at its toplevel the SSS corresp onding to the input DSS A round of phrase

planning thus simply consists of a single topdown recursive unication op eration Rememb er however that

the phrase planner is called several times during the generation of a sentence once to plan the initial draft

sentence and then rep eatedly to plan each phrase incorp orated to the draft during revision

It is the output SSS layer that is built topdown during unication The toplevel SSS constituent is

built rst This task includes translating the features of the DSS element chosen as the SSS head and then

casting other related DSS elements inside the sub constituent structure of this toplevel SSS Once this is

done recursion starts on each of these sub constituent

cat clause

tense past

process type material lex defeat

partic agent cat proper

head cat teamname

franchise lex Magic

home lex Orlando

SSS SSS ARGS AGENT

affected cat proper

head cat teamname

franchise lex Raptor

home lex Toronto

SSS SSS ARGS AFFECTED

SSS surfsemcat encyclo

onto event

concept beat

args agent surfsemcat encyclo

onto indiv

concept team

names home Orlando

root names

DSS DSS RELS RESULT ARGS WINNER

affected surfsemcat encyclo

onto indiv

concept team

names home Toronto

root names

DSS DSS RELS RESULT ARGS LOSER

focus args agent

DSS ents host deepsemcat entity

concept team token magic

attrs home Orlando franchise Magic

visitor deepsemcat entity

concept team token raptors

attrs home Toronto franchise Raptor

rels winner deepsemcat relation

role winner token toratorlwinner

args game toplevel winner ents host

loser deepsemcat relation

role loser token toratorlloser

args game toplevel loser ents visitor

result deepsemcat relation

role beat token toratorlresult

args winner ents host loser ents visitor

Figure FD representing the draft phrase the Orlando Magic defeated the Toronto Raptors

SSS surfsemcat encyclo

onto event

concept beat

focus args agent

args agent DSS DSS ATTRS RELS RESULT ARGS WINNER

affected DSS DSS ATTRS RELS RESULT ARGS LOSER

DEEPMAPCSET args agent args affected

DSS cf Fig p

Figure Partial SSS after toplevel unication

To illustrate how an SSS layer is built on an example consider again the SSS layer of the draft FD in

Fig It results from the recursive unication of the DSS layer in the same draft FD with the FG

emb o dying the phrase planning rules The partial SSS obtained after the toplevel unication is shown in

Fig

At this p oint the following decisions have b een made

Out of the ve elements in the DSS network only three will b e explicitly realized by a linguistic element

in the sentence generate the two entities host and visitor and the relation result

This result relation will head the linguistic realization

It will b e presented as a material event

The winner argument of this relation which in this particular case is also the host will b e mapp ed

onto the agent of the material event this is indicated by the DSS p ointer lling agent

The loser argument of this relation which in this particular case is also the visitor will b e mapp ed

onto the agent of the material event this is indicated by the DSS p ointer lling affected

The deepmapcset metaattribute indicates that rst the subFD under agent and then the subFD

under affected must b e in turn unied with the phrase planning FG After b oth these unications are

completed the resulting SSS is the nal one shown in Fig

The phrase planning rules relevant for the toplevel mapping stage of the example SSS ab ove are shown

4

in Fig They are enco ded as the FG conjunction gameresultmap

This conjunction enco des three alternative rules for planning the game result There are two main

parts in this conjunction The rst part is the common precondition of these three rules testing the input

DSS This precondition is enco ded by the sub conjunction gameresult at the b ottom of the gure This

sub conjunction is a DSS pattern which matches any subnet representing a game result The second part

contains the dierent SSS output building actions prop er to each of these three rules

Rule enco des the realization of the game result as a clause headed by a full verb This full verb conveys

the result relation of the DSS The output of Fig results from the application of rule Rule enco des

the realization of the game result as a clause headed by a supp ort verb This supp ort verb do es not convey

in itself any element in the DSS but only serves as a syntactic supp ort for its range argument that realizes

the winner relation of the game result The choice b etween the rules and is random It enco des the

paraphrasing alternative b etween

the Orlando Magic defeated the Toronto Raptors Rule and

the Orlando Magic claimed a victory over the Toronto Raptors Rule

4

This conjunction is in fact a simplied version of the one actually used in the co de which also maps the nal scoreline of

the game ignored here as part of the game result

defconj gameresultmap

dss gameresult

ALT fullresult

alreadymapped none

RALT fullresultstruct

Game result planning rule clauses headed by full verb

eg WINNER defeated LOSER

surfsemcat encyclo

onto event

concept dss rels result role

args agent dss dss rels result args winner

affected dss dss rels result args loser

deepmapCSET args agent args affected

Game result planning rule clauses headed by support verb

eg WINNER claimed a victory over LOSER recurse for part

surfsemcat rhetor

struct event

root activity

args agent dss dss rels winner args winner

range dss dss

alreadymapped winner

deepmapCSET args agent args range

Game result planning rule NP eg a victory over LOSER

alreadymapped under winner

surfsemcat rhetor

struct hypotax

focus rels score

root dss dss rels winner

rels opposition dss dss rels loser args loser

deepmapCSET root rels opposition

defconj gameresult

ents host deepsemcat under entity concept under team

visitor deepsemcat under entity concept under team

rels winner deepsemcat under relation

role under winner

args game GIVEN winner GIVEN

loser deepsemcat under relation

role under loser

args game GIVEN loser GIVEN

result deepsemcat under relation

role under beat

args winner GIVEN loser GIVEN

Figure Example of phrase planning rules

There are many such random choices in the phrase planner They enco de options that are in most

circumstances equivalent Their presence insures that the generator will not always pro duce the same

output linguistic form when rep eatedly given in input the same typ e of information to rep ort In order to

force the choice of one of these random alternatives a partial description of the desired output can b e passed

as input to the phrase planner as an optional parameter Only the options that can unify with this partial

output can then b e chosen in each disjunction of the phrase planning FG

In the topdown pro cess of planning an entire draft there are cases when the winning team has already

b een mapp ed inside a matrix SSS constituent b efore the realization of the game result starts In such

contexts none of the two forms ab ove can b e used This the case for example when at the time of recursing

on the game result subnet the phrase planner has already built the SSS for a toplevel draft such as

b elow In this draft form the use of a materiallo cative event to link the main statistic to the game result

instead of a temp oral relation as in or a paratactic complex as in prohibits the application of rules

and for planning the structure of the game result itself

Shaquille ONeal scored p oints leading the Orlando Magic to

Shaquille ONeal scored p oints as the Orlando Magic defeated the Toronto Raptors

Shaquille ONeal scored p oints and the Orlando Magic claimed a victory over the Toronto Raptors

Rule handles these cases It realizes the game result as an NP such as a victory over the Toronto

Raptors which do es not mention the winning team The feature alreadymapped tests whether at the

moment of mapping the game result subnet the winning team has or not already b een mapp ed in the

matrix SSS constituent Since this NP realization of the game result p erfectly ts the pattern of the range

argument of the supp ort verb clause realization this range argument is built through recursion within this

very FG conjunction

The lexicalizer

The task of the lexicalizer is to map an input SSS onto an output DGS As explained in Section this

task involves

Cho osing for each semantic constituent in the SSS the syntactic category of the linguistic constituent

realizing it at the DGS layer

Cho osing the op enclass lexical items of this linguistic constituent

Sp ecifying the nondefault values for the syntactic features asso ciated with the chosen syntactic cate

gory

Mapping the rhetorical relations linking the semantic constituents at the SSS layer onto the thematic

and syntactic roles linking the corresp onding syntactic constituents at the DGS layer

The phrase planner and the lexicalizer share the same interpreter but each uses it with a dierent rule

base Lexicalization is therefore implemented as the topdown recursive unication of an FD where the

input SSS has b een emb edded under an sss attribute with an FG enco ding the lexicalization rule base

For a linguistic constituent C realizing at the DGS layer an SSS subtree C the choice of syntactic

dg s sss

category dep ends essentially on the onto feature of C and the choice of lexical head dep ends essentially on

sss

5

the concept feature of C This is true for b oth simple and complex linguistic constituents However at

sss

the SSS layer the latter are represented by rhetorical subtrees lacking these two features onto and concept

that are prop er to encyclop edic subtrees as explained in Section These rhetorical subtrees are

recursive structures whose root attribute can b e lled either by an encyclop edic subtrees end of recursion

or by another rhetorical subtree recursion For each syntactic constituent the rst task of the lexicalizer

5

Only hypotactic complexes are considered here since the paratactic ones do not have a lexical head and those of typ e list

do not even have a syntactic category

is to ll an attribute sssroot which p oints to the head encyclopedic element of the input SSS subtree For

encyclop edic subtrees the value of this attribute is simply sss But for rhetorical ones it can b e a longer

path following several root indirections down the recursive rhetorical structure For example when given

as input the SSS at the top of Fig p the lexicalizer would add a pair sssroot sss root

By making the relevant encyclop edic features accessible at the toplevel this sssroot feature allows the

choice of syntactic category and lexical head to b e handled uniformly regardless of the inputs rhetorical

complexity

To illustrate the pro cess of lexical choice in general let us consider verb choice Excerpts from the fuf

disjunction enco ding verb choice rules in streaks lexicalizing FG are shown in Fig A rst high level

lo ok at this disjunction reveals several p oints ab out lexical choice

In general verbs at least nite ones are appropriate to realize the head constituent of event descrip

tions This is indicated by the fact that the testing part of each branch is under sssroot and that all

branches corresp ond to one of the two typ es of event descriptions at the SSS layer rhetorical branch

and encyclop edic branch and

The choice of verb is rst indexed by either the concept feature for encyclop edic events or the general

event typ e sp ecied by the root feature in rhetorical events that they can realize A p olysemous

verb can thus app ear in as many dierent branches as it has senses

Each b ottom branch represents a set of synonymous verbs that are randomly chosen

Single word verbs and prep ositional verbs are uniformly represented

The rst two branches of the disjunction enco des some of the head verbs options for clauses realizing

game results They are the options involved for the typ e of game result input SSS built by the application

of the phrase planning rules discussed in the previous section Each ts only a given syntactic structure the

rst branch ts supp ort verb headed clauses with agent and range participants as in

Orlando p osted a victory over Toronto

while the second branch ts full verb headed clauses with agent and aected participants as in

Orlando routed Toronto

This typ e of syntactic structure constraints on lexical choice underlines the dual asp ects of lexicalization

The syntagmatic asp ect consisting of selecting a word such that the constituents of the semantic input

can b e cast in the argument structure of the word

The paradigmatic asp ect consisting of selecting a word that can realize the head concept in the semantic

input

In streak these two asp ects are separated in dierent disjunctions of the lexicalization FG The

surfmapverbs disjunction shown in Fig only enco des the paradigmatic asp ect of verb choice The

syntagmatic asp ect is enco ded in the surfmappartic disjunction

The third branch in surfmapverbs enco des the choice of verbs for realizing game statistics It illustrates

how constraints other than the ro ot concept and the syntactic structure inuence lexical choice The rst

of these constraints enco ded in the statnum feature is passed to the lexicalizer by the reviser It tests

whether the game statistic clause under construction is the rst among the additional statistics added to

the the main statistic during the revision of the draft It indicates that in such a case the most appropriate

verb to cho ose is to add as in Charles Barkley scored p oints and Dan Ma jerle added

How such constraints are passed from the reviser to either the phrase planner or the lexicalizer is explained

in the next section

The rest of the disjunction distinguishes b etween versatile verbs eg to have valid for lexicalizing

any typ e of game statistic as shown by

Barkley had p oints and Barkley had reb ounds

defalt surfmapverbs

activity ie verbs part of collocation with range object

sssroot root under activity

struct under event

proc lex RALT claim record post pull out clinch nail down

beat

sssroot concept under beat

onto under event

proc lex RALT defeat beat triumph over coast past down rout

game statistic

sssroot concept under gamestatrel

onto under event

args created concept under gamestat

ALT statnum

For st additional statistic use to add

statnum under proc lex add

Versatile verbs for any type of statistics

RALT gamestatrelvspecificity

proc lex RALT have finish with wind up with collect

Verbs specialized by statistic type

ALT statunit

sssroot args created restrictors unit under pt

proc lex RALT score net pump in fire in strike for

sssroot args created restrictors unit under reb

proc lex RALT grab haul in get snare pull down

Figure Enco ding verb choice in streak

from sp ecialized ones valid for only one sp ecic typ e of statistics as shown by

Barkley red in p oints and Barkley pulled down p oints while

Barkley pulled down p oints and Barkley red in p oints

The reviser

The input to the reviser is twofold

a threelayer FD enco ding the initial draft at the three layers of abstraction DSS SSS and DGS such

as the example given in Fig

A list of FDs each enco ding a DSS subnet representing a oating fact to opp ortunistically incorp orate

to the draft in order of decreasing relevance

The output of the reviser is a threelayer FD enco ding the nal draft which incorp orates the content

conveyed by the initial draft plus as many oating facts as could b e added without exceeding a lexical length

of and a syntactic depth of for the nal draft

The reviser consists of three parts

The declarative revision rule base enco ding the revision op erations presented in Section

The revision rule interpreter that takes as input b oth the current draft and a single oating fact and

p erforms one revision increment

The revision monitor that rep eatedly calls the revision rule interpreter and controls the overall incre

mental revision pro cess

I discuss each of these three parts in turn in the following subsections

The revision rule base

A revision rule has two parts a Left Hand Side LHS which sp ecies the conditions in which the rule can

apply and a Right Hand Side RHS which sp ecies the set of transformations that the draft undergo es

during the application of the rule It is thus enco ded as an FD with two toplevel attributes lhs and

rhs The revision rule base is enco ded as an FG where revision rules are factorized in a recursive set of

fuf disjunctions and conjunctions The resulting FG structure parallels the hierarchical structure of the

revision op erations presented in Section The toplevel disjunction distinguishes among the general classes

of revision to ols such as adjoin absorb or adjunctization and the emb edded disjunctions enco de the

lowerlevel distinctions in terms of the accompanying side transformations the semantic andor syntactic

roles of the constituent added displaced or deleted from the draft by the op eration etc

The LHS of a revision rule is enco ded as an FD with three attributes

bls Base Layered Sp ecication whose value is a threelayer FD dening the typ e of drafts onto which

the rule can b e applied

adss Additional Deep Semantic Sp ecication whose value is a DSS FD dening the typ e of oating

facts that the rule can incorp orate to the draft

tool whose value is the name of the revision op eration enco ded by the rule

This last attribute is optional and is used only for control and testing purp oses When it is not present

the reviser uses the rst rule whose bls feature matches the current draft and whose adss matches the

rst DSS in the ordered oating fact list Each revision op eration whether general or sp ecic is given a

name These names are group ed into a hierarchy paralleling the sp ecialization hierarchy of the corresp onding

op erations using the definefeaturetype construct of fuf

The RHS of a revision rule simply consists of a list of revision actions These revision actions are the

building blo cks used to implement the structural transformations involved in the application of the revision

rules The revision workspace onto which each of these actions op erate is an FD with four toplevel attributes

bls Base Layered Sp ecication whose value is the threelayer FD enco ding the old draft b efore the

application of any RHS action

rls Revised Layered Sp ecication whose value is the threelayer FD enco ding the new draft under

construction and resulting from the RHS actions applied so far initially it is a copy of bls

dss whose value is an FD enco ding the common DSS layer of the bls and the rls throughout the

application of the actions

adss Addition Deep Semantic Sp ecication whose value is the DSS FD enco ding the oating fact to

incorp orate to the draft for the current revision increment

Two p oints concerning the structure of this revision workspace deserve explanations The rst p oints is

the need for keeping the original input draft around in the bls during the application of the actions building

the new draft in the rls This is necessary b ecause as explained in Section most revision op erations

are nonmonotonic in order to accommo date a new oating fact they alter the linguistic realization of

the draft content Such an alteration is decomp osed into several lowlevel actions cutting and pasting the

linguistic constituents of the draft The bls buers the cut elements b efore they get pasted in a new lo cation

in the rls

The second p oint requiring explanation is the fact that the the bls and the rls share a common DSS layer

throughout the application of the revision actions This is the case b ecause these actions are implementing

the accommo dation of the new fact enco ded in the adss only at the SSS and DGS layers As mentioned in

Section incorp orating this new fact to the draft at the DSS layer only involves unifying the adss with

the DSS layer of the input draft The result of this unication is then placed in the toplevel dss attribute

during the initialization of the revision workspace To avoid duplication of content which may intro duce

inconsistency in such a nonmonotonic context the emb edded dss attributes inside the two threelayer FDs

the bls and the rls then b ecomes path to the toplevel dss attribute Each element in the adss also

b ecomes a p ointer to the corresp onding element under the toplevel dss attribute Thus b efore any action

is applied to the draft its DSS layer has already b eing revised These dierent mechanisms for implementing

revision at the DSS layer on the one hand unication and at the SSS and DGS layers on the other revision

rule application is motivated by the fact that the elements of a DSS are not group ed into constituents

whereas the elements of the SSS and DGS are Therefore the addition of a new element at the DSS layer

is necessarily a global op eration whereas at the SSS and DGS layer it is b est viewed as local to a particular

constituent I come back to this last p oint in the next section while describing the revision rule interpreter

There are ve typ es of RHS actions acting up on the revision workspace

addfd fd address inserts fd as a subFD under the path address inside the workspace FD fd

can b e either an atom or a structured FD but not a path

addpath path address inserts path as a subFD inside the workspace FD under the path address

6

cpfd inputaddress outputaddress relo cates the subFD lo cated under the path inputaddress

in the workspace FD and inserts that copy under the path outputaddress

delfd address DELetes the subFD under the path address in the workspace FD

mapfd inputaddress outputaddress mappingtype mappingconstraint

Relo cates the subFD lo cated under the path inputaddress in the workspace FD

Calls either the phrase planner if mappingtype deep or the lexicalizer if mappingtype

surf on that copy

6

cpfd stands for CoPy FD

defconj nominalization

lhs bls materialbasicrescluster

adss attrs results losstreakext

tool nominalization

rhs delfd

addfd surfsemcat rhetor

struct hypotax

root surfsemcat rhetor struct event root transfposs

sss

mapfd rls sss surf fills timerel

addpath rls partic affected partic possessor

cpfd bls sss rels score sss rels score

cpfd bls predmodif score predmodif score

cpfd bls sssroot args agent sssroot args agent

cpfd bls partic agent partic agent

cpfd bls sssroot args affected sssroot args affected

cpfd bls partic affected partic affected

mapfd adss attrs results

sssroot args possessed

deep

root onto indiv

mapfd rls sssroot args possessed partic possessed surf

Figure Enco ding the nominalization revision to ol in streak

Inserts the enriched FD resulting from this call under the path outputaddress in the workspace

FD

Replaces the input to the phrase planner resp ectively lexicalizer lling the dss resp ectively

sss attribute in the enriched FD by a path backp ointing to the input address from which it was

originally copied in order to avoid the intro duction of duplicates in the workspace

The insertions and relo cations p erformed by the actions ab ove are implemented by calling the sp ecial

extension of fuf for nonmonotonic pro cessing of FDs presented in Section B of App endix B To

illustrate the implementation of revision rules as FDs the fuf conjunction enco ding the nominalization

revision rules is given in Fig

The lhs of this conjunction lines sp ecies that nominalization can b e used to

Add a oating fact of typ e losing streak extension the pattern for this typ e of additional content in

given in the losstreakext conjunction at the b ottom of Fig lines

Add such fact onto a draft clause of typ e material which no other oating fact has yet b een attached the

pattern for this typ e of draft sub constituent is given in the materialbasicrescluster material

BASIC game RESult CLUSTER at the top of Fig

At the b eginning of the revision increment the rls for the whole draft is initialized as a copy of the

corresp onding bls The rst action of any nonmonotonic revision is thus to delete the matching draft

sub constituent which is in the case of nominalization the full verb clause to b e replaced by a supp ort

verb clause This is done by the action delfd on line For example this action would result in the

transformation of the sentence

ONeal scored p oints as Orlando defeated Toronto

Main Statistic Cluster Agent Ful l Verb Aected Score

into ONeal scored p oints as

defconj materialbasicrescluster

cat under clause

partic agent sss sss root args agent

affected sss sss root args affected

predmodif score sss sss rels score

sss surfsemcat under rhetor

struct under hypotax

root surfsemcat under encyclo

args agent surfsemcat under encyclo

dss dss args winner

affected surfsemcat under encyclo

dss dss args loser

dss deepsemcat under relation

role under beat

args winner deepsemcat under entity

concept under team

loser deepsemcat under entity

concept under team

rels score surfsemcat under encyclo

dss deepsemcat under entity

concept under score

defconj losstreakext

streakext

rels streakgeneltloser deepsemcat under relation

role under loser

defconj streakext

ents streak streak

streakgenelt deepsemcat under entity

concept under game

rels streakgenelt deepsemcat under relation

role under genelt

args set ents streak

genelt ents streakgenelt

streakext deepsemcat under relation

role under streakextension

args extension under toplevel

streak ents streak

defconj streak deepsemcat under entity

concept under streak

attrs card GIVEN

Figure Testing game result material clauses and losing streak extensions

The actions immediately following delfd involves building the toplevel of the replacement clause The

addfd on lines rst sp ecies a general transfer of p ossession event at the SSS layer The mapfd on line

then calls lexicalizer to map this structure at the DGS layer The FD fills timerels is passed as

a constraint indicating the relevant lexicalization context cases such as this in which the clause structure

to build is to app ear as a dep endent temp oral clause This prevents the lexicalizer from picking a mo o d

incompatible with this dep endency context This call to the lexicalizer and the following addpath action

result in the insertion of a comp osite materialp ossessive clause structure headed by a compatible supp ort

verb eg to hand

At this p oint the draft has b ecome

ONeal scored p oints as handed

Main Statistic Cluster Agent Support Verb AectedPossessor Possessed Score

The cpfd actions on lines copy the sub constituents of the initial clauses that are not left unchanged

by the nominalization At this p oint the draft has b ecome

ONeal scored p oints as Orlando handed Toronto

Main Statistic Cluster Agent Support Verb AectedPossessor Possessed Score

The mapfd action on lines calls the phrase planner to build the internal structure of the NP

expressing the losing streak fact to add to the draft The FD root onto indiv is passed as a

constraint indicating to the planner which ontological p ersp ective on this losing streak is appropriate in the

context of this revision In this case since the streak must b e realized by a nominal the right p ersp ective

is individual The nal mapfd action on line calls the lexicalizer on the SSS that the planner just built

The result is a DGS for a nominal expressing the additional streak fact and which lls the new possessed

role in the revised clause structure

The nominalization has now b een completed resulting in the nal draft

ONeal scored p oints as Orlando handed Toronto their th straight defeat

Main Statistic Cluster Agent Support Verb AectedPossessor Possessed Score

This revision rule example illustrates the fact that the mapfd action allows revision rules to remain

general indicating only where and how to attach the phrase realizing the new oating fact inside the draft

The internal content organization wording and syntactic form of this new phrase is chosen by the phrase

planner and the lexicalizer in the context of the revision when called via a mapfd action

The revision rule interpreter

The task of the revision rule interpreter is to p erform one increment of revision It thus takes as input a

threelayer FD enco ding the current draft called bls for Base Layered Sp ecication and a DSS FD enco ding

the new fact to incorp orate to the draft called adss for Additional Deep Semantic Sp ecication

The incorp oration is p erformed in two stages triggering a revision rule and applying it to the

draft I discuss these two stages in the following two subsections

Triggering a revision rule To insure the scalability of the reviser the set of revision rules

must b e as general as p ossible They must b e as abstract as the revision op erations they implement As

we have seen ab ove one way this is achieved is by enco ding in the revision rules only the attachment p oint

and metho d for the new fact and relying on indep endent mo dules the phrase planner and lexicalizer for

the realization of the new fact as a new linguistic constituent This way the same revision rule for example

adjoin of classifier into nominal can b e used for revision increments such as

Malone scored p oints and Sto ckton dished out assists as Utah defeated Denver initial

draft

Malone scored a season high p oints and Sto ckton dished out assists as Utah defeated Denver

rst revision increment

Malone scored a season high p oints and Sto ckton dished out a franchise record assists as

Utah defeated Denver second revision increment

even though for each increment the meaning and wording of the added phrase in b old is dierent

Another way to keep revision rules general is to have them sp ecify only the type of constituents onto

which a given typ e of new fact can b e attached indep endently of the location where these constituents may

app ear inside the linguistic structure of the draft This is again illustrated by the example ab ove where at

each round the same revision rule is applied but at dierent lo cations inside the draft structure

Relying on revision rules whose semantics are local is desirably mo dular but come with a cost triggering

a revision rule do es not simply involve searching the rule base for an appropriate rule but also searching the

draft linguistic structure for a constituent onto which to apply the rule

The outer lo op of the revision rule interpreter thus consists of a topdown depthrst traversal of the

input draft at the DGS layer For each constituent C encountered during this traversal the interpreter

builds a revision input an FD with three toplevel attributes

bls whose value is a threelayer FD enco ding the DGS SSS and DSS of C

adss whose value is the DSS FD enco ding the new fact to incorp orate to the draft

tool whose value is the name of a sp ecic revision op eration to use for the increment if such a

restriction was passed to the interpreter input as an optional input parameter

The format of this revision input matches that of a revision rule LHS given in the previous section

Searching for an appropriate rule is thus p erformed by unifying such a revision input emb edded under

the attribute lhs with the revision rule base for each constituent encountered during the traversal of the

draft linguistic structure When this unication fails it means that there are no revision rule available to

incorp orate the new fact to the particular constituent C reached at this p oint in the draft traversal The

revision rule interpreter then recurses on the sub constituents of C at the DGS layer building a new revision

input for each of them When this unication succeeds it results is an instantiated revision rule whose LHS

matches the revision input and whose RHS contains the revision actions to apply to C The control regime

of this search is simple the rst constituent onto which the new fact can b e incorp orated is always chosen

Similarly the rst revision rule whose LHS matches the revision input for a given draft constituent is also

always chosen If there is no match b etween any of the revision inputs built for each draft DGS constituent

and any revision rule LHS the input oating fact cannot b e added and the revision interpreter returns to

the revision monitor the input draft unchanged

Consider the rst revision in the example ab ove with the to ol adjoin of classifier into nominal

sp ecied in the input The draft traversal will successively attempt to apply this to ol onto the following

constituents

Malone scored p oints and Sto ckton dished out assists as Utah defeated Denver

Malone scored p oints and Sto ckton dished out assists

Malone scored p oints

Malone

scored

p oints

For attempts and unication with the revision rule base fails b ecause the draft constituent is not

of the syntactic category sp ecied in the input to ol nominal The fourth attempt fails due to a semantic

constraint the prop erty of a given entity can b e only attached to nominals referring to that very entity In

the example at hand the record nature of a statistic can only b e attached to a nominal referring to that

statistic It cannot b e attached to a nominal referring to a player even whose p erformance is summedup

by that statistic as shown by

A season high Malone scored p oints

Applying a revision rule The application of the instantiated rule starts with the building of

the revision workspace In Section we dened this workspace as an FD with four toplevel attribute

bls rls dss and adss Since the application of the instantiated rule must b e local to the draft sub con

stituent C that triggered the rule during the topdown traversal of the draft the bls attribute must b e

initialized as an threelayer FD representing C at all three layers of abstraction The rls attribute can then

b e initialized as a copy of this lo cal bls The dss attribute must remain global ie for the whole draft

however since as explained in Section there is no constituency at the DSS layer The emb edded dss

attribute inside b oth the bls and rls p oints to the particular facts inside the toplevel global dss feature

that C realizes at the DGS layer An example of initial revision workspace for the application of a rule to an

emb edded draft constituent is given in App endix C The address parameters of the revision actions inside

the example rule given in Fig should b e interpreted in the context of a workspace built local ly at the

level of the game result cluster

How should b e implemented in fuf such general revision rules that can b e applied lo cally at various

depths inside the draft linguistic structure The rst metho ds that comes to mind is to use relative paths

as the address parameters of the revision actions However as explained in Section B of App endix B

while such paths are p erfectly ne in an FG in an FD they can intro duce ambiguity We have also seen that

fuf can unify two FDs or one FD with an FG but not two FGs

Since the search for the revision rule to trigger is implemented as the unication of the draft with an FG

enco ding the revision rule base the draft can only b e enco ded as an FD and not as an FG Consequently it

cannot contain relative paths

To get around this diculty the reviser simulates the lo cal application of general rules by translating the

lo cal addresses written in the revision actions into global addresses inside the FD enco ding the whole draft

This global FD is the data structure onto which the actions are physically applied Consider for example

the application of the nominalization revision rule Fig to the game result cluster Orlando defeated

Toronto Since in the DGS of the whole draft sentence ONeal scored points as Orlando defeated

Toronto this game result cluster app ears as a temp oral circumstantial the address of the initial

deletion action on line of this revision rule is translated as circum time

The monitor controlling the revision pro cess The revision monitor is the highest level

comp onent of the reviser and controls the overall revision pro cess It works by rep eatedly p erforming the

following steps

Take the rst element F in the ordered list of oating facts to opp ortunistically incorp orate to the

1

draft

Attempt to incorp orate it to the current draft D by calling the revision rule interpreter on the pair

0

D F

0 1

a If a revision rule matches this input pair the interpreter returns a revised draft D expressing

1

the content of b oth D and F pro ceed to step b elow

0 1

b Otherwise the interpreter returns back D start over from step ab ove with the next element F

0 2

in the ordered list of oating facts F has b een ignored for lack of a revision rule to incorp orate

1

it to the draft

Extract the DGS layer from D and pass it to surge

1

Insp ect the resulting sentence measuring b oth its lexical length and syntactic depth

a If the length is b elow words and the depth b elow emb edding levels output the sentence

start over from step ab ove with the next element F in the ordered list of oating facts F has

2 1

b een successfully incorp orated to the draft

b Otherwise output a warning message indicating that the maximum sentence complexity has b een

reached and stop the revision pro cess F has b een ignored for lack of space

1

There are two remarks to b e made concerning this simple control mechanism First the list of oating

facts is actually a list of lists where each sublist contains a set of related oating facts This is the case b ecause

oating facts esp ecially historical ones need an anchor p oint to b e included into the draft For example the

record prop erty of an additional game statistic cannot b e incorp orated to the rep ort b efore the incorp oration

of the statistic itself These two oating facts will thus b e put in the same sublist Second b ecause the

revision interpreter always picks the rst draftsub constituent revisionrule pair that matches the

monitoring algorithm ab ove may in some cases pro duce a sentence that is suboptimal in the sense that it

may not contain the maximal numb er of oating facts that could b e t within the complexity limits When

the threshold is reached it may b e the case that had another draftsub constituent revisionrule pair

b een chosen for some earlier revision it would have resulted in a more compact form ultimately allowing

the addition of another oating fact without reaching the threshold The implementation of a backtracking

facility within the reviser that would avoid such situations It however has b een left for future work I come

back to this issue in Section

STREAK at work two example runs

Having seen how each comp onent of streak works separately in the previous three sections I now comment

two example runs showing how they work together to generate a summary The rst example run illustrates

how streak generates a complex lead sentence by rst pro ducing a simple sentence containing only the

obligatory xed facts and then incrementally incorp orating the complementary oating facts through a series

of revisions It also shows how streak controls the revision pro cess and decides when to halt it The second

example illustrates how streak takes into account the surface form of the current draft to cho ose which

revision op eration to use for incorp orating a new oating fact into the draft In this example streak rst

builds two dierent draft forms from the same set of input xed facts It then incorp orates the same input

oating fact to the dierent draft forms using a dierent revision rule in each case

In order to prevent this section from growing overlengthy further example runs illustrating other inter

esting asp ects of the system are given in App endix C This app endix contains

A set of draft building stages that illustrates the paraphrasing p ower enco ded in the phrase planner

and the lexicalizer by pro ducing a variety of draft forms from the same set of input xed facts

A set of parallel revision increments that illustrates the paraphrasing p ower enco ded in the reviser

by showing how the same oating fact can b e incorp orated into the same initial draft form by using

dierent revision rules resulting in a variety of revised draft forms

Another full run example that illustrates the lo cality of the revision rules by rep eatedly applying the

same rule at a dierent levels of the current draft structure for dierent revision increments during the

generation of a complex lead sentence

The fullest account of the two example runs presented in this section can b e found in that app endix

where they are rep eated but this time in conjunction with the FDs enco ding the complete semantic input

andor draft representation

Chaining revisions to generate a complex sentence from a simple draft

Example Run shown in Fig illustrates three asp ects of streak

How it generates a complex lead sentence by rst pro ducing a simple sentence containing only the

obligatory xed facts and then incrementally incorp orating the complementary oating facts through

a series of revisions

How it controls the revision pro cess and decides when to halt it

The variety of revision to ols it implements since a dierent one is used at each generation increment

Building an initial draft

In Fig the rst line contains calls to the toplevel functions for the draft and revision passes As input

the function draft takes three arguments

The DSS FD containing the input xed facts to convey in the draft In the example of Fig this

input DSS is built by calling the function dssF The co de of this function is given in Fig C of

App endix C

A partial SSS constraining the form of the output draft by sp ecifying some desired features in the

initial draft plan The phrase planner can only cho ose options that are compatible with this partial

sp ecication of its output This argument is optional and intro duced by the keyword sss These

surface form constraints are generated by calls to functions such as formflag describ ed b elow

A partial DGS constraining the form of the output draft by sp ecifying some desired features in the

initial draft skeletal lexicosyntactic tree The lexicalizer can only cho ose options that are compatible

with this partial sp ecication of its output This argument is optional and intro duced by the keyword

7

dgs

draft returns an threelayer FD enco ding the initial draft and as a side eect prints the natural language

sentence resulting from the unication of this FD with surge

The function revise takes three arguments

An threelayer FD enco ding the initial draft

A list of list of oating facts to attempt to incorp orate to the draft Each sublist contains a set of

related oating facts These sublists are in ordered by decreasing imp ortance of the facts they contain

Floating fact lists of lists are generated by calls to functions such as floatstackF The co de for this

function is shown in gures C to C of App endix C

The verbose flag which when set to T prints after each revision increment the lexical length

lexnum and syntactic depth depth of the revised draft

The value of the formflag parameter passed to the draft function in Fig is

rels cooccur none

time elts cdr car struct hypotax

As explained in Section an initial draft contains four xed facts the main statistic of a player from

the winning team the result of the game including its nal score its lo cation and its date Following the

corpus observations streak always convey the lo cation as a header The sentence itself thus contains three

main constituents one for each of the three remaining xed facts The form ag ab ove the typ e of rhetorical

relations that the phrase planner can use to group these three constituents The one ab ove sp ecies that

The draft must b e hyp otactically structured indicated by the presence of a rels feature at the top

level

Two dep endent constituents must b e group ed in an paratactic structure itself linked to the main

constituent by a temp oral relation indicated by the presence of a feature time elt under

the rels feature

The second element in this paratactic structure must b e itself hyp otactically structured indicated by

the emb edded struct hypotax feature

When unied with the p ossible toplevel draft structures observed in the corpus and enco ded in the

phrase planner of streak this sp ecication results in sentences like Draft at the top of Fig In such

7

The two optional argument sss dgs provide a basic facility to systematically test the paraphrasing p ower of streak

The development of more p owerful facilities has b een left for future work and is discussed in Section

revise draft dssF sss formflag floatstackF verbose T

Draft

Dallas TX Charles Barkley registered points Friday night as the

routed the Dallas Mavericks

Draft lexnum depth

Dallas TX Charles Barkley registered points Friday night as the Phoenix Suns

handed the Dallas Mavericks their th defeat in a row at home

Draft lexnum depth

Dallas TX Charles Barkley registered points Friday night as the Phoenix Suns

handed the Dallas Mavericks their franchise worst th defeat in a row at home

Draft lexnum depth

Dallas TX Charles Barkley registered points and Danny Ainge added

Friday night as the Phoenix Suns handed the Dallas Mavericks their

franchise worst th defeat in a row at home

Draft lexnum depth

Dallas TX Charles Barkley registered points and Danny Ainge came

off the bench to add Friday night as the Phoenix Suns handed the Dallas

Mavericks their franchise worst th defeat in a row at home

Draft lexnum depth

Dallas TX Charles Barkley matched his season record with points

and Danny Ainge came off the bench to add Friday night as the Phoenix

Suns handed the Dallas Mavericks their franchise worst th defeat in a row

at home

Draft lexnum depth

SSS DSS

Figure streak generating a simple draft and incrementally revising it into a complex one

sentences the main clause conveys the main statistic with a list of two dep endent constituents as a temp oral

adjunct The rst element of this list is the nominal conveying the date of the game and the second is the

clause conveying the game result

Revising the initial draft

The rst subelement in floatstackF is an historical background fact of typ e streak extension It notes

that the rep orted game marks the th time that Dallas is defeated on their home turf To add this fact to

Draft streak uses the nonmonotonic Nominalization revision rule whose co de was given and explained

in detail in section This rule is applied to the initial draft game result clause the Phoenix Suns

routed the Dal las Mavericks It replaces the full verb clause pattern WINNER rout LOSER conveying the

game result in Draft by the semantically equivalent supp ort verb clause pattern WINNER hand LOSER

a defeat in Draft Since the game result is now realized by an NP the expression of its consequence the

up dated length of Dallas losing streak can b e concisely conveyed by adjoining the discontinuous ordinal

th in a row mo difying the NP head defeat The restriction of this streak to home games is

conveyed adjoining another mo dier the PP qualier at home The variety of nominalization rule used

for this revision is thus Nominalization with Ordinal and Qualifier Adjoin

The second subelement in floatstackF is another historical background fact of typ e record breaking

It brings additional information ab out the preceding streak extension fact by noting that as a result of this

latest extension this streak is now of record length To add this fact to Draft streak uses the monotonic

revision rule Adjoin of Classifier to Nominal This rule is applied on the very nominal that was created

during the preceding nominalization revision their th defeat in a row at home mo difying it with the

classier franchise worst that expresses its record breaking nature This illustrates how the choice of a

revision rule for a given increment constrain in some cases the range of choices for subsequent increments

This typ e of Adjoin revision rule allows for a very concise expression of the added oating fact It do es not

change the syntactic depth of the draft and lengthens it by only two words After this addition the draft is

only words long still comfortably b elow the word limit observed in the corpus

The third subelement in floatstackF is a nonhistorical fact of typ e additional statistic To add this

fact to Draft streak uses the monotonic revision rule Coordinative Conjoin of Clause This rule was

applied to the main statistic clause Charles Barkley registered points b ecause this additional statistic

also concerns a player of the winning team Furthermore since they are b oth scoring p erformances streak

exploits this fact and cho oses to elide the head of the ob ject in the added conjoined clause resulting in the

phrase Danny Ainge added instead of Danny Ainge added points This illustrates how streak

opp ortunistically takes advantage of the particular draft context into which a oating fact is woven to cho ose

a more concise expression for that fact streak also uses this context to make the most appropriate lexical

choice as illustrated by the choice of the verb to add for this second statistic Such a verb can b e chosen

only in this particular context It would b e inappropriate for example to realize the main statistic for which

streak chose the more general verb to register in this particular run The co de for the choice of such

verbs was given in Section How the reviser passes such contextual information to the lexicalizer was

also explained in that section

The fourth subelement in floatstackF is nonhistorical fact of typ e player status Just as the second

oating fact underlined the signicance of the rst by conveying its record breaking nature this fourth fact

underlines the signicance of the third It notes that the player whose scoring statistic was just added to the

8

draft Danny Ainge added is a reserve player To add this fact to Draft streak uses the monotonic

revision rule absorb of clause into clause as a result adjunct Moreover it uses the sp ecialization

of this revision rule that involves the side transformation Agent Control This sp ecialization is chosen when

streak notices that b oth the absorb ed and absorbing clauses share the same agent It allows the agent

of the absorb ed one which was part of the original draft to b e deleted resulting in Danny Ainge came

o the bench to add instead of Danny Ainge came o the bench for Danny Ainge to add thus

opp ortunistically gaining space and uency This example illustrates the capability of streak to use idioms

such as the expression to come o the bench which conveys that a player is a reserve

8

Making the fact that he scored that many p oint all the more remarkable

The fth subelement in floatstackF is a historical fact of typ e record equalling It concerns the

main statistic To add this fact to Draft streak uses the revision rule adjunctization of created

into instrument It moves the ob ject of the main statistic clause that was lling the created role in that

9

clause to an instrument role in order to accommo date the added record as ob ject The equalling asp ect

of this record is expressed as the new main verb to match replacing the original verb to register The

action explicitly conveyed by this original verb is now implicitly conveyed by to match since matching

a record can only come as a consequence of a p erformance This example thus illustrates the ability of

streak to opp ortunistically take advantage of the addition of a new fact to gain space by making part of

the realization of another fact already in the draft implicit It also demonstrates how streak takes into

account stylistic conventions observed in the corpus Compare the addition of this fth oating fact with

the addition of the second one They b oth concern a record the dierence b etween them b eing that the

second fact expresses that a record was broken and the fth one that it was merely equal led This dierence

which could seem minor at rst triggers the use of entirely dierent revision to ols the monotonic Adjoin

for the second fact and the nonmonotonic Adjunctization for the fth one This dierence in strategies

implements the stylistic convention observed among sp orts or sto ck market writers that mentioning of a

record up date event without explicitly sp ecifying whether it is of the breaking or equalling typ e implies that

it is of the breaking typ e This convention allows streak to use the simple and concise revision to ol Adjoin

for record breaking events note how nothing in Draft sp ecies whether the th defeat of Dallas actually

breaks or merely equals their longest losing streak Using such an implicit form for record equalling events

as well would b e misguiding however The need to keep rep orts concise must b e balanced with the need to

keep these two typ e of events unambiguously distinguishable It is in order to b e explicit ab out the equalling

typ e of the record up date event added in the fth increment that streak uses the less concise and more

complex Adjunctization revision

After the addition of this fth oating fact the draft is only two words away from the maximum length of

observed in the corpus Thus unless the next subelement in floatstackF can b e added with only two

more words it will not t in this lead sentence summary This next subelement is an additional statistic

the passing p erformance of Danny Ainge The most concise way it can b e accommo dated in the draft is by

revising the nominal realizing the scoring statistic of this player that was added during the third revision

increment and which is already reduced to the cardinal numb er streak applies the revision rule

Coordinative Conjoin of Nominal to this nominal yielding to add and assists This revision

thus adds three new words in b old while not deleting any and thus pushes the revised draft over the

length limit streak thus halts the revision pro cess without printing the draft resulting from this nal

revision As nal value the revise function returns the threelayer FD representing the previous draft

which was under the complexity limits Since this FD is very large its b o dy is not shown in Fig but

its presence is signaled by the abbreviation SSS DSS The full detailed b o dy for this nal

draft representation is given in App endix C however

Choice of revision rule constrained by the surface form of the draft

Example Run shown in Fig illustrates how in streak the choice of a revision rule to add a given

oating fact onto the draft is sensitive not only to the content of the draft but to its surface form as well

This run starts with two calls to the function draft to build two alternative draft forms from the same input

but a with dierent realization constraint The common input is a DSS FD enco ding the four xed facts

to convey in b oth draft It is built by calling the function dssC For the rst call to draft the realization

constraint is formflag whose value was given in the previous section It constrains the output draftC

to express the game result as a ful l verb clause that follows the pattern WINNER fullverb LOSER and

is sub ordinated to the main statistic clause as a time adjunct In this particular run the full verb chosen

randomly by the lexicalizer is to beat

For the second call the realization constraint is formflag whose value is

rels cooccur none

9

cf sections B and B of App endix B for the denition of the thematic roles used in streak

time elts cdr car struct event

It constrains the output draftC to express the game result this time as a support verb clause following the

10

pattern WINNER supp ortverb LOSER nominal In this particular run the supp ortverbnominalhead

collo cation chosen randomly by the lexicalizer is to nail down a win

Once these two alternative synonymous draft sentences have b een built the function revise is then

called on each of them with the same additional oating fact adssC as second parameter revise is the

function to call for a single revision increment It implements the revision rule interpreter describ ed in

Section In contrast the function revise called for example Run is used for chaining revisions and

implements the revision monitor describ ed in Section revise works by traversing the list of lists of

oating facts and rep eatedly calling revise on each fact adssC enco des a losing streak extension for the

Boston Celtics To incorp orate this oating fact to draftC streak uses the revision rule Nominalization

In contrast to incorp orate this same oating fact on the synonymous but linguistically distinct draftC

streak uses the revision rule Adjunctization In each case the choice of one revision rule over the other

is motivated by the surface form of the resp ective drafts involved Nominalization realizes the new fact by

mo diers attached to a nominal resulting from the transformation of a fullverb clause into a supp ortverb

clause Adjunctization conversely replaces a supp ortverb clause by a fullverb clause incorp orating the

new fact by a fullverb and a new ob ject while displacing the original ob ject to an adjunct p osition Since

draftC follows a fullverb pattern only Nominalization and not Adjunctization is applicable to it For

draftC following a supp ort verb pattern it is just the opp osite There is no game result NP in draftC

to b e adjunctized and no game result full verb in draftC to b e nominalized It is precisely b ecause the

applicability of revision rules such as the two ab ove is dep endent on surface form the presence of the DGS

layer in the draft representation of streak is required

STREAK by the numb ers

streak was implemented on top of an extended fufsurge package In this extended package fuf

11

consists of K of commonlisp co de including K of entirely new co de for the nonmonotonic

manipulation of Functional Descriptions FDs and surge consists of K of fuf co de including K

of entirely new co de for the extended system for adverbial clause elements

These two extensions to the fufsurge package were essentially preparatory work paving the way for

the core implementation of streak This core consisted of implementing the revision rule interpreter and

enco ding the linguistic data compiled during the corpus analysis as three declarative knowledge sources

The phrase planning rule base

The lexicalization rule base

The revision rule base

The revision rule interpreter consists of K of b oth commonlisp and fuf co de Each rule base is

enco ded as a Functional Grammar FG As explained in App endix B an FG is a disjunction of conjunctions

of features with each feature p otentially a recursive disjunction of conjunctions of subfeatures An FG can

thus b e viewed as an andortree of options and the b est way to quantify the numb er of cases covered in an

FG is to count the numb er of disjunctionfree conjunctions that it contains Each such conjunction is only

one level ab ove the b ottom of the tree and corresp onds to a disjunctionfree rule The FG for phrase planning

enco des ab out such rules the one for lexicalization enco des ab out such rules covering the

12

various senses and thematic usages of op enclass words in the domain sublanguage and the one for

revision enco des ab out such rules

10

Like formflag formflag also constrains the game result clause to b e sub ordinated to the main statistic clause as

a time adjunct

11

Written by Elhadad

12

These words include neither prop er nouns and quantitative values which are passed from the input to the generator nor

function words which get added by surge

setf draftC draft dssC sss formflag Draft form

Hartford CT Karl Malone hit for points Friday night as the Utah

Jazz beat the Boston Celtics

SSS DSS

setf draftC draft dssC sss formflag Draft form

Hartford CT Karl Malone notched points Friday night while the

Utah Jazz nailed down a win against the Boston Celtics

SSS DSS

revise draftC adssC Revising Draft using Nominalization

Hartford CT Karl Malone hit for points Friday night as the

Utah Jazz brought the Boston Celtics their sixth consecutive setback at home

SSS DSS

revise draftC adssC Revising Draft using Adjunctization

Hartford CT Karl Malone notched points Friday night while the Utah

Jazz extended the Celtics homecourt losing streak to six with a win

against Boston

SSS DSS

Figure streak using dierent revision rules dep ending on the surface form of the draft

These three FGs resp ectively o ccupy K of fuf co de for the phrase planning rule base K of fuf

co de for the lexicalization rule base and K of fuf co de for the revision rule base The fact that the

lexicalization FG enco des several orders of magnitude more rules than the phrase planning FG while b eing

essentially of the same size illustrates the high level of co de compactness achievable by relying on the the run

time comp ositional instantiation through recursive functional unication of rules whose common features

are shared inside FG fragments

Chapter

Evaluation

In this chapter I quantitatively evaluate several asp ects of the research presented in this dissertation Dier

ent asp ects are evaluated along dierent dimensions using dierent test corp ora After surveying the general

problematic of evaluation in the context of language generation I rst describ e the use of two new basketball

rep ort corp ora to evaluate

The coverage of the corpus analysis results ontology realization patterns and revision to ols presented

in Chapter

The robustness of the corpus analysis metho dology presented in the same chapter as a knowledge

acquisition approach

The gains in terms of b oth coverage and robustness of the revisionbased approach to generation

presented in Section over the classic oneshot approach

I then describ e the use of a sto ck market rep ort corpus to evaluate the crossdomain p ortability of the

revision to ols presented in Section of Chapter

Evaluation and language generation

The diculty of the issue

Although a p otentially vast eld in its own right evaluation remains to date an almost completely untouched

area of generation research Among the landmark dissertations centered around the development of a gen

eration system only one contains an entire chapter sp ecically dedicated to evaluation Kukichs Kukich

Her system ana and Elhadads advisor I I Elhadad b are the only two generation systems that

have b een the ob ject of quantitative evaluation These evaluations are describ ed and contrasted with the

evaluation work presented in this thesis in Section

The paucity of evaluation eorts in generation is ro oted in the extremely challenging nature of the task

There are multiple reasons for this diculty notably

The variety in input representations target outputs and application domains of existing generation

systems limiting the p otential for comparative evaluations

The high sub jectivity of what constitutes a go o d text limiting the p otential for absolute qualitative

evaluations

The frequent unavailability of large systematic InputOutput test sets limiting the p otential for

quantitative evaluations

In the following subsections I review each of these diculties in turn I then explain how some of them

can b e alleviated when textual corp ora are available

The sub jectivity of text quality

Evaluating writing even human writing has always b een a thorny issue While humans routinely make

judgments ab out the quality of the prose they read these judgments are highly sub jective Even such

basic evaluation attempts as assigning a grade level to a text has b een the ob ject of endless controversies

This is due to the fact that judgments of written prose rely on a vast array of implicit and goaldep endent

criterion and draw up on many vague and intuitive notions Making these goals and criterion explicit and

rigorously dening these notions is an intriguing but daunting task It is however a prerequisite for the

absolute evaluation of the quality of texts whether generated by computers or humans in a systematic

and ob jective way

The variety of input representations and target output texts

1

Comparative evaluations are just as problematic as absolute ones

The rst diculty for comparing the resp ective merits of dierent generation systems is the nature of

their input With a few exceptions eg standalone p ortable syntactic comp onents generation sides of

transferbased translation systems rep ort generators working all the way from numb er tables the input to

a generator consists of a semantic representation of the content to convey sometimes annotated with com

municative intent No broadly accepted standard representation scheme for semantic content and intentions

has yet emerged from knowledge representation research Comparing generators working from drastically

dierent input representations is therefore a murky business

2

The second diculty is that until now no two generators have b een conceived and implemented to

pro duce the same set of target outputs Existing systems have b een develop ed for dierent domains and

in the rare cases when several generators exist in the same domain they use a dierent language eg

ana Kukich in English and frana Contant in French for the sto ck market domain Dierent

domains do not only mean dierent encyclop edic contexts but also dierent writing styles Whereas a

telegraphic style is wellsuited for a weather forecast it would b e completely inappropriate for a business

letter Similarly whereas a factpacked and highly metaphorical style is b est for newswire sp orts stories it

could not b e used for technical do cumentation Indeed it is interesting to note that the newswire summaries

of the corp ora I analyzed did not follow the stylistic heuristics that writing exp erts recommand for improving

the clarity of scientic pap ers For example supp ort verbs were p ervasive in the corpus and the sub ject head

noun was often separated from the verb by lengthy relative clauses or app ositions These two expressive

forms are strongly ob jected by Gop en and Swan for scientic writing This discrepancy in style can

b e explained by a discrepancy in goals In a science article the main goal is to intro duce the reader to

new complex concepts In a newswire story it is rather to pack information concerning to familiar simpler

concepts Since in the latter case the reader is not burdened with the complexity of the content she can

handle the more complex prose which allows tting many facts in a short space

The unavailability of systematic I test sets

Not only do generators lack a reference standard for input and output but systematic test sets of either are

not always available

Due to the overall complexity of the generation task many recent research eorts like the one presented

here have fo cused on restricted subtasks in order to attack them in depth These eorts thus resulted in the

1

At least those that concern particular systems and are carried out directly on output texts In Section I present a

comparative evaluation of two generation models carried out on the dierent know ledge structures which are used by each

mo del

2

To the b est of my knowledge

implementation of generator components as opp osed to complete text generation systems In such situation

some of the implemented comp onents will inevitably work from an input representation that comes in the

overall generation architecture from other comp onents whose task is out of the research scop e at hand and

thus unimplemented Evaluating such comp onents can only b e p erformed by using handco ded and thus

necessarily small input test sets

In some applications a systematic test set is not available not only for the input to the generator but for

its target output as well While some generation applications aim at saving the cost of using humans exp erts

with other pressing resp onsibilities from writing texts other applications aim at providing do cumentation or

summaries that are needed yet currently missing for lack of available writing manp ower In the latter case

there is no corpus of humanwritten target texts available onto which to test the output of the generator

The situation is similar for online help systems the very need for a generator resp onding dynamically to

a sp ecic dialog situation instead of simply displaying prestored text precludes the use of manual pages

as mo del texts For such applications a set of target resp onses have to b e initially handco ded and then

incrementally adjusted through user feedback and determining the quality of the generated resp onse requires

determining how well it meet the user needs Prop osed evaluations Hirschman and Cuomo often center

around time to task completion where the user is given a task that requires use of the system to solve Such

evaluations are problematic b ecause the ease with which a user can request information also aects the result

and thus do not allow assessing the resp ective eects of the understanding and generation comp onents

The opp ortunity created by corpus data

For generation applications where human generated text corp ora are available as mo del output sets setting

up an evaluation is much easier One reason why ana Kukich could b e quantitively evaluated was

that b oth its input numb er tables and target output newspap er articles were available in a systematic

way

Corpus data allows b etter circumscription of the goal of a generation system it is to pro duce texts that

match as closely as p ossible the texts from the corpus It allows viewing the writing pro cess as a black

b ox Such a view fo cuses on the objective task of observing what human writers generate in a given situation

and avoids the more subjective task of speculating why they do so The quality of a text pro duced by a

generation system can then b e dened in terms of how distinguishable it is from a text that a professional

writer would pro duce from the same input data When the corpus is large enough such an evaluation can

b ecome quantitative When mo del texts from multiple writers are available stylistic idiosyncrasies can b e

ltered out by enco ding in the generator only the most commonly used expressive forms A corpusbased

approach shifts the task of the evaluation from dening what constitutes a go o d text to assessing the

representativity of the textual corpus used as mo del For this latter task developing automatic or at least

semiautomatic metho ds is much more feasible

The rising interest in the issue

While evaluation has b een largely ignored by the generation community up to now it is b ound to b ecome a

central issue fairly quickly Two dierent sets of forces are at play in this emergence The rst set is internal

to the eld The second is external I briey discuss each in turn in the following paragraphs

Maturity of the eld

Natural language generation seems to have reached a turning p oint in its maturation After the initial

exploratory phase when most research consisted in discovering new problems and providing initial solutions

to them more eorts are now b eing put into taking a second and more fo cused lo ok at wellknown problems

to improve on the initial solution Since the contribution of such research do es not lie in the originality

of the problem itself new approaches must b e compared convincingly and advantageously with previous

ones This is encouraging the development of comparative evaluations It is also promoting quantitative

evaluations Even though their real signicance may at times b e questionable evaluations that can b e

summed up by a few numb ers tend to convince broader audiences than qualitative evaluations which require

deep understanding of subtle details

Inuence of related elds

The vogue of comparative and quantitative evaluations is also spreading from their p opularity in elds close

to generation namely language understanding and statistical NLP Events entirely dedicated to quantitative

evaluations such as the Message Understanding Conference MUC have now b een held for several

years in natural language understanding a eld that is older than generation and b eneciates from a larger

3

workforce Evaluation has b een a central issue from the start in statistical NLP a line of work aiming at

developing to ols that do not rely on any handco ded knowledge but are instead trained using weak learning

heuristics on very large textual corp ora This early emphasis on evaluation in statistical NLP is ro oted in the

two fundamental dierences that sets it NLP apart from its knowledgebased counterpart First statistical

NLP is exp erimental in nature while knowledgebased NLP is analytical This exp erimental avor where

simple approximations are quickly tried out on data allows to invest more time in evaluation than when

exact knowledge is painstakingly abstracted from indepth data analysis Second statistical NLP seeks to

attain wide coverage at the exp ense of accuracy while knowledgebased NLP opts for the other side of this

tradeo Inaccuracy b eing more immediately visible than brittleness the issue of evaluation b ecame crucial

more rapidly

Because of these fundamental dierences in goals b etween these two elds evaluation techniques that

originated in statistical NLP cannot b e used as is for knowledgebased NLP Both the ob ject and the

metho d of evaluation need to b e carefully adapted In terms of what should get evaluated eorts are b est

fo cused on the weak p oint of each eld Consequently while in statistical NLP evaluation schemes should

emphasize accuracy over robustness in knowledgebase NLP evaluation schemes should do just the opp osite

In terms of evaluation metho ds although textual corp ora can b e used in b oth cases each eld use these

corp ora in a dierent way In statistical NLP corp ora are input data from which to automatically extract

shallow approximations of deep knowledge Consequently corp ora need to b e very large in order to determine

whether the correlation b etween the prop osed shallow approximation of the deep knowledge and the deep

knowledge itself is statistically signicant In knowledgebase NLP corp ora are analysis material from which

to manually extract the deep knowledge itself Corp ora size requirements are thus greatly reduced since

with the help of human exp ertise deep knowledge can b e acquired from textual corp ora of mo dest size

The diversity of the issue

Having surveyed b oth the diculty and the growing imp ortance of the issue of evaluation in language

generation I now turn to its diversity I identify various dimensions along which dierent typ es of evaluations

can b e characterized Some of these dimensions concern the particular object of the evaluation while others

concerns the methods used for the evaluation I review each in turn in the following paragraphs

What to evaluate

Evaluating a generation system can involve evaluating either

The issues that it addresses or the solution that it provides

The underlying mo del on which it is based or its particular implementation

Its robustness or the accuracy of its output

3

It is imp ortant to keep in mind that the recent revival of the use of statistics in place of deep er knowledge in NLP research

has fo cused on applications such as translation summarization and information retrieval The rationale is not to use statistics

to perform either understanding or generation but to hop e that for such applications b oth tasks can b e somehow bypassed

How well andor robustly do es it p erforms the various subtasks of language generation content deter

mination content organization content realization

When evaluating a research prototyp e whether in language generation or not one must distinguish

b etween evaluating the imp ortance of the sp ecic issues addressed by the system and evaluating the solutions

that the system brings to these issues For example the p ercentages of corpus sentences with oating concepts

andor historical information presented in Chapter evaluates the imp ortance of these two issues These

p ercentages do not however evaluates the solution that streak oers to these issues

In evaluating a solution it is also imp ortant to distinguish b etween evaluating the underlying generation

model of the system and its particular implementation For example in Kukich Kukich quantitatively

evaluates the coverage of the generator ana which is a particular implementation of the general onepass

macro co ded generation mo del that she prop oses The mo del in itself is not quantitatively evaluated in its

abstract generality

A generator can b e evaluated directly in terms of its output or more abstractly in terms of the know ledge

structures that it relies on to pro duce this output It is also quite a dierent issue to evaluate the robustness

of a generator and to evaluate the accuracy of the texts that it pro duces These last two contrasts are related

For example a generator using canned text can pro duce texts that p erfectly mimic the corresp onding texts

generated by human writers for a small sample of input data However such a system will break down when

presented with input data outside of that small sample In contrast a generator relying on highly abstract

and comp ositional knowledge structures will b e robust enough to pro duce satisfactory texts from input data

outside the initial set of data that was used to acquire these structures

Robustness can b e decomp osed in dierent facets

Coverage how much of a given domains total sublanguage is covered by the enco ded knowledge

structures

Extensibility how many more such knowledge structures would b e needed to cover the whole sublan

guage

Portability how domain sp ecic are those knowledge structures

Similarly accuracy can also b e decomp osed in dierent facets

Syntactic dealing with the grammaticality of the generated sentences

Semantic dealing with the mapping from conceptual content to linguistic form

Lexical dealing with constraints among words such as collo cations

Discursive dealing with factors such as coherence and high level textual organization

Stylistic dealing with factors such as conciseness and readability

Interpersonal dealing with factors related to the intended reader

Inaccuracies of the rst three typ es ab ove and even more strongly for rst two tend to b e so blatant that

they are rarely even discussed in the research literature They are in general seen more as bugs than as

inaccuracies Consequently only a p olished version of a generator that is essentially free of such inaccuracies

is considered a complete implementation Most of the discussion on accuracy fo cuses on the last three facets

How to evaluate

Having reviewed the various asp ects of a generator that can b e evaluated I now survey the dierent metho ds

that can b e used for each of these asp ects Evaluation metho ds can b e either

Qualitative or quantitative

Absolute or comparative

Based on human judgments or on corpus data

Manual or automatic

An example of qualitative vs quantitative evaluations in the context of language generation can b e found

in Kukich p A qualitative evaluation of the syntactic coverage of ana is rst given listing

all syntactic forms observed in a corpus of sto ck market rep orts and sp ecifying which forms are enco ded

in the generator eg participial clauses and which are not eg innitive clauses It is followed by a

quantitative evaluation of syntactic coverage counting the prop ortion of corpus sentences containing only

constructs enco ded in ana

These two evaluations are absolute concerning only ana In Section of the present thesis I present

a comparative evaluation which contrasts the resp ective robustness of the twopass micro co ded generation

mo del on which streak is based with the onepass macro co ded generation mo del on which previous rep ort

generators such as ana Kukich are based

All the evaluations mentioned until now are based on corpus data One could also imagine evaluating

a generator by having a human exp ert eg a sp orts writer in the case of streak lo oking at a test set

of generated texts and counting the prop ortion of sentences that are accurate and complete and optimally

concise in the case of a summarization system This typ e of evaluation relying on one or several human

judges is commonly used in statistical NLP For example it has b een used to evaluate systems which compile

collo cations Smadja or form groups of semantically related adjectives Hatzivassiloglou and McKeown

However in the context of language generation this typ e of approach would not yield very interesting

results This is due to the fact that using canned text output accuracy can b e trivially achieved at the

exp ense of robustness which is the central issue for a generator as it is for most knowledge based systems

In addition such metho ds cannot b e automated Given the restricted accessibility of human exp erts purely

manual evaluation metho ds tend to b e impractical since they cannot b e rep eated on a regular basis to

measure the impact of gradual changes made during the life cycle of the system Semiautomatic corpus

based evaluation schemes seem thus preferable

The evaluation approach of this thesis

Having surveyed the dimensions along which evaluation can vary I now situate along these dimensions the

particular evaluations that I carried out for the research presented in this thesis

There are two ob jects of evaluation

The new twopass micro co ded language generation mo del presented in Chapter

The hierarchy of revision rules presented in App endix A and resulting from the corpus analysis pre

sented in Chapter

I thus evaluate the new typ e of deep know ledge structures required by this mo del as opp osed to the

output of the prototyp e system which relies on these structures to generate summaries I also evaluate

the generation model prop osed in this thesis as opp osed to the particular implementation of this mo del in

the streak prototyp e I carry out three separate evaluations Each one measures a dierent asp ect of

robustness coverage and extensibility of the revisionbased generation mo del and p ortability of the revision

rules

The fact that the evaluation concerns the know ledge structures needed to generate the texts of the

output texts themselves is crucial since as noted ab ove a generator using canned text can pro duce texts

that p erfectly mimic the corresp onding texts generated by human writers for a small sample of input data

However such a system will break down when presented with input data outside of that small sample In

contrast a generator which relies on abstract comp ositional knowledge structures should b e robust enough

to pro duce satisfactory texts from input data outside the initial set of data that was used to acquire these

structures The question is how robust

The second key feature of this evaluation is that it concerns a general model and not a particular im

plementation It is this feature that sets this evaluation apart from previous evaluation work and make its

results of considerably wider relevance As noted in section Kukich evaluates the coverage of the

system ana a particular implementation of the onepass macro co ded generation mo del that she prop oses

While necessary for the development of a realworld system the results of this typ e of evaluation primarily

dep end on the amount of eort dedicated to handco ding the particular knowledge structure needed by the

implementation in the case of ana a lexicon of sto ck market phrases But since this eort would have

to b e duplicated for each new domain such results provide little insights on the general applicability of the

underlying generation metho dology Such insights are gained only by evaluations p erformed at the mo del

level indep endently of a particular implementation

I what follows I evaluate streaks underlying model in a way that is

Quantitative

Based on corpus data

Both absolute and comparative

Semiautomatic

In the following sections I describ e an adaptation of the traditional trainingtest corpus scheme for

quantitative evaluation to the corpus analysis for generation knowledge acquisition presented in Chapter

Considering the initial basketball rep ort corpus on which this analysis was p erformed as the training corpus

I analyze two other basketball rep ort corp ora and a sto ck market rep ort corpus as the test corp ora Since

the initial corpus did not literally serve as input data for automatically training a system but instead

as analysis data for manually acquiring the systems knowledge it is more appropriate to refer to it as the

acquisition corpus than as the training corpus

Using this approach I present three distinct evaluation eorts The rst denes a set of parameters

assessing the limit in coverage of the whole target sublanguage of streaks application domain that can b e

attained by analyzing a one year sample of the sublanguage This rst evaluation is comparative Dierent

parameters are used for measuring the impact on such coverage limit of relying on the knowledge struc

tures resp ectively needed by a twopass micro co ded generator such as streak and a onepass macro co ded

4

generator such as ana or semtex

The second evaluation denes a set of similar parameters but this time measuring the extensibility of

the resp ective approaches It is also comparative The coverage parameters answer the question with the

knowledge structure acquired by analyzing a year sample of the sublanguage how many sentences from a

dierent year sample can a system which relies on these knowledge structures generate In contrast the

extensibility parameters answer the question how many new knowledge structures would the generator need

in order to also fully cover a dierent year sample

The third evaluation do es not directly concern the new generation mo del prop osed in this thesis but

instead the new typ e of linguistic knowledge that is needed for implementing the mo del revision rules to

incorp orate additional content in simple draft sentences This last evaluation is absolute It estimates the

prop ortion among the revision rules acquired for the initial sp orts application domain that are usable in

another domain nance This third evaluation measures the portability of those rules

Obtaining the evaluation parameters for all three evaluations required rep eating for each test corp ora

most of the corpus analysis steps p erformed on the acquisition corpus This task was partial ly automated

by approximating the source and target realization patterns of each revision rule by a regular expression

of words and partsofsp eech tags All the test corpus sentences matching a given expression were then

automatically retrieved a software to ol called crep cf App endix D and Duford sp ecically designed

for this purp ose Filtering out the incorrect matches resulting from imp erfect approximations was then done

by manual p ostedition

While I do not quantitatively evaluate streaks implementation directly several factors in its design

nonetheless contribute to the high accuracy of the text it generates

4

The result of this evaluation thus do not directly apply to the onepass yet micro co ded generators based on the MeaningText

Theory

The fact that streak uses surge as a syntactic front end guarantees that it is syntactical ly accurate

surge comes with a large set of test inputs to b e run after each change to the grammar This input test

set systematically prob es each branch of the grammar and many combinations of features from dierent

branches The initial input set was incrementally created over seven years during the development of

the generation systems comet McKeown et al McKeown et al cook Smadja and

McKeown and advisorI I Elhadad b The extension of the input set which tests the

extensions from from surge to surge are given in Section B of App endix B

The fact that streak relies only on corpusobserved phrase planning rules lexicalization rules and

revision rules insures that it is essentially semantical ly and lexical ly accurate Each individual syn

tactic construct and vo cabulary item that streak can use to express each domain concept has b een

5

empirically observed in several corpus sentences for the expression of that very concept

The fact that streak monitors corpusobserved limits on the total numb er of facts words

and syntactic emb eddings in a single sentence insures that it is fairly stylistical ly accurate

This enforcement of stylistic accuracy could b e improved by observing in the corpus and monitoring in

the system ner grained complexity limits within sentence sub constituents In addition more testing work

would b e required in order to ful ly guarantee streaks semantic and lexical accuracy This future work is

discussed in Section

Quantitative evaluation of robustness

Review of acquisition corpus results

In Chapter I presented the analysis of one year of basketball summaries from the UPI newswires The

negrained part of the analysis was fo cused on the lead sentences conveying information from only the four

most common semantic classes

Final result and score of the game

Streak extending or interrupting nature of the result

End of the game statistics for players or teams

Record breaking or equalling nature of these statistics

These sentences were made of two toplevel syntactic constituents one grouping facts of classes and

called the gameresult cluster and another grouping facts of classes and called the statistics

cluster

The negrained analysis consisted of identifying the knowledge structures needed for the development

of a system generating such sentences Among these structures

three typ es would b e needed by any generator regardless of its architecture

Semantic classes of content units eg winning streak extension

Combinations of such classes with tokens coo ccurring in a corpus cluster eg game result winning

streak extension

Realization patterns for these combinations cf Fig D p in App endix D for an example pattern

for the game result winning streak extension combination

The fourth typ e revision tools to pro duce complex realization patterns from basic ones eg Adjunc

tization of Range into Instrument to add a winning streak to a game result is sp ecic to the draft and

revision generation architecture prop osed in this thesis

5

As explained in Section the p ossibility however remains that in some untested cases streak may pro duce unfelicitous

compositions of those individually accurate constructs

Evaluation goals and test corp ora

The evaluation had four main goals

6

Estimate how much of the whole sublanguage in a given domain is captured by a oneyear analysis

coverage

7

Estimate how many more knowledge structures need to b e abstracted to keep up with the new textual

data available each year extensibility

Estimate how much is gained in b oth coverage and extensibility by using revision to ols instead of

realization patterns as the knowledge structures on which to base the implementation

Estimate how fast iteratively larger newswire samples converge towards the whole sublanguage

The rst test corpus used for these estimates consisted of lead sentences from the season game

summaries These sentences satised the same semantic restrictions as for the season acquisition

corpus no historical facts other than records and streaks no nonhistorical facts other than end of the game

statistics plus a new one that they contain at least one historical fact This further restriction was added

to reduce the test corp ora to manageable size while preserving the most original topic of the thesis the

expression of historical information in the scop e of the evaluation The semantic ltering of these sentences

was done manually while compiling the rep orts from the news reader Phrasal patterns and revision rules

which were observed in the acquisition corpus ie the season only in sentences with no historical

facts were excluded from the scop e of this evaluation This insured that the semantic additional restriction

ab ove did not bias the results

In order to test whether the sublanguage captured by analyzing successive rep orting seasons quickly

converges toward the whole domain sublanguage I rep eated the evaluation using similarly semantically

restricted lead sentences from a third season For this second round of evaluation the acquired

corpus consisted of sentences from the rst two seasons and the test corpus of sentences from the third

season

Evaluation parameters

In this section I dene the evaluation parameters estimating the coverage and extensibility of the knowledge

structures abstracted from the acquisition corpus For most parameters a dierent denition is needed

for on the one hand the traditional oneshot generation mo del where sentences are pro duced all at once

from the whole conceptual representation of their content and on the other hand the new revisionbased

generation mo del prop osed in this thesis where optional content is added incrementally to an initial simple

draft conveying only obligatory content

Coverage parameters

I rst distinguish b etween four typ es of coverage

Conceptual coverage with resp ect to individual concepts only It measures the prop ortion of test corpus

sentences not generable with the knowledge structures from the acquisition corpus due to the presence

8

of new concepts

Clustering coverage with resp ect to concept combinations only It measures the prop ortion of test

corpus sentences not generable with the knowledge structures from the acquisition corpus due to the

presence of a new combination of known concepts

6

In the example at hand basketball

7

By keep up here I mean providing full coverage of the observed data

8

In the context of this evaluation new is opp osed to known and means not already observed in the acquisition corpus

Paraphrasing coverage with resp ect to realization patterns only It measures the prop ortion of test

corpus sentences not generable with the knowledge structures from the acquisition corpus due to the

presence of a new linguistic forms for expressing known concept combinations

Realization coverage with resp ect to all knowledge structures It measures the total prop ortion of test

corpus sentences not generable with the knowledge structures from the acquisition corpus whatever

the cause

Recall from Section that an imp ortant nding of the acquisition corpus analysis was that the statistic

and result clusters were indep endent ie neither the content nor the form of one inuences those of the

other This observation allows us to compute each coverage parameter separately for the statistic cluster

and the result cluster and then derive the coverage for whole sentences as the pro duct of the two

Conceptual coverage With resp ect to individual concepts there are only two p ossible situa

tions for a test corpus sentence it conveys an instance of a new concept or it contains only instances

of known concepts As shown in g the test corpus T can thus b e partitioned in two C comprising

u

the sentences in situation and C comprising sentences in situation Conceptual coverage V is then

k c

simply dened as the size of C divided by the size of the test corpus

k

Denition Conceptual coverage

car dC

k

V

c

car dT

Clustering coverage With resp ect to concept clustering the situation is a little more com

plex This is due to the fact that a concept cluster can b e new for two dierent reasons b ecause it

includes a new concept or b ecause it groups concepts which though individually known had never b een

seen clustered together b efore Conceptual coverage already accounts for the rst case To keep clustering

coverage indep endent from conceptual coverage sentences conveying instances of new concepts must b e ex

cluded from the computation of clustering coverage As shown in g the remaining part of the test

corpus C can then b e partitioned in two G comprising the sentences clustering concepts into known

k k

concept clusters and G comprising the sentences clustering known concepts in a novel way For oneshot

u

1

generation clustering coverage V can then b e dened as the size of G divided by the size of C

k k

g

Denition Clustering coverage for oneshot generation

car dG

k

1

V

g

car dC

k

This simple denition is valid for oneshot generation b ecause in that framework the choice of what con

cepts to combine in a given sentence is separate from the choice of linguistic form for a given combination

All sentences whether simple or complex are built using concept combination rules which are indep en

dent of linguistic form considerations Computing clustering coverage therefore do es not involve examining

realization patterns capturing the alternative linguistic forms

This is not the case for revisionbased generation where complex sentences are built in part using revision

rules which incorp orate b oth concept combination and linguistic realization knowledge During revision new

concept combinations and new realization patterns are simultaneously created What b ecomes relevant with

resp ect to a concept combination is thus no longer whether it is new or known but rather whether is it

derivable from a known basic combination via known revision rules

If a concept combination was seen in the acquisition corpus then revision rules to derive it from a basic

combination have b een abstracted Therefore all known concept combinations are derivable using known

basic combinations and known revision rules However some unknown concept combination may happ en

to b e also derivable from known basic combinations via known revision rules Consider for example the

following situation T

With New Concept(s)

Cu Ck

With New Combo(s) With New Combo(s) (of known concepts) (of known concepts) but all Derivable from Non−Derivable from

Known Sub−Combo Known Sub−Combo

via Known Revision(s) via Known Revision(s)

G Gk u Only Known u u Concept(s)

Ck

Only Known Combo(s) Gk

Test Corpus Test Sub−Corpus

Figure Test corpus partitions for conceptual and clustering coverage

The test corpus contains a sentence S where the concept combination C gameresult winning

D D

streak losingstreak is realized by a pattern P

D

C was not seen in the acquisition corpus

D

But two subcombinations of C C gameresult winningstreak and C gameresult

D B C

losingstreak were seen in the acquisition corpus resp ectively realized by patterns P and P

B C

In such situations with oneshot generation no concept combination rule for C would have b een

D

abstracted from the acquisition corpus and S would thus not b e covered With revisionbased generation

D

the following knowledge structures would have b een abstracted from the acquisition corpus

A basic realization pattern P for the singleton gameresult

A

A revision rule R to derive P from P

B B A

A revision rule R to derive P from P

C C A

If applying either R on P or R on P yields P then S can b e generated and C is thus derivable

C B B C D D D

Otherwise S cannot b e generated and C is thus not derivable

D D

As shown in g G therefore needs to b e subpartitioned into

u

k

the set G of test corpus sentences clustering concepts in a novel way but using a realization pattern

u

derivable from a known basic pattern via known revision rules

u

the set G of test corpus sentences clustering concepts in a novel way and furthermore using a realization

u

pattern not derivable from a known basic pattern via known revision rules

Because oneshot generation and revisionbased generation use distinct knowledge structures to cluster

concepts a distinct denition of clustering coverage is needed for each While clustering coverage for oneshot

generation was dened cf denition ab ove as the prop ortion of test corpus sentences conveying known

combinations of known concepts for revisionbased it needs to b e dened as the prop ortion of test corpus

sentences conveying concept combinations derivable from a known basic combination via known revision

rules among those not conveying any new concepts

Denition Clustering coverage for revisionbased generation

k

car dG car dG

k

u

r

V

g

car dC

k

Paraphrasing coverage Conceptual coverage measures the ability to cop e with the app ear

ance of new concepts while clustering coverage measures the ability to cop e with the app earance of new

ways to combine known concepts Paraphrasing coverage has yet another task measuring the ability to cop e

with the app earance of novel ways to linguistically express known combinations of known concepts This is

the case for example if the only expression observed in acquisition corpus for the concept combination C

B

scoringplayerNpoint reserveplayer was of the realization pattern PLAYER came o

the b ench to score N p oints and then the alternative realization pattern PLAYER scored N p oints o the

b ench is seen in the test corpus Test corpus sentences conveying new concept combinations must therefore

b e left out of the computation of paraphrasing coverage

As shown in g the remaining part of the test corpus G can then b e partitioned in two P com

k k

prising the sentences realizing known concept combinations using known linguistic forms and P comprising

u

the sentences realizing known concept combinations in a novel way Paraphrasing coverage for oneshot

1

generation V can then b e dened as the size of P divided by the size of G

k k

p

Denition Paraphrasing coverage for oneshot generation

car dP

k

1

V

p

car dG

k Ck

With

New Combo(s) of

Known Concepts

G G k u

New Realization Pattern New Realization Pattern Derivable from Non−Derivable from Known Basic Pattern Known Basic Pattern via Known Revision(s) via Known Revision(s)

Only k u

Known Combo(s) Pu Pu of Known Concepts G

k Known Realization Pattern

Pk

Test Sub−Corpus Test Sub−Sub−Corpus

Figure Test corpus partitions for paraphrasing coverage

If a realization pattern is known then a revisionrule to derive this pattern from a known basic one

is also necessarily known However an unknown realization patterns may or may not b e derivable from a

known basic pattern via known revision rules As shown in g P therefore needs to b e subpartitioned

u

k u

into P comprising sentences with derivable patterns and P those with underivable patterns Paraphrasing

u u

S

k r

divided by the size of P can then b e dened as the size of P coverage for revisionbased generation V

k

u p

G

k

Denition Paraphrasing coverage for revisionbased generation

k

car dP car dP

k

u

r

V

p

car dG

k

Realization coverage Conceptual clustering and paraphrasing coverages were delib erately

dened indep endently of each other to evaluate dierent tasks in the generation pro cess and identify which

task is the b ottleneck of the whole pro cess It is also necessary however to evaluate the generation pro cess

as whole For this purp ose I dene realization coverage simply as the prop ortion of test corpus sentences

generable using the knowledge structures abstracted in the acquisition corpus

In oneshot generation a sentence is generable i it conveys a known combination of known concepts

1

using a known realization pattern Realization coverage for oneshot generation V can therefore b e simply

r

dened as the size of P divided by the size of the test corpus

k

Denition Realization coverage for oneshot generation

car dP

k

1

V

r

car dT

In revisionbased generation a sentence is generable i it falls in either one of the following three cate

gories

It conveys a known combination of known concepts using a known realization pattern ie it b elongs

to P

k

It conveys a known combination of known concepts using a realization pattern that is new yet derivable

k

from a known basic realization pattern using known revision rules ie it b elongs to P

u

It conveys a combination of known concepts that is new yet derivable from a known basic combination

k

via known revision rules ie it b elongs to G

u

S S

k k r

G P Realization coverage for revisionbased generation V can therefore b e dened as the size of P

k

u u r

divided by the size of the test corpus

Denition Realization coverage for revisionbased generation

k k

car dP car dP car dG

k

u u

r

V

r

car dT

Extensibility coverage

Coverage parameters measure the prop ortion of the test corpus sentences accounted for by the knowledge

structures abstracted from the acquisition corpus In contrast extensibility parameters measure the prop or

tion of new know ledge structures needed to cover the whole test corpus In other words coverage estimates

how go o d a job one could do without additional work whereas extensibility estimates how much more work

is needed to do a p erfect job

Having distinguished four typ es of coverage I likewise distinguish four typ e of extensibility

conceptual extensibility with resp ect to concepts only

clustering extensibility with resp ect to concept combinations only

paraphrasing extensibility with resp ect to realization patterns only

realization extensibility with resp ect to all knowledge structures

Collectively these extensibility parameters dier from the coverage parameters in that they measure

prop ortions of types instead of prop ortions of tokens An uneven distribution of tokens of dierent typ es

can lead to very dierent values for the corresp onding coverage and extensibility parameters Supp ose for

example that concepts were observed in the acquisition corpus and concepts were observed in the test

corpus with elements common to b oth sets Then conceptual extensibility is only However if the

common elements are also the most common it is p ossible that for example they are the only concepts in

out of the sentences making up the test corpus Then conceptual coverage is

Individually each extensibility parameter is dened to mirror the corresp onding coverage parameter at

the typ e level The relation b etween coverage parameters pictorially represented by g and therefore

hold as well for extensibility parameters In particular

Clustering extensibility is dened indep endently of conceptual extensibility

Paraphrasing extensibility is dened indep endently of b oth clustering and conceptual extensibility

Clustering paraphrasing and realization extensibility are dened dierently for oneshot generation

and revisionbased generation

Also likewise coverage parameters extensibility parameters are dened as the ratio b etween the cardi

nal of two sets Each extensibility parameter diers from its coverage parameter counterpart is that the

sets concerned are sets of linguistic structures instead of sets of corpus sentences The seven extensibility

parameters are therefore dened as follows

Denition Conceptual extensibility

N

c

X

c

T

c

where

N Number of new concepts observed in the test corpus

c

T Total number of concepts observed in the test corpus

c

Denition Clustering extensibility for oneshot generation

N

g

1

X

g

T

g

where

N Number of new combinations of known concepts observed in the test corpus

g

T Total number of combinations of known concepts observed in the test corpus

g

Denition Paraphrasing extensibility for oneshot generation

N

p

1

X

p

T

p

where

N Number of new realization patterns for known concept combinations observed in the test corpus

p

T Total number of realization patterns for known concept combinations observed in the test corpus

p

Denition Realization extensibility for oneshot generation

N N N

c g p

1

X

r

T T T

c g p

Denition Clustering extensibility for revisionbased generation

g

N

r r

X

p

g

B T

r

where

g

Number of new revision rules needed to derive from the basic concept combinations the new combina N

r

tions of known concepts observed in the test corpus

and

B Number of basic concept combinations

g

T Total number of revision rules needed to derive from the basic concept combinations al l the combina

r

tions of known concepts observed in the test corpus

Denition Paraphrasing extensibility for revisionbased generation

p

p

N N

r

r b

X

p p

p

T T

r

b

where

p

N Number of new basic realization patterns

b

p

N Number of new revision rules

r

together needed to derive the new realization patterns for known concept combinations observed in the test

corpus

and

p

Total number of basic realization patterns T

b

p

T Total number of revision rules

r

together needed to derive al l the realization patterns for known concept combinations observed in the test

corpus

Denition Realization extensibility for revisionbased generation

N N N

c b r

r

X

r

T T T

c b r

where

N Number of new basic realization patterns

b

N Number of new revision rules

r

together needed to derive the new realization patterns observed in the test corpus

and

T Total number of basic realization patterns

b

T Total number of revision rules

r

together needed to derive al l the realization patterns observed in the test corpus

Partially automating the evaluation

Obtaining the evaluation parameters dened in the previous section required rep eating for b oth test corp ora

most of the corpus analysis steps p erformed on the acquisition corpus namely

List the domain concept combinations observed in the corpus clusters

List the observed realization patterns of each such combination

Identify the revision to ol to pro duce each such pattern from another simpler pattern

For the initial acquisition corpus analysis each step was p erformed entirely by hand To avoid rep eating

this long and tedious pro cess twice over in its entirety I lo oked for substeps with p otential for automation

During each evaluation round each step was decomp osed into two parts

a Recognize the test corpus sentences corresp onding to usage of a structure ie resp ectively a

content combination a realization pattern or a revision to ol already abstracted from the acquisition

corpus

b Dene and classify new structures for the remaining test corpus sentences

The rst part is a verication task while the second part is a discovery task The only go o d candidates

for automation are verication tasks for verication purp oses semantic information can b e approximated

by known lexicosyntactic patterns In contrast discovery tasks that involve semantic analysis can hardly

b e automated b efore observing the rst lexicosyntactic realization of a semantic message there is no way

to enco de the various linguistic expressions of that message as a mark of its usage In our present context

this means that only substep a could b e automated

To address this need I initiated and sup ervised the development of crep Duford a system that

retrieves in a corpus all the sentences containing a lexicosyntactic pattern sp ecied as a regular expression

of words andor partofsp eech tags crep was implemented by Duford A brief selfcontained presentation

of this software to ol is given in App endix D This app endix also contains detailed implementationlevel

examples of its usage to partially automate and sp eedup the corp ora analyses that underlied b oth the

development of streak and the evaluation of the draft and revision generation mo del on which it is based

In what immediately follows I explain at a more intuitive level the role of crep for partially automating

substep a ab ove for b oth evaluation rounds

crep was rst used to compute the prop ortion of clusters in the rst test corpus corresp onding to usage

of realization patterns abstracted from the acquisition corpus Each realization pattern was enco ded as a

crep expression Recall from Section that realization patterns abstract away from domain references

sp ecic lexical items and lowlevel syntactic variations to capture the mapping from a concept combination

onto a particular syntactic structure They sp ecify the syntactic category used to express each concept and

the structural dep endencies b etween these categories The pro cess of enco ding a realization pattern as a

crep expression is presented in Section D of App endix D

The resulting crep expressions were incrementally rened and tested rst on the acquisition corpus

It is only after these expressions yielded exactly the same results as those of the manual analysis on the

9

acquisition corpus that they were run on the rst test corpus For such a systematic run the crep package

includes a sp ecial shell taking as input a le where each expression E corresp onding to a given realization

pattern is paired with a le name F For each E F pair this shell redirects the test corpus sentences

matching E into F It also redirects the sentences matching none of the expressions into a nomatch le

Manual analysis of the nomatch le is then required to get the values of the evaluation parameters

Two systematic crep searches were indep endently p erformed on the rst test corpus One with expres

sions for statistic cluster realization patterns and the other with expressions for result cluster realization

patterns After these searches the presence of a cluster in one of the two resulting nomatch les has three

p ossible causes

9

Some result discrepancies b etween the manual analysis and the crep expression runs on the acquisition corpus uncovered

errors in the former

The cluster follows a new realization pattern for a concept combination already seen in the acquisition

corpus ie it is a memb er of P

u

The cluster combines concepts individually seen but never clustered together in the acquisition corpus

ie it is a memb er of G

u

The cluster contains a new concept not seen in the acquisition corpus ie it is a memb er of C

u

The rst step in analyzing each nomatch le thus consisted in partitioning it into these three subles

P G and C From the numb er of elements in each of these les and in the union P of all nomatch les

u u u k

the rst round values of the coverage parameters for oneshot generation were then derived using denitions

10

and

During the pro cess of partitioning each nomatch le into the corresp onding P G and C les the

u u u

classications of concepts concept combinations and realization patterns were extended to the new items

encountered in each of these three les From the numb er of new elements in each of these three classica

tions the rst round values of the extendibility parameters for oneshot generation were then derived using

denitions to

The extension of these three classications was also a prerequisite to the second round of evaluation

where the acquisition corpus b ecame the union of original acquisition corpus and of the rst round test

corpus For this second round a new crep expression with accompanying extensions in the denition le

was written for each new realization pattern identied during the rst round These expressions were rst

tested on the new acquisition corpus consisting of the original acquisition corpus plus the rst test corpus

Batch crep runs using b oth the rst year expressions and the new expressions were then p erformed on the

second year corpus Finally the nomatch les pro duced by these second round runs were in turn analyzed

This analysis yielded the second round values of b oth coverage and extensibility parameters for oneshot

generation

Computing the evaluation parameters for revisionbased generation required further analysis of the no

u k

match les in particular to subpartition the nomatch suble G into G and G and the nomatch suble

u

u u

u k

P into P and P This partitioning which needed to b e rep eated for each evaluation round and each

u

u u

cluster typ e involved identifying the surface decrement of each new realization pattern abstracted from the

two test corp ora Recall from Section that the surface decrement of a realization pattern P is another

t

realization pattern P that is syntactically closest to P among all the patterns conveying exactly one less

t

d

content unit than P The surface decrement of a new pattern identied during either evaluation round was

t

searched for among the nal set of realization patterns obtained at the end of the second round Each new

surface decrement pair was then checked against the revision to ol abstracted from the acquisition corpus

For this verication there were four p ossible outcomes

The new realization pattern resulted from the new application of a revision to ol abstracted from the

acquisition corpus on a base pattern also abstracted from the acquisition corpus

The new realization pattern resulted from the application on new base pattern of a revision to ol

abstracted from the acquisition corpus

The new realization pattern resulted from the application of a new revision to ol on a base pattern

abstracted from the acquisition corpus

The new realization pattern resulted from the application of a new revision to ol on a new base pattern

k

The rst outcome means that the sentences using this pattern are in G if they convey a new combination

u

k

of concepts and in P otherwise The last three outcomes mean that the sentence using this pattern are in

u

u u

G if they convey a new combination of concepts and in P otherwise From the numb er of elements in each

u u

of these four nomatch subles b oth rounds of coverage parameters for revisionbased generation were then

derived using denitions and

During this further analysis of the crep run nomatch les the classication of b oth the base patterns

and the revision rules were extended to the new items encountered From the numb er of new elements in

10

See Section for the denition of each evaluation parameter

Conceptual Clustering

oneshot w revision gain

statistic clusters round

result clusters round

whole sentences round

statistic clusters round

result clusters round

whole sentences round

statistic clusters average

result clusters average

whole sentences average

statistic clusters 

result clusters 

whole sentences 

Table Conceptual and clustering coverage

Paraphrasing Realization

oneshot w revision gain oneshot w revision gain

statistic clusters round

result clusters round

whole sentences round

statistic clusters round

result clusters round

whole sentences round

statistic clusters average

result clusters average

whole sentences average

statistic clusters 

result clusters 

whole sentences 

Table Paraphrasing and realization coverage

b oth these classications b oth rounds of extensibility parameters for revisionbased generation were then

derived using denitions to

Results

The evaluation results for conceptual and clustering coverage are given in table and those for paraphrasing

and realization coverage are given in table The values for statistic and result clusters for each evaluation

k u k u

round where computed directly from the cardinals of the sets C G G P P and P that partition the

u k

u u u u

test corpus as explained in Section Next to the value in p ercent the absolute numb er corresp onding

to is also given Following the empirically veried indep endence b etween clusters cf Section the

value for whole sentences for a given round is dened as the pro duct of the values for each cluster for that

round For each parameter the average over b oth rounds and the dierence called b etween them are

also given For clustering paraphrasing and realization coverage the column gain indicates the dierence

b etween the parameter values for oneshot generation and those for revisionbased generation

The evaluation results for conceptual and clustering extensibility are given in table and those for

paraphrasing and realization extensibility are given in table The values for b oth statistic and result

g

clusters and each evaluation round were computed directly from the numb ers N T N T N T B T

c c g g p p

b

p p

g g p p

N T N T N T N T N and T dened in Section

b b r r

r r r r

b b

The values for whole sentences for a given round is dened as the average of the resp ective p ercentages

for each cluster Because those p ercentages were obtained from samples of dierence sizes this average is

weighted by the those sample sizes In other words

S S R R

p a p a

W

S R

a a

with

W whole sentence p ercentage

S statistic cluster p ercentage

p

R result cluster p ercentage

p

S numb er of statistic clusters

a

R numb er of result clusters

a

Unlike for the coverage tables the gain column values of the clustering paraphrasing and realization

parameters in the extensibility tables were not computed straightforwardly as the dierence b etween the

corresp onding p ercentages in the oneshot and withrevision columns These p ercentages indicate the pro

p ortion of additional knowledge structures needed to fully cover the test corpus with each approach Since

each approach uses dierent knowledge structures these prop ortions have unrelated denominators and sub

tracting one from the other would not yield a very informative quantity Instead the extensibility gain G of

revisionbased generation over oneshot generation is computed using the following formula

K K

1 r

G

K

1

with

K numb er of additional knowledge structures needed to fully cover the test corpus with a oneshot

1

approach

K numb er of additional knowledge structures needed to fully cover the test corpus with a revision

r

based approach

In terms of the numb ers used in the other extensibility denitions cf Section K instantiates as

1

N for clustering extensibility

g

N for paraphrasing extensibility

p

N N N p for realization extensibility

c g

Similarly K instantiates as

r

g

B N for clustering extensibility

r

p

p

N for paraphrasing extensibility N

r

b

N N N for realization extensibility

c b r

In all four result tables the most interesting values are b oldfaced The rst imp ortant result is the

value for the tworound average of the realization coverage parameter for whole sentences and oneshot

generation This result means that the concepts the halfsentence concept combinations and the half

sentence realization patterns abstracted from the lead sentences over a given year only account for a little

Conceptual Clustering

oneshot w revision gain

statistic clusters round

result clusters round

whole sentences round

statistic clusters round

result clusters round

whole sentences round

statistic clusters average

result clusters average

whole sentences average

statistic clusters 

result clusters 

whole sentences 

Table Conceptual and clustering extensibility

Paraphrasing Realization

oneshot w revision gain oneshot w revision gain

statistic clusters round

result clusters round

whole sentences round

statistic clusters round

result clusters round

whole sentences round

statistic clusters average

result clusters average

whole sentences average

statistic clusters 

result clusters 

whole sentences 

Table Paraphrasing and realization extensibility

more than a third of the content and linguistic forms of the same sentences the following year The second

imp ortant results is the delta value for this same parameter from for the rst round down to

for the second It means that the sublanguage sample captured by these three knowledge structures

over is year or two is nowhere near converging towards the whole domain sublanguage

What causes this rather low overall coverage The average for conceptual coverage with a p ositive

delta of from after a year up to after two shows that it is not the app earance of new

concepts The domain ontology seems to have b een pretty much captured by a single year of analysis The

and averages for clustering and paraphrasing coverage b oth with negative deltas indicate

that one b ottleneck is the app earance of new combinations of known concepts and another is the app earance

of new linguistic forms to convey known combinations of known concepts They are b oth ab out as signicant

since while paraphrasing coverage is lower than clustering coverage the ability of a generator to express a

group of concept by various linguistic forms is less crucial than its ability to group concepts in a variety of

ways with at least one linguistic form available for each group

The revisionbased approach where new combinations and new realization patterns can b e derived from

simpler ones by applying known revision rules suppresses b oth b ottlenecks

Clustering coverage jumps to a improvement from the oneshot approach and thus

b ecomes even higher than the conceptual coverage

Paraphrasing coverage jumps to a improvement from the oneshot approach

With a realization coverage at a sp ectacular improvement from the oneshot approach

its is almost of the test corpus sentences whose content and linguistic form are captured by the basic

concept combinations basic realization patterns and revision rules abstracted on the acquisition corpus

The initial intuition than the more comp ositional revisionbased approach would improve robustness is thus

impressively conrmed by the results of the quantitative evaluation of coverage

The results of the quantitative evaluation of extensibility further conrms the sup erior robustness of the

more comp ositional approach With oneshot generation the average realization extensibility for whole sen

tences is This parameters measures the prop ortion of new knowledge structures needed to maintain

fullcoverage With revisionbased generation it come down to only However the improvements of

the revisionbased over the oneshot approach for extensibility though signicant from a little less than a

half to a little less than a third it is less impressive than for coverage Even with the revisionbased ap

proach almost a third of the knowledge acquisition task must b e redone every new year in order to maintain

fullcoverage of all concept combinations and linguistic forms This may seem at rst a quite a forbidding

prosp ect Fortunately such coverage is not needed in most practical applications Instead coverage

is tradedo against the overhead of further knowledge acquisition For example thanks to discrepancies

among the o ccurrence frequencies of the dierent domain concepts and sublanguage linguistic forms almost

coverage can b e attained by implementing only a hardcore of knowledge structures acquired over a

single year with a revisionbased approach

Quantitative evaluation of p ortability

Having evaluated the robustness of the corpus analysis results for the domain in which they were acquired

in the previous section I now turn to the evaluation of their p ortability to a new domain I rst describ e

the test corpus used for this p ortability test and the sp ecic linguistic structures that were tested I then

describ e the various steps involved in this evaluation and nally discuss the results

Starting p oint

The corpus analysis presented in Section resulted in a set of revision to ols used in basketball summaries

The goal of this second evaluation eort is to estimate to what degree these revision to ols are also used in

another domain Since these to ols were then implemented in the reviser of the generator system streak

the results of this evaluation will also indirectly give an idea of streaks p ortability

As explained in Section the basketball domain corpus from which the linguistic structures to evaluate

were abstracted was restricted to rep ort sentences that conveyed no other typ e of information than

Final result and score of the game

Streak extending or interrupting nature of the result

End of the game statistics for players or teams

Record breaking or equalling nature of these statistics

These typ es were chosen b ecause they were the four most commonly conveyed in the basketball summaries

The p ortability prosp ect of linguistic structures like revision to ols that have a semantic asp ect could b e

estimated only in quantitative domains where similar typ es of information are also common This is the case

of the sto ck market meteorology accounting lab or statistics etc As the test domain for this evaluation

I chose the sto ck market for two reasons large textual corp ora of rep orts in this domain are available

from newswires and it has b een the ob ject of several language generation pro jects Kukich Contant

Smadja

The test corpus consisted of rep orts on the American Asian and Europ ean sto ck markets by UPI AP and

Reuter compiled from the newsreader Its size was ab out words The evaluation involved considering

each revision to ol abstracted from the basketball corpus and lo oking for evidence of its usage in the sto ck

market corpus in order to compute the prop ortion of revision to ols common to b oth domains

As explained in Section some revision to ols were used for adding b oth historical and nonhistorical

typ es of information while others were used for only one of these two typ e and not the other Since

incorp oration of historical background is another innovative asp ect of this thesis I fo cused the p ortability

evaluation like the robustness evaluation on those to ols that are used to add historical information either

sp ecialized or versatile ie I excluded the to ols sp ecialized for adding of nonhistorical facts However in

order to keep the set of revision to ols evaluated large enough I considered as the acquisition corpus all three

years of analyzed basketball summaries the rst year from which the initial set of revisions were identied

and the two following years that were used as test corp ora for the robustness evaluation and during which

the classication of revision to ols was extended to account for the new cases encountered

The nal total classication resulted in a hierarchy of revisions that is given in app endix A A revision

operation is the application of a revision tool p ossibly accompanied by a side transformation The upp er

levels of the hierarchy classify revision to ol applications The b ottom level sub divides applications of the

same to ol with dierent side transformations Dep ending on whether this b ottom distinction is considered

or not the p opulation evaluated for p ortability consisted of either classes of revision to ol applications or

classes of revision op erations I estimated the p ortability at every level in this hierarchy

Metho dology

Identifying the source and target realization patterns of each revision to ol

Revision to ols are abstract transformations op erating on linguistic structures Their usage therefore cannot

b e directly observed in a corpus Each to ol emb o dies the structural changes necessary to transform a source

realization pattern for a given concept combination into a target realization pattern for the same concept

combination enriched with one additional concept Usage of a revision to ol in a corpus must b e indirectly

detected by lo oking for usage of these source and target realization patterns

The realization patterns whose robustness was evaluated in Section were capturing the semantic and

syntactic structures of phrase clusters ie roughly of halfsentences cf Section for the exact denition

of phrase clusters In the general case a revision to ol do es not apply to the cluster as a whole Rather

it applies lo cally to one of the clusters sub constituents As a result the same revision to ol may apply to

many dierent source patterns and resulting in many dierent target patterns Consider for example the

four corpus clusters b elow

Patrick Ewing scored p oints C

S

1

Armon Gilliam scored a franchise record p oints C

T

1

Ricky Pierce scored p oints and Eddie Johnson came o the b ench to add C

S

2

scored p oints and came o the b ench to add a seasonhigh C

T

2

These four phrases are two examples of what I called surface decrement pairs in Section More

Even though the is a surface decrement of C and C is a surface decrement of C precisely C

T S T S

2 2 1 1

dier b oth semantically and syntactically the and C realization pattern of the two source clusters C

S S

2 1

same revision to ol Adjoin of Classifier is nonetheless applied to b oth to obtain the realization patterns

This is p ossible b ecause the revision to ol is applied and C of the two corresp onding target clusters C

T T

2 1

and C lo cally to an NP sub constituent whose subrealization patterns is shared by b oth C

S S

2 1

Because each revision to ol has p otentially several source patterns and several target patterns asso ciated

with it the rst task of this revision to ol p ortability evaluation consisted in compiling for each to ol the

complete list of its source and target patterns

Abstracting common denominators of source and target patterns

Once realization patterns were prop erly indexed by revision to ols the next step consisted of comparing

for each to ol all its whole cluster source patterns to identify a common denominator Such common de

nominators were also identied for its whole cluster target patterns with the additional requirement that it

11 12

contains the phrase added by the revision The signature of a revision to ol can then b e dened as the

pair common source subpattern common target subpattern The signature of nominalrank adjoin

classifier with example phrases is shown in Fig

11

And displaced in the case of complex revisions

12

To make sure that the common target pattern do es not reduce to the common source pattern

Common Source Subpattern Common Target Subpattern

NP NP

det head det classifier head

41 points a [ franchise record ] [ 41 points ]

Figure Signature of Adjoin of Classifier w example phrases

to lead Chicago to a triumph over the that extended the Bulls win streak

to four games

lifting the Indiana Pacers to a victory over Miami that extended the Heats losing streak

to six games

p owering the Boston Celtics to a victory over Cleveland that snapp ed the Cavaliers win

ning streak at games

to pace Atlanta to a triumph over the Boston Celtics that snapp ed the Hawks slide

at three games

leading the Utah Jazz to a victory over Los Angeles that snapp ed the Lakers eight game

winning streak

leading New York to a victory over the Orlando Magic that snapp ed the Knicks three game

losing streak

to lead the Cleveland Cavaliers to a win over Detroit that gave the Pistons their third

straight defeat

Figure Example of nonoverlapping set of target clusters for a given to ol

For some to ols there is no unique subpattern common to al l its source or target whole cluster patterns

In such cases the signature of the to ol has to b e dened as a pair of subpattern lists with each element in

a list covering one of the maximal disjoint classes of whole cluster realization patterns

This is the case for example of Adjoin of Relative Clause to Top NP One example cluster from

the acquisition corpus for each of its target realization pattern is given in Fig No subset of the semantic

13

elements added by the revision and highlighted in b old is mapp ed exactly onto the same syntactic category

in al l seven clusters shown in that gure For example while in the streak length underlined is conveyed

by a lo cative PP in it is conveyed by a classier and in by an ordinal determiner Similarly the very

fact that the added phrase rep orts a streak while conveyed by a noun streak or slide in is conveyed

by the adjective straight in The common denominator to al l the target patterns thus reduces to its

source pattern which cannot constitute alone the signature of the to ol Therefore the target side in this

to ol signature need to b e dened as a list of three subpatterns the rst covering sentences the second

sentences and the third sentence

In the particular case ab ove the source side in the signature is a unique pattern In general however b oth

the source and target sides may b e lists Not all pairs that can b e formed from these two lists corresp ond to

the application of a revision to ol Therefore instead of simply two lists of subpatterns the general denition

of a revision to ol signature is a list of source subpattern target subpattern pairs where the source

14

subpattern is a surface decrement of the target subpattern Usage of the revision to ol in the test corpus

can then b e detected by lo oking for usage of both the source and the target subpattern from any pair in this

list

Approximating common subpatterns as crep expressions

As semanticosyntactic structures the source and target realization patterns of a revision to ol are in fact

themselves not directly detectable in an automated way Following a similar approach as for the robustness

evaluation I approximated realization patterns by lexicosyntactic patterns enco ded as crep expressions

This approximation allows partial automation of the search for realization patterns usage in the test corpus

The crep expressions for the whole cluster realization patterns observed in the two rst years of basketball

13

Recall that a realization pattern captures the mapping b etween semantic elements and syntactic categories

14

cf Section for the denition of surface decrement

Signature sourceAtargetA sourceAtargetA sourceAtargetA

st source pattern team reference followed at a distance of at least one word

by a determiner a score a synonym of win a synonym of over and another

team reference

SOURCEA TEAM DET SCORE NWIN OVER TEAM

st target pattern matching the st source pattern followed at a distance of at

least one word by that a verb at the past tense expressing either an

extension or interruption a team reference a genitive marker a noun

conveying a streak a preposition and an expression of the streaks length

TARGETA SOURCEA that VPEXTENDVPEND TEAM APOST NSTREAK IN STREAKLENGTH

Variation where the streak length pre instead of post modifies the streak noun

TARGETA SOURCEA that VPEXTENDVPEND TEAM APOST STREAKLENGTH NSTREAK

Alternative form a transfer of possession clause

TARGETA SOURCEA that VGAVE TEAM POSS ORD STRAIGHT NSTREAK

Figure crep expressions approximating the signature of Adjoin of Relative Clause to Top NP

rep orts were available from the robustness evaluation Using these expressions as a starting p oint three

tasks remained in order to enco de the signature of a each revision to ol as a list of crep expression pairs

Write the missing crep expressions ie those for whole clusters realization patterns that were observed

for the rst time in the third year of basketball rep orts

Now that each to ol was asso ciated with a complete list of source crep expressions and target crep

expressions for the basketball domain nd the common denominators among these expression lists

Group the source and target common denominators into surface decrement pairs The list of crep

sub expression pairs obtained constituted the approximate signature of the revision to ol

Step ab ove starting from the original crep expressions enco ding whole cluster patterns and getting

the crep subexpressions approximating a revision to ol signature for the same domain was describ ed in

general terms in the previous section In terms of crep approximation it translates into two main pro cesses

Subexpression suppression eliminating part of the whole cluster pattern not involved in the revision

pro cess

15

Subexpression generalization factoring out semantic and syntactic details distinguishing b etween

realization patterns but irrelevant at the higher abstraction level of revision to ols

The approximate signature of Adjoin of Relative Clause to Top NP is given in Fig The sub

16

expression denitions app earing in the source and target patterns of this signature are given in Fig

These target subpattern match the example sentences of the previous section as follows TARGETA

matches sentences TARGETA matches sentences and TARGETA matches sentence These

expressions factor out lowlevel semantic distinctions such as streak extension sentences vs streak

interruption sentences or winning streak sentences vs losing streak sentences This

factoring was done via two subexpression generalization

Grouping WINSTREAK with LOSESTREAK the subexpressions for noun or noun comp ounds

resp ectively expressing winning and losing streaks shown in lines and at the top of in

Fig

15

By adding disjunctions in the subexpression denition

16

Each of these subexpressions is glossed in comments ab ove its denition to make it understandable without knowing crep

syntax The reader interested in understanding the detailed crep enco ding of each subexpression should refer to section D

where this syntax is explained

Nominals conveying victories

NWIN victorywinblowoutblow outdefeatroutdrubbingtriumph

decisionrompupset

Nominals conveying streaks

NSTREAK WINSTREAKLOSESTREAK

WINSTREAK winning streakspreeflurryseries

LOSESTREAK slidelosing streaklosing skiddroughtslump

STREAKLENGTH CARDPREMOD GAME

Cardinal number in premodifying position

CARDPREMOD twothreefourfivesixseveneightnineteneleven

Synonyms for game

GAME gamedecisionoutingcontestsessionday

Ordinal number

ORD firstsecondthirdfourthfifthsixthseventheighthninthtenthCD

thstndrd

STRAIGHT straightconsecutive

Determiners

DET ARTICLEPOSS

ARTICLE anthe

POSS myyourherhisitsourtheir own

APOST ss

OVER overversusagainstof

Past verbs conveying extensions

VPEXTEND extendedstretchedprolongedrode

VPEND snappedendedbrokestoppedhaltedinterrupted

VGAVE gavebroughthandedsentdealt

Figure Denition le for the subexpressions of Fig

Grouping of VPEND with VPEXTEND the subexpressions for past verbs resp ectively expressing

streak extensions and interruptions shown in lines at the top and the three lines at the b ottom

of Fig

The acquisition domain signatures could not b e used as is on the test domain Further subexpression

suppressions and generalizations were needed b ecause the corresp ondence b etween the realization patterns

in the acquisition and the test domain is an imp erfect one This imp erfection is ro oted in three dierent

typ es of discrepancies b etween the two domains conceptual lexical and rhetorical The rst o ccurs when the

structure of a concept in the test domain do es not exactly match its corresp onding concept in the acquisition

domain The second o ccurs when a concept structure shared by b oth domains is nonetheless lexicalized

dierently in each of them The third o ccurs when some typ e of information though conceptually present in

b oth domains is absent from the rep orts of one domain for no other apparent reasons than some domain

sp ecic rhetorical convention An example of each typ es of discrepancies is given in the next subsections

describing the precise pro cess of p orting a crep expression

Identifying crossdomain corresp ondence of conceptual structures

The rst task for p orting crep expressions to the test domain is to identify for each conceptual structure of

the acquisition domain basketball a counterpart structures in the test domain sto ck market At this p oint

it is necessary to briey review the ontology of the basketball domain presented in Section It contained

ve basic conceptual structures

qualityplayerteamgame eg the hotshooting Boston Celtics

statisticplayerteamvalueunit eg Patrick Ewing scored points

gameresultwinnerloserscore eg the New York Knicks routed the Philadelphia ers

recordstatisticstreakvaluedirectionduration eg Armon Gil liam scored a franchise

record points

streakgameresultrecordteamresulttypeaspectlength eg the Boston Celtics won their

fourth straight game

The sto ck market domain includes a numb er of conceptual structures Only some of them are p ossible

candidates for sto ck market counterparts of the ve basketball conceptual structures They are

qualityindicatorcompany eg the buoyant Hong Kong market

variationindicatordirectionvaluevalueunit eg The Gold Index lost points to

dayresultadvancesdeclinesscore eg advances overpowered declines to

recordvariationvaluedirectionduration eg Volume climbed to a record bil lion

share

streakvariationrecordindicatordirectionaspectlength eg the Dow transportation

average retreated for the second straight session

Let us examine each basketball conceptual structure in turn to see which sto ck market structures may

corresp ond to them quality is an example of p erfect corresp ondence b etween acquisition and test concep

tual structures as illustrated by the pair of examples phrases the hotshooting Boston Celtics and the

buoyant Hong Kong Market It is also the simplest but least common of all structures

For statistic the only go o d counterpart candidate is variation They b oth share an agentive role

playerteam for statistic and indicator for variation as well as a unit role There are two role

17

mismatches the roles direction and value of variation have no counterpart in statistic Although

17

ie whether the index went up or down

there were cases of twovalued statistics in the basketball domain namely sho oting p erformances eg Reg

gie Mil ler made of shots they were very rare compared to the singlevalued ones In contrast there

18

were no o ccurrences of singlevalued variation in the sto ck market corpus though they are conceivable

These two mismatches are examples of conceptual discrepancies that do not require adjusting revision

to ol signatures This is b ecause most revisions applying to a clause realizing a statistic op erate local ly on

the NP realizing the value and unit roles of the statistic For example in Armon Gil liam scored a franchise

record points the revision is lo cal to the NP points Thus when lo oking for similar revisions in

the sto ck market corpus the fact that the source clause structure may dier eg Armon Gil liam scored

points vs The Gold Index lost points to is immaterial

For gameresult the immediate candidate counterpart is dayresult They b oth sumup all the events

that to ok place during the timeslice of the rep ort a game in the basketball domain a day or halfday in

the sto ck market domain and they share an antagonistic role structure resp ectively winner vs loser and

advances vs declines However there were no dayresult streaks in the test corpus ie no sentence like

Advances outnumbered declines for the third consecutive days

19

This absence can b e explained as follows

The conceptual structures variationmarket and dayresult are totally correlated if the mar

ket go es up then there are more advances than declines and viceversa therefore including streak

information for b oth would b e rather redundant

The rhetorical convention in the sto ck market corpus rep orts is to always convey variationmarket

up front and dayresult further down in the rep ort structure as a result if a streak is interesting

enough to b e rep orted it is then attached to the former rather than to the latter

This absence of dayresult streaks is an example of rhetorical discrepancy b etween the two domains

Since the present study fo cuses on gameresult streaks rather than on isolated gameresults the consequence

of this rhetorical discrepancy is that a test domain conceptual structure other than dayresult had to b e

considered as counterpart to gameresult one whose streaks are conveyed in the sto ck market corpus The

only candidate that satises this requirement is variation eg

the Amex Market Value Index inched up to for its sixth straight advance

However there is a role mismatch b etween gameresult and variation variation is missing the

antagonistic winner vs loser role structure of gameresult Instead whether the agent has won or lost

is indicated by a direction role ie while a basketball team wins or loses against a sp ecic adversary

a nancial indicator wins or lose p oints on its own So dep ending on the value of the direction role

the indicator role is the counterpart of either the winner role or the loser role This is an example

of crossdomain discrepancy requiring subexpression suppression while p orting the approximate signatures

of the revision to ols Consider again the revision rule Adjoin of Relative Clause to Top NP and the

SOURCEA subexpression used in its crep approximated signature shown on line from the top of

Fig The absence of a loser role in the nancial domain means that p orting SOURCEA requires

deleting its trailing part OVER TEAM which matches the loser role in the sp orts domain

All the corresp ondences established so far were for rstorder structures The secondorder structures

namely record and streak take as main argument one of these rstorder structures Their other arguments

are the same in b oth domains They thus do not intro duce further conceptual discrepancies

Acquiring the vo cabulary sp ecic to the test domain

Once a corresp ondence has b een established b etween an acquisition domain concept to a test domain concept

the revision to ol signatures involving these concepts need to b e adjusted to account for dierences in wording

18

This absence in the sto ck market corpus of cases were either the day spread or the nal value is rep orted without the other

motivated the view of clauses like The Gold Index lost points to as realizing a twovalued concept as opp osed two

singlevalued concepts sharing the same indicator direction and unit

19

Thanks to Jason Glazier and Karen Kukich for providing insights on sto ck market domain sp ecics

Since in b oth our domains most referring expressions use prop er names they constituted the main stumbling

blo ck to p orting the to ol signatures Names simply do not p ort The set of crep expressions was therefore

split in three a rst subset for the basketball sp ecic vo cabulary a second subset for the sto ck market sp ecic

vo cabulary and a third subset for the shared vo cabulary Denitions for referring expressions were part of the

domain sp ecic subsets While most verbs and common nouns were in the shared subset elements of some

synonymous groups were found in the domain sp ecic subsets For example to express the interruption of a

streak the verb to from is widely used in the sto ck market domain In contrast in the basketball

domain to break to snap to halt to end and to interrupt are preferred The vo cabulary of the

test domain was acquired by running crep with acquisition domain expressions written for the robustness

evaluation where the domain references where replaced by wildcards trapping unknown words b etween

known ones This technique is discussed b oth in Section D of App endix D and in Section of the crep

manual Duford

Porting a crep expression from one domain to another did not just involve accounting for the conceptual

and lexical discrepancies just presented It also required attention to details like minor rhetorical discrep

ancies which can prevent crep expression from matching the desired corpus sentences Examples of such

discrepancies include whether to leave quantity units implicit eg New York defeated Philadelphia

to and punctuation eg the use of vs for marking vs The Gold Index lost points

app ositions A full example of p orting the signature of a revision to ol from the acquisition domain to the

target domain using crep is presented indepth section D of App endix D

Manual p ostediting of p orted crep expression matches

Because crep expressions approximate semanticosyntactic patterns with lexicosyntactic ones the test

corpus sentences that matched the crep expressions of a revision to ol p orted signature need to b e manually

p ostedited in order to lterout the cases where the approximation is not valid

To illustrate the need for p ostediting consider for example the common target subpattern for the revision

to ol Adjunctization of Range into Instrument This subpattern reduced to the syntactic mark of the

20

instrument role namely the prep osition with immediately followed by an indenite NP Since crep is

a word and partofsp eech tag regular expression matcher and not a parser it has no real ability to delimit

NPs The crep expression for the syntactic mark of the instrument role was thus taken to b e simply

withIN an

But this expression do es not only match sentences with a genuine instrument PP such as

But futures outpaced the physical market with a gain in Decemb er contracts just after the close of

p oints to extending the premium to a healthy points

Unfortunately it also matches sentences where the presence of the bigram with a is purely acciden

tal such as

decline in futures and pressured the cash market lower Traders said the auction results combined with a

In this second sentence the NP a decline in futures is not an instrument adjunct but an ob ject argument

of the prep ositional verb to combine with In this case the lexicosyntactic pattern enco ded by crep

pro duces an erroneous approximation of the desired semanticosyntactic pattern The sentences that were

wrongly retrieved with this pattern need to b e ltered out by manual p ostediting The fact that each match

le was analyzed for verication kept this p ostediting phase a manageable task since there was only two

such les one for the chosen source pattern and one for the chosen target pattern to check for each to ol

20

Although in the general case a denite NP can also ll an instrument role I used the fact that in the basketball corpus all

NPs lling the instrument role in the target sentences of Adjunctization of Range into Instrument were denite to simplify

the search for corresp onding instrument roles in the sto ck market corpus

Summary of p ortability evaluation metho dology

The metho dology I have describ ed in the previous sections to evaluate p ortability of revision to ols to a new

domain can b e summarized as follows

Dene the acquisition domain signature of the revision to ol

a Index the whole cluster realization patterns of the acquisition domain by the revision to ols for

which they are either source or target pattern

b Decomp ose the set of whole cluster source patterns for the revision to ol into maximal subsets

sharing a common subpattern

c Rep eat b for the set of whole cluster target patterns

d Form a list of surface decrement pairs from the list of common source subpatterns and the list

of common target subpatterns

Considering the surface decrement pairs of the revision to ol signature in order rep eat the following

steps

a Find a test domain counterpart to each acquisition domain conceptual structure at play in the

target subpattern of the pair under consideration

b Check whether a crep expression was written during the robustness evaluation for this acquisition

target subpattern under consideration If not write one

c In this expression replace the crep subexpressions covering acquisition domain prop er names

by new ones for the corresp onding test domain names

d Run the altered crep expression on the test corpus If it matches a test corpus sentence go to

step f Otherwise pro ceed to step e

e Incrementally suppress replace and generalize other crep subexpressions in the expression

to account for conceptual lexical or rhetorical discrepancies b etween the acquisition and test

domains until it matches some test corpus sentences

f Postedit the le containing these matched sentences If it contains only erroneous approximations

of the sought target subpattern go to back to step e Otherwise pro ceed to step g

g Rep eat step a to f with the source subpattern of the pair under consideration If a valid match

can also b e found for this source pattern stop the revision to ol is p ortable Otherwise resume

the entire cycle with the next surface decrement pair in the revision to ol signature If there is no

next pair left stop the revision to ol is considered nonp ortable

Substeps e and f in the ab ove algorithm constitute a generateandtest approach to approximating

realization patterns by crep expression Typically changing or suppressing one crep subexpression results

in going from to o sp ecic an expression with no valid match to either a still to o sp ecic an expression with

no valid match or b to o general an expression with to o many matches to b e manually p ostedited The

range of expression sp ecicity can only b e explored using a trialanderror pro cess guided by the previous

run results This trial and error pro cess is in fact op enended it is always p ossible to write more complex

expressions manually edit larger match les or even consider larger corp ora in the hop e of nding a match

So one is left with having to decide somewhat arbitrarily that at a given p oint the likeliho o d of nding a

match is to o small to justify the cost of further attempts This is why the last line in the algorithm reads

considered nonp ortable as opp osed to simply nonp ortable The algorithm guarantees the validity

of p ositive results only In that sense the gures presented in the next section constitute a lowerb ound

estimate of the revision to ol p ortability

Results

In presenting the results of this p ortability evaluation I rst distinguish b etween dierent degrees of p orta

bility and give the degree of each revision op eration I then examine how p ortability is inuenced by two

factors the p osition of the op eration in the hierarchy of revisions and the frequency of its usage in the

acquisition domain

Degrees of p ortability

Portability of a revision op eration is established when two phrases forming a surface decrement pair ie a

pair source phrase target phrase for that op eration are found in the test corpus For example the two

test corpus phrases b elow

t

S the Hong Kong market

1

t

T the buoyant Honk Kong market

1

which dier only by the describ er buoyant form a surface decrement pair establishing the p ortability

of the revision op eration Adjoin of Describer This revision op eration allowing the opp ortunistic addi

tion of qualitative information to a referring NP was originally identied by the presence of similar surface

decrements in the acquisition corpus such as

a

the Boston Celtics S

1

a

T the hotsho oting Boston Celtics

1

I distinguish three degrees of p ortability in terms of semantic similarity b etween

t

The concept C that is added from the test source phrase to the test target phrase in the example

ab ove the current quality of the referred market

a

The corresp onding concepts C that is added from the acquisition source phrases to the acquisition

target phrases in the example ab ove the quality of the team referred to

As explained in Section to p ort each revision op eration signature a corresp ondence is established

from the concepts of the acquisition domain to those of the test domain Also this p ortability study was

restricted to historical acquisition domain concepts such as records or streaks I thus distinguish b etween

t

Same concept portability cases where the concept C added in the test corpus surface decrement is the

a

exact counterpart of the concept C added in the acquisition corpus surface decrement This is the

strongest typ e of p ortability

t

Dierent historical concept portability cases where the concept C added in the test corpus surface

a

decrement while not the exact conceptual counterpart of the concept C added in the acquisition

corpus surface decrement is nonetheless a historical concept

t

Dierent nonhistorical concept portability cases where the concept C added in the test corpus surface

decrement is not an historical concept This typ e of p ortability spawning the historicalnonhistorical

divide is the weakest

t t a a

The two surface decrement pairs ab ove S T and S S show that Adjoin of Describer

1 1 1 1

is a case of same concept p ortability The current quality of market fact added by the revision in the test

domain is the exact conceptual counterpart of the current quality of team fact added by the revision in

the acquisition domain

The two surface decrement pairs b elow show that the revision op eration Embedded NP Appositive

Conjoin is a case of p ortability for dierent historical concepts

t

the market is overcoming the weakness in IBM S

2

t

T weakness in IBM a market leader and key Dow Comp onent fol lowing the announcement of

2

another round of costcutting moves wil l likely weigh on the market

a

S to lead the New York Knicks to a victory over the Charlotte Hornets

2

a

T to p ower the Golden State Warriors to a triumph over the Los Angeles Clipp ers losers of

2

nine straight games

t t a a

The transformation from b oth S to T and S to T is the attachment of a new fact b oldfaced as

2 2 2 2

an app osition to an NP emb edded in a PP However whereas in the acquisition domain surface decrement

a a

S T this app osition is used to attach a streak information ab out the emb edded NP referent in the

2 2

t t

test domain surface decrement S T it is used to attach another kind of historical fact namely the

2 2

past status of the referent

The two surface decrement pairs b elow show that the revision op eration Adjoin Finite Time Clause

to Clause is a case of p ortability spawning the historical nonhistorical divide

t

S Trading volume was a busy mil lion shares

3

t

T Volume amounted to a solid mil lion shares as advances outpaced declines to

3

a

to lead the New York Knicks to a victory over the Charlotte Hornets S

3

a

T to lead Utah to a trouncing of Denver as the Jazz defeated the Nuggets for the th

3

straight time at home

a a t t

is the attachment of a new fact b oldfaced as a to T and S to T The transformation from b oth S

3 3 3 3

temp oral clausal adjunct to the original main clause However whereas in the acquisition domain surface

a a

decrement S T this temp oral adjunct conveys historical information ab out the statistic conveyed by

3 3

t t

the main clause in the test domain surface decrement S T conveys another statistic

2 2

Tables and gives the degree of p ortability for each of the evaluated revision op erations

The rst two columns contain the co de and name of the revision the third the numb er of o ccurrences in

the acquisition domain and the following four whether is it resp ectively same concept p ortable dierent

historical concept p ortable nonhistorical concept p ortable or not p ortable at all

Portability at various levels in the revision op eration hierarchy

The revisions whose p ortability degrees are given in tables and are the leaves of the revision op er

21

ation hierarchy presented in Section This hierarchy is repro duced here with the following graphical

conventions arcs have four shades of thickness corresp onding to the four p ossible degrees of p ortability

of the revison class they lead to The thicker the arc the more p ortable the class Only the thinest arcs

indicates that the no de b elow them corresp ond to a nonp ortable class Moreover the no des corresp onding

to classes that are p ortable at any degree are b oldfaced These conventions visualize how p ortability varies

with revision sp ecicity as the hierarchy is traversed They also visualize which classes of revisions are more

p ortable than others

The hierarchy toplevels are shown in Fig The subhierarchies down the no des adjoin conjoin

absorb and adjunctization and nominalization are resp ectively shown in Fig The

subhierarchies down the no des recast and nominalization are shown in Fig and the subhierarchies

down the no des demotion and promotion are shown in Fig

Before discussing how p ortability varies in this hierarchy it is imp ortant to rememb er the dierent typ es

of distinctions that split the classes at each level into the sub classes at the next level The hierarchy is seven

levels deep The rst level distinguishes b etween simple revisions that monotonical ly add a new fact to the

draft without changing the expression of the facts already conveyed from the complex revisions that reword

the expression of the some existing facts in order to accommo date the new fact into a concise linguistic

form The second level contains the eight main revision to ols Each to ol is characterized by a dierent typ e

of structural transformation that the draft undergo es during the revision Since the simpler revison to ols

can apply to a variety of syntactic constituents the third level distinguishes revisions resulting from the

application of these to ols to dierent syntactic ranks eg Adjoin to Clause vs Adjoin to Nominal The

next two levels further sub divide revisions in terms of b oth the semantic role and syntactic realization of the

22

phrase they attaches to the draft eg Adjoin Finite Temp oral Clause to Clause vs Adjoin NonFinite

Result Clause to Clause To add a given content unit some revision to ols can b e equally well applied to

dierent draft sub constituents of the same syntactic category but emb edded at dierent levels in the draft

21

Because the p ortability evaluation fo cused on the revision op erations that could b e used to add historical information the

branches in the hierarchy of Section corresp onding to revision op erations used exclusively in the original corpus to add

nonhistorical facts are not shown here

22

Or displace in the draft for complex revisions

Co de Name Usage Portable Non

sem sem 6

histo : histo

A Adjoin Classier

A Adjoin Describ er

A Adjoin Relative S to Top NP

A Adjoin Relative S to Top NP w Abridged Ref

A Adjoin Relative S to Top NP w Abridged Delete

A Adjoin Relative S to Top NP

A Adjoin Relative S to Emb edded NP w ReOrdering

A Adjoin NonFinite S to NP

A Adjoin NonFinite S to NP w Abridged Ref

A Adjoin Partitive to NP

A Adjoin Frequency PP to S

A Adjoin Frequency PP to S w Abridged Ref

A Adjoin CoEvent NonFinite S to S

A Adjoin CoEvent NonFinite S to S w Abridged Ref

A Adjoin CoEvent NonFinite S to S w Deleted Ref

A Adjoin Result NonFinite S to S

A Adjoin Time Finite S to S w Abridged Ref

C Top NP App ositive Conjoin

C Top NP App ositive Conjoin w Abridged Ref

C Top NP App ositive Conjoin w Deleted Ref

C Emb edded NP App ositive Conjoin

C NP Co ordinative Conjoin

C NP Co ordinative Conjoin w scop e mark

C Top S Co ordinative Conjoin

C Top S Co ordinative Conjoin w Abridged Ref

C Emb edded S Co ordinative Conjoin w Abridged Ref

C Emb edded S Co ordinative Conjoin w Abridged Ref

B Absorb NP in S as PP Instrument

B Absorb NP in S as PP Instrument w Abridged Ref

B Absorb NP in S as CoEvent w Agent Control

B Absorb NP in S as Mean w Agent Control

B Absorb NP in S as part of Aected NP App osition

B Absorb NP in NP as PP qualier

Total

Table Portability degree of simple revision op erations

Co de Name Usage Portable Non

sem sem 6

histo : histo

R Recast Classier as Qualier

R Recast Lo cation as Instrument

R Recast Range as Time

R Recast Range as Instrument

Z Adjunctize Created as Instrument

Z Adjunctize Range as Instrument

Z Adjunctize Range as Instrument w Abridged Ref

Z Adjunctize Range as Instrument w Deleted Ref

Z Adjunctize Range as Instrument w Demoted Agent

Z Adjunctize Lo cation as Instrument w Abridged Ref

Z Adjunctize Lo cation as Instrument w Abridged Ref

Z Adjunctize Aected as Opp osition

N Nominalize w Ordinal Adjoin

N Nominalize w Ordinal and Classier Adjoin

N Nominalize w Ordinal and Qualier Adjoin

D Demote Aected to Aected Qualier

D Demote Aected to Aected Determiner

D Demote Score to CoEvent Score

P Co ordination Promotion

P Co ordination Promotion w Adjunctize

Total

Table Portability degree of complex revision op erations Monotonic Revisions Non−monotonic Revisions

AdjoinAbsorb Conjoin RecastAdjunctization NominalizationDemotion Promotion

Figure Revision op eration hierarchy toplevels

Adjoin

NP−Adjoin S−Adjoin

partitive classifier describer qualifier frequency result time co−event 1 25 5

relative−S Non−finite−S PP Non−finite−S Finite−S Non−finite−S 4 1

top bottom abridged full deleted full abridged full abridged abridged full deleted ref ref ref +reorder ref ref ref ref ref ref ref

20 1 3 9 2 2 10 13 3 12 2 1

Figure Adjoin revision op eration hierarchy

Conjoin

NP−Conjoin S−Conjoin

apposition coordination coordination

top bottom top bottom abridged full deleted +scope full abridged full abridged ref ref ref mark ref ref ref ref

15 5 4 1 1 2 2 5 1 1

Figure Conjoin revision op eration hierarchy Absorb

NP−Absorb S−Absorb

in−NP in−S in−S

as−qualifier as−instrument as−affected−apposition as−mean as−co−event

1 2 1 3 1

Figure Absorb revision op eration hierarchy

Adjunctization

>opposition >instrument

affected> range> created> location> 1 +agent demotion

abridged full deleted full abridged ref ref ref ref ref

727 3 1 14 5 4

Figure Adjunctization revision op eration hierarchy

Recast Nominalization

NP−Recast S−Recast +ordinal +ordinal +ordinal +classifier +qualifier classifier> location> range> 1 2 1

>qualifier >instrument >time >instrument

10 9 1 1

Figure Recast and Nominalize revision op eration hierarchy

Demotion Coordination Promotion

score> created> affected> simple w/ Adjunctization 1 1 >score(co−event) >qualifier(affected) >determiner(affected)

1 2 1

Figure Demotion and Promotion revision op eration hierarchy

Depth No des Portable Non

sem sem 6 Total

Histo : Histo

Monotonicity

Structural transformation

Draft phrase syntactic rank

Semantic role andor syntactic realization

of added or displaced phrase

Draft phrase emb edding depth

Accompanying side transformation

Table Portability from top to b ottom of the revision op eration hierarchy

structure For example to add a streak information on the draft phrase

to power the Golden State Warriors to a triumph over the Los Angeles Clippers

the same revision to ol Adjoin Relative Clause to NP can b e applied to the emb edded NP referring

to the losing team yielding

to power the Golden State Warriors to a triumph over the Los Angeles Clippers who lost for

the ninth straight time

Alternatively it can b e applied to the toplevel NP conveying the game result as a whole yielding

to power the Golden State Warriors to a triumph over Los Angeles that extended the Clip

p ers losing streak to nine games

The next level in the revision hierarchy thus distinguishes b etween the applications of these versatile

to ols at various embedding depths in the draft Finally at the lowest level the application of a revision to ol

is sometimes accompanied by a side transformation of the draft whose ob ject is not add further content

but instead to take into account the new presence of the added phrase and avoid rep etitions ambiguity etc

eg the abridged reference Los Angeles instead of the complete the Los Angeles Clippers in the last

example ab ove after the addition of a second reference to the team worded the Clippers

In order to quantitatively determine how each of the ab ove distinction aects p ortability the p ercentage

of no des with each p ortability degree at each level down the hierarchy is given in Table The rst column

indicates the prop erty used to branch the hierarchy at that level and the second numb er of revision classes

at that level The remaining columns give the p ercentage with the absolute numb er in parenthesis of

those classes which are resp ectively same concept p ortable dierent historical concept p ortable dierent

nonhistorical concept p ortable p ortable of any kind and non p ortable

Table contains several noteworthy results in b oldface The rst is the striking fact that al l of the

main eight classes of revision to ols identied in the sp orts domain are same concept portable to the sto ck

market domain For everyone of them I have found at least one phrase pair in the sto ck market corpus

attesting to their usage for adding the exact conceptual counterparts of the sp orts domain It is quite

impressive that even usage of the revision to ols involving the most complex transformations were detectable

in the sto ck market corpus It is similarly remarkable that even those that were used only a few times in

the sp orts corpus were nonetheless found in the sto ck market corpus As the conceptual distance b etween

basketball and nance is fairly large these result suggests that the generation approach and linguistic data

presented in this thesis are likely to apply to other quantitative domains such as auditing meteorology

costanalysis or lab or statistics

The second interesting result is that a ma jority of the very sp ecic revision op erations lo cated at the

b ottom of the hierarchy are p ortable even though they distinguish b etween applications of the revision to ols

into dierent semantic syntactic and rhetorical contexts Since each such op eration is implemented via a

revision rule in streak this result suggests that a ma jority of those rules could b e reused to implement

a sto ck market rep ort generator As explained in Section the rule base in not a at list but instead a

Functional Grammar FG where commonalities b etween rules are factored out Each arc in the hierarchy

roughly corresp onds to a branch in this FG Since there are p ortable arcs and nonp ortable arcs in the

hierarchy up to twothird of the revision FUG branches develop ed for a given quantitative domain should

b e reusable for the next one

A third noteworthy result is that among the six prop erties used for classication in the revision op eration

hierarchy only two bring a sharp drop in overall p ortability The rst of these prop erties is the semantic role

and syntactic realization of the phrase added or displaced by the revision For example in Adjoin of Clause

to Clause as Time vs Adjoin of PP to clause as frequency the resp ective distinctive semantic roles

and syntactic categories of the added constituent are time vs frequency and clause vs PP Sub dividing the

revision set along these lines brings down its overall p ortability by Let us examine some examples

where such distinctions did not p ort First consider Adjoin of Clause to NP There are two distinct

syntactic forms for such qualifying clauses relative and nonnite b oth used in the sp orts domain as

attested by the following sentences

scored points and grabbed rebounds Thursday night to pace the Houston

Rockets to a triumph over the Chicago Bul ls that completed a twogame sweep

Drazen Petrovic scored points and added points and rebounds Friday night

leading the New Jersey Nets to a victory over the completing a sweep of

the season series

In the b oldfaced relative clause unambiguously qualies the game result lexicalized by the noun

triumph In the light of the structural and semantic similarity b etween these two sentences I also

interpreted the b oldfaced nonnite clause in as qualifying the game result this time lexicalized by

victory Note however that this second sentence has another p otential reading with the b oldfaced

nonnite clause mo difying the toplevel clause conjunction

In the sto ck market domain I could nd an instance of Adjoin to NP for the relative clause case

They said the market got an initial boost from overseas markets which proved strong for the

second straight day

but none for the nonnite clause cases

There were cases of Adjoin of NonFinite Clause but they were all mo difying other clauses not NPs

eg

The bluechip Hang Seng Index which sank points Friday soared points to

snapping its three session losing streak

23

Use of such abridged syntactic forms like nonnite clauses whose attachments are not always clearcut

are p ervasive in the quantitative summarization sublanguage In some cases it may b e b est to reconsider

some interpretation choices based on data from single domain in the light of crossdomain data Such

reconsideration for examples like the one ab ove would have improved the p ortability results

The example ab ove was a case where distinctions in terms of the syntactic form of the phrase added by

a revision to ol aected its p ortability Let us now lo ok at a case where p ortability is aected by distinction

in terms of the semantic role of a phrase displaced by the revision This is the case of Adjunctization

Verb Argument into Instrument PP In the sp orts domain the adjunctized argument could function as a

24

variety of semantic roles including

Created where scored POINTS

b ecomes Kevin Edwards tied a career high with POINTS

Range where the Milwaukee Bucks posted A VICTORYOVER THE

b ecomes the Milwaukee Bucks snapp ed a losing streak at ve games with A VICTORY

OVER THE DETROIT PISTONS

23

Some corpus sentences are even atly ungrammatical

24

The set of semantic roles used for this analysis are those one used in surge which are themselves derived from various

systemic linguistic sources Fawcett Halliday Winograd

Location where the Denver Nuggets rol led to A VICTORYOVER THE UTAH JAZZ

b ecomes the Denver Nuggets ended their three game losing streak with A VICTORY

OVER THE UTAH JAZZ

In the sto ck market domain only cases of adjunctized Range roles could b e found eg

the market posted A POINT REBOUND

b ecomes the market b egan showing signs of recovery on Monday with A POINT REBOUND

Note that all three semantic roles displaced in the sp orts domain surface as ob ject syntactic roles Had

these cases not b een further decomp osed in terms of semantic roles the p ortability measurement would have

improved

The two examples just presented show that as so on as semantics is taken into account p ortability

sometimes dep end on gray areas like interpretation choice or analysis granularity The other cases are

mostly due to discrepancies b etween corresp onding conceptual structures in the two domains

After semantic roles the other main factor aecting overall p ortability of a revision to ol is the dierent

side transformations that sometimes accompany them Taking into account these side transformations brings

down p ortability by another

Why is this the case Consider for example the strategy for abridging references presented in section

In the example of Adjunctization of Range into Instrument PP ab ove the displaced referring

NP is left intact in the revised draft In the sp orts domain there is a variant to this revision op eration where

the displaced NP is abridged This variant is illustrated by the example phrases b elow

the Chicago Bulls p osted a victory over the

the Chicago Bulls remained unb eaten against Minnesota alltime with a victory over

the Timberwolves

The revision from to adds a second reference to the losing team worded Minnesota To avoid

rep etition the initial reference to this team is abridged from the Minnesota Timb erwolves to only the

Timb erwolves This strategy relies on the fact that in the sp orts domain most entities have comp ound

names dierent parts of which can b e used unambiguously in dierent references for the targeted audience

ie sp orts fans For teams the franchise name and the home city for players the rst and last name etc

This is rarely the case for nancial entities Therefore this side transformation is much less frequent in

that domain causing the revision op erations incorp orating them such as Adjunctization of Range into

Instrument PP with Reference Abridging to b e nonp ortable

Having compared p ortability at various depth levels in the revision hierarchy I now contrast p ortability

on each side of this hierarchy A quick lo ok at tables and indicates that the simple revisions on

the left side of the hierarchy tend to b e more readily p ortable than the complex ones on the right side I

computed this leftright side contrast at the two b ottom levels of the hierarchy At the very b ottom level

ie when distinguishing revision op erations in terms of b oth revision to ol application and accompanying

side transformation simple revisions were p ortable compared to only for complex revisions This

dierence is not striking However at the next level up ie ignoring side transformations the p ortability

of simple revisions go es up to while those of complex revisions remains no higher than

Intuitively this dierence seems signicant How likely is it for such a dierence to app ear at random

25

A Fisher test comparing the p ortablenonp ortable prop ortion for simple revisions to the corre

sp onding one for complex revisions tells us that the probability of this dierence b eing an artifact of

our particular set of data is This is very close to the probability that is generally accepted as

statistically signicant Simple revisions thus indeed seem to b e more p ortable than complex ones Another

Fisher test comparing the p ortablenonp ortable prop ortion for simple revisions at the nexttob ottom level

to the corresp onding one at the b ottom level indicates that there is a probability that

this dierence is due to chance Therefore when simple revisions do p ort it is mostly b ecause the side

transformation accompanying it do es not For complex revisions it is instead mostly b ecause the sp ecic

semantic roles aected by the revision do not p ort

25

Thanks to Vasileios Hatzivassiloglou who suggested this test as the most appropriate one for my small sized sample data

Occurrences Total Portable Non

sem sem 6 Total

histo : histo

<



Table Portability and usage frequency in acquisition corpus

Portability and usage frequency in the acquisition corpus

Having examined the inuence of the sp ecicity and complexity of a revision op eration on its p ortability

I now examine the inuence of another factor frequency of usage in the acquisition corpus Just as one

would a priori exp ect simple revisions to b e on the average more p ortable than complex ones one would

also a priori exp ect the most frequently used ones to b e on the average more p ortable than the seldom

used ones Is this indeed the case Table indicates for each o ccurrence frequency how many revision

op erations o ccurred that many times in the acquisition corpus It also gives the breakdown of this numb er

in terms of p ortability degrees In order to have a signicant numb er of revision op erations with more than

a few o ccurrences the level of decomp osition used for this study was next to b ottom of the hierarchy ie

distinctions in terms of accompanying side transformations were ignored

Summing the overall p ortability for growing ranges of frequencies

shows that p ortability clearly do es not monotonically grow with frequency usage in the acquisition cor

pus No clear trend app ear b efore o ccurrences Ab ove this frequency however all revision op erations

are p ortable even b etter they are all sameconcept p ortable Below this threshold the average overall

p ortability is Intuitively this dierence seems signicant However a Fisher test comparing the

prop ortion b elow the threshold to the one ab ove reveals that the probability of this b eing an

artifact of our small sample is to o high to consider this dierence statistically signicant More data

would need to b e analyzed to conrm b eyond doubt the intuition that frequently used revision op erations

are signicantly more p ortable than seldom used ones

Chapter

Related work

There are four main topics of natural language generation that are directly related to the research presented

in this thesis

Summary rep ort generation

Generation with revision

Incremental generation

Evaluation in generation

Another central theme of this thesis is the generator architecture However b ecause a great numb er of

generators have b een implemented and their variety in terms of architectures is only matched by their variety

in terms of applications I cannot comprehensively review generator architectures in the present thesis For

such a review with a sp ecial fo cus on the relation b etween architecture and lexical choice see Robin

In this section I mention architecture issues only for systems either whose application summary rep ort

generation or approach to generation revisionbased or incremental is directly related to those of streak

Related work in summary rep ort generation

Prior to streak six main systems generated natural language rep orts to summarize quantitative data ana

Kukich Kukich fog Bourb eau et al Polguere gossip Carcagno and Iordanska ja

lfs Iordanska ja et al semtex Ro esner and sage Roth et al I briey review

each of these systems in the following subsections

Other generation systems that pro duced rep orts are not directly relevant to the present thesis b ecause

the rep orts are not sp ecically summaries and b ecause they worked in qualitative domains eg Danlos

generator Danlos pauline Hovy Kalitas system Kalita texplan Maybury

While there are systems that attempt to summarize textual input such as newswire articles by selecting

representative sentences in the text eg Rau they are not directly relevant either b ecause they

involve no generation

Ana

The eld of summary rep ort generation was pioneered by Kukich Her system ana Kukich summarizes

the daily uctuations of several sto ck market indexes from halfhourly up dates of their values A rep ort

generated by ana is given in Fig Internally ana consists of a pip eline of four comp onents the rst

written is C and the remaining three in ops

Thursday June

wall streets securities markets meandered upward through most of the morning b efore b eing pushed downhill

late in the day yesterday the sto ck market closed out the day with a small loss and turned in a mixed showing

in mo derate trading

the Dow Jones average of industrials declined slightly nishing the day at o p oints the

transp ortation and utility indicators edged higher

volume on the big b oard was shares compared with shares on Wednesday advances

were ahead by ab out to at the nal b ell

Figure An example rep ort generated by Ana from Kukich

The fact generator which compiles the halfhour up dates of the sto ck market indexes into facts Facts

are aggregates of various statistics over a given p erio d of time They are represented by ops Working

Memory Elements WMEs An example fact is shown in Fig

The message generator which groups facts into clausesized content units called messages and also

represented as ops WMEs An example message is shown in Fig

The discourse organizer which groups and orders the messages into paragraphs It also collapses

together messages which share several features eg when two consecutive messages dier exclusively

with resp ect to their sub jects these messages get collapsed into a single message with a comp ound

sub ject

The text generator which go es over the ordered message list pro duced by the discourse organizer maps

each message onto a clause and combines clauses together into complex sentences

In ana the text generation subtasks dened in section are thus distributed among the ab ove com

p onents as follows

Content pro duction and selection are p erformed by the fact and message generators

Discourse level content organization is p erformed by the discourse organizer

Phrase level content organization lexicalization semantic grammaticalization morphosyntactic gram

maticalization and linearization are all p erformed by the text generator

The tasks p erformed by anas text generator thus corresp ond directly to the tasks implemented in

1

streak anas text generator relies on two key knowledge sources a phrasal lexicon that denes direct

mapping from messages to handco ded predicate clause patterns and sub ject nominal patterns and a set

of rules dening how phrasal entries can b e combined to form multiple clause sentences that uently combine

several facts An example of phrasal entries is given in Fig Such entries comprise up to eight words

Each output sentence is built by assembling two or three of such phrasal entries

Systems based on the MeaningText Theory

Three imp ortant summary rep ort generation systems have b een built in recent years by the same research

team distributed in three cites Odysee Research Asso ciates University of Montreal and Cogentex The

systems all use the MeaningText Theory MTT Melcuk and Pertsov as underlying linguistic mo del

The rst of these system is fog Bourb eau et al It pro duces daily lo cal marine weather bulletins

in b oth English and French from meteorological measurements An example English rep ort generated by

2

fog is given in Fig As opp osed to most other generation systems which are research prototyp es fog

1

Except that streak has the option not to express all the messages it is given in input when all of them cannot t in a

readable sentence It is thus also p erforming nal content selection In constrast the text generator of ana has no such option

It must convey all the element in its input message list

2

In fact probably even all others except plandoc Kukich et al

Example of fact make fact factname halfhourstat

indicatorname DowJonesindustrialaverage

indicatortype composite

date

time am

currentlevel

direction down

degree

cumulativedirection up

cumulativedegree

highorlowmark nil

Example of message make message date

topic DowJonesindustrialaverage

subtopic DowJonesindustrialaveragestatus

subjectclass DowJonesindustrialaverage

direction down

degree small

time closing

variablelevel

variabledegree

Corresponding sentence the Dow Jones average of industrials declined slightly

finishing the day at off points

Figure Example fact and message in ana from Kukich

make phraselex phrasetype predicate

topic generalmarket

subtopic interestingmarketfluctuation

duration long

degree small

direction up

time early

subjecttype name

subjectclass market

verbpastsingularinflection meandered

verbpastpluralinflection meandered

verbpresentparticipialinflection meandering

verbinfinitiveinflection to meander

predicateremainder upward through most of the morning

random

length

Figure Example phrasal lexicon entry in ana from Kukich

Winds nothernly to knots b ecoming northwest Wednesday morning Snow and rain tap ering to urries

Wednesday morning Visibility zero to in precipitation Temp eratures near zero

Figure An example rep ort generated by fog from Polguere

The system was op erating for hours minutes and seconds Usage was particularly intense b etween

and with idle time only cycles during this p erio d Seven users worked on the system

Five of them used mostly compilers CLispFortran and the Prolog interpreter VLADIMIR and LEO read

numerous les VLADIMIR was interested in system priority tables LEO listed many user les from his

own group He initiated large print jobs using these les VLADIMIR failed to change access parameters

for system les No mo dication to system les were noted

Figure An example rep ort generated by gossip from Carcagno and Iordanska ja

is a nished pro duct in everyday use at weather centres in Eastern Canada Compared to ana in terms of

conceptual summarization fog averages input data along spatial dimensions in addition to temp oral ones

In terms of linguistic summarization it relies on the telegraphic style p eculiar to weather forecasts instead

of relying on the clause combining journalistic style typical of newswires The second of these systems is

gossip Carcagno and Iordanska ja which summarizes the activity of computer users from the audit

trail pro duced by the op erating system of the machine they are logged on Its input is intended to the

system administrator in detecting system usage that is suspicious from a security standp oint An example of

such a summary is given in Fig As opp osed to fog gossip is only a monolingual research prototyp e

3

The latest system in this line of work is lfs which generates bilingual rep orts summarizing Canadian lab or

market statistics

These three systems are all implemented using an ob jectoriented extension of prolog and share the

same architecture This architecture consists of a pip eline of three comp onents

A text planning comp onent

A lexicalization comp onent

A linguistic realization comp onent

Multilingual generators based on this architecture have a single text planning comp onent but a distinct pair

of lexicalization and linguistic realization comp onents for each output language

The text planner is based on the notion of topic tree A topic tree represents a stereotypical textual

organization pattern followed by the rep orts in a given application domain It is comparable to one particular

traversal of a textual schema McKeown The no des of a topic tree are domain concepts and its

arcs are very general relations which can hold b etween concepts in many domains eg object aspect

subaction attribute element Each no de in the topic tree is represented as an prolog ob ject The

elds of the ob ject enco de b oth the concept represented by the no de and the arcs linking the no de to the

other no des in the topic tree The metho ds of the ob ject encapsulate the various text planning pro cedures

that can concern the concept represented by the no de Text planning starts by a rst topdown traversal of

the topic tree When a no de N with concept C is visited the metho ds of N are executed with the following

p ossible eects

Instantiation of C into a fact from the input statistic database if there is an instance of C in the input

database

Deletion of the N from the topic tree if there is no instance of C in the input database

3

Also in English and French prop now Conceptual cont before

system work−with sys−u−dur Communicative obj dur 7:32:12 agnt

user−set Representation

duration omega Deep 2 1 3 7:32:12 use1 system Semantic 2 1 1 1 1 > 1 before_now user det Representation 1 > 1

‘‘The system was used for 7 hours 32 minutes and 12 seconds.’’ Sentence

Figure Conceptual Communicative vs Deep Semantic representations the MeaningText Theory from

Carcagno and Iordanska ja

Addition to the topic tree of new arcs and no des added to the topic tree b elow N these additions

implements domain reasoning allowing the text planner to deduce additional facts from those present

in the input

After this rst traversal is completed the resulting instantiated topic tree is again traversed topdown

Two main tasks are p erformed during this second traversal arc merging and no de ordering If several arcs

down a given no de share the same lab el and lead to no des containing instances of the same concept these

arcs are merged into a single arc leading to a single no de containing the list of instances Linguistically such

merging op erations ultimately translate into replacement of a verb ose conjunction at the clause rank eg

Vladimir read numerous les and Leo read numerous les by a concise conjunction at the nominal rank

eg Vladimir and Leo read numerous les After arc merging each no de contains a sentence sized

content chunk Each no de is then assigned a numb er corresp onding to the sentential p osition in the output

text At the end of this second traversal each no de in the topic tree contains a at conceptual network

called a Conceptual Communicative Representation CCR enco ding the content of one rep ort sentence

The CCR of each topic tree no de is then passed on to the lexicalization comp onents which accesses a

dictionary and pro duces a corresp onding Deep Semantic Representation DSemR The DSemR is the input

to the linguistic comp onent Compared to the CCR the DSemR is also a at network but in which no des

are no longer languageindep endent concepts but instead word senses of a sp ecic language An example

CCR together with the corresp onding English DSemR is given in Fig

The task of the linguistic comp onent is to generate a natural language sentence expressing the content

enco ded in the DSemR The linguistic comp onent is an implementation of the MeaningText Theory MTT

Melcuk and Pertsov The MTT is a lexicalist straticational declarative and generationoriented

linguistic theory where sentences are represented at ve distinct layers

4

Deep SEMantic Representation DSemR is a at network whose no de are word senses which are

atomic within the theory ie which cannot b e further decomp osed in terms of other word senses The

DSemR of a sentence represents the whole recoverable meaning of the utterance whether explicitly

conveyed or not The DSemR may also include an annotation sp ecifying the theme vs rheme partition

of the sentence

5

Surface SEMantic Representation SSemR is also a at network whose no de are word senses It

diers from the DSemR in that it may comprise nonatomic word senses it represents only the

explicit content of the sentence and it indicates which no de corresp onds to the main verb The

mapping from DSemR to SSemR thus involves reducing some subnets into single no des lab elled by

semantically rich words dropping no des corresp onding to content b est left implicit and chosing which

explicit content unit should head the linguistic structure

Deep SYNTactic Representation DSyntR is a tree whose structure represents the constituency and

dep endency relations inside the sentence The no des of the tree are either op enclass words or lexical

functions The MTT comprises ab out such functions to mo del common semantic relations b etween

two words There are syntagmatic lexical functions such as Oper relating a noun to its agent oriented

6

supp ort verb eg perreb ound to grab as well as paradigmatic lexical functions such as

Anti relating antonyms eg Antivictory defeat The arcs of a DSyntR are either numb ers

distinguishing the various elements in a word argument structure attr for any typ e of attributive

relation coord for any typ e of co ordinated elements and append for any typ e of app osited parenthetical

or comment elements

Surface SYNTactic Representation SSyntR is a tree that matches the actual syntactic structure of

the sentence The SSyntR include no des for the closedclass words in addition to the op enclass word

no des that it inherits from the DSyntR In the SSyntR the lexical function no des have b een replaced

by subtrees corresp onding to their values Finally the arcs of an SSyntR are syntactic roles such as

Subject Object Determiner etc

MORPHological Representation MorphR is an ordered list of words each annotated with all the

morphological features necessary to compute its inection

The Conceptual Communicative Representation CCR of MTTbased systems corresp onds to the Deep

7

Semantic Sp ecication DSS of streak Therefore streak covers the generation subtasks carried out

by the lexicalization and linguistic realization comp onents in MTTbased generators However there is no

direct corresp ondence b etween the sentence representation layers of the MTT and those of streak

Mapping a at conceptual network onto a natural language sentence involves four main subtasks

Cho osing which element in the net to explicitly convey in the sentence explicitation

Cho osing how to group elements into constituents and cho osing the structural dep endencies among

these constituents hierarchization

Mapping the some of the chosen elements onto op enclass words lexicalization

Mapping the remaining chosen elements onto syntactic features and closedclass words grammatical

ization

In streak explicitation and hierarchization are p erformed rst followed by lexicalization and then

8

grammaticalization In the MTT it is lexicalization that is p erformed rst followed by explicitation and

hierarchization and nally grammaticalization What is gained by lexicalizing the whole conceptual input

b efore deciding which part of it to convey explicitly remains unclear Note that the system fog diers from

4

Sometimes simply called SEMantic Representation SemR

5

Sometimes called Reduced Semantic Represenation RSemR

6

In general lexical functions have multiple values so for example an other p ossible value for Operreb ound is to haul

in

7

cf section for the denition of streaks internal representation layers

8

At least most of it Lexicalizations resulting from application of lexical functions is p erformed after explicitation and

hierarchization

b oth gossip and lfs in that it do es not make use of the two semantic representation layers and directly maps

CCR networks onto DSyntR trees One advantage of early lexicalization is that word argument structure

can guide the hierarchization pro cess But aside from verbs only a few words take arguments Moreover

adverbial relations linking most clauses inside complex sentences are not lexically constrained Thus at the

complex sentence rank content organization options are largely underconstrained by lexical choices It thus

seems that the lexicalist avor of the MTT is more motivated by its origin in lexicographical work than by

its particular suitability to sentence generation

Other summary rep ort generators

Another system implemented for the domain of lab or market statistics this time in German is semtex

Ro esner It pro duces summaries of similar content and style to those generated by lfs It is also

implemented in an ob jectoriented fashion it uses flavors a lispbased ob ject system whereas lfs uses a

prologbased ob ject system However in terms of architecture semtex is far closer to ana than to lfs

It consists of a pip eline of three comp onents

amtex which queries an input lab or statistic database and pro duces case frames each one representing

a fact to convey in the rep ort Which typ es of facts get pro duced by amtex dep end on parameters

dening how statistical values should divided into various p opulation subgroups Case frames are

enco ded as ob jects and comprise the following elds direction with values increase decrease or

unchanged quantity timeperiod manner from by and to

semsyn which relies on a phrasal lexicon to map each element in the list of case frames it receives

from amtex onto a corresp onding partially lexicalized syntactic tree

sutras which pro duces a German sentence from the partially lexicalized syntactic tree it receives

from semsyn

semtex relies on co ordination with ellipsis as its main linguistic summarization device

The sage system Roth et al generates a combination of text and charts summarizing the current

status of a large engineering pro ject that help managers quickly resp ond to unforeseen diculties or missed

deadlines and keep the pro ject on track This system suggests the great p otential for oceautomation

applications of summary rep ort generation However while fo cusing on the issue of co ordinating textual and

graphical media sage relies on adho c techniques for text generation cf Roth et al p

Comparison with STREAK

An imp ortant rst dierence b etween previous summary rep ort generation systems and streak is that they

p erform two tasks that are not implemented in the current version of the streak prototyp e conceptual

summarization and combination of multiple sentences in a paragraph These tasks are the resp onsibility

of the fact generator and discourse planner in the general generation architecture presented in Fig of

section streak fo cuses on content realization and phrase level content planning bringing ab out several

signicant improvements over previous systems with resp ect to these two tasks In particular streak

Rep orts facts in their historical context

Trades o informativeness and conciseness against readability

Generates more complex sentences

Scales up b etter

Enco des more paraphrasing p ower

I briey elab orate in each of these p oints in the following sections

Historical context The most striking improvement of streak compared with previous summary gener

ation systems is that it systematically provides the historical background of the new events it relates The

rep orts generated by streak thus not only summarize a basketball game but also contextualize it With

the exception of ana previous systems summarized their input statistics out of context ana do es handle a

9

very restricted class of historical information alltime highs and lows ana detects such extrema by having

the alltime high and low values of each index enco ded in its content determination rules Since every time a

record is broken these values need to b e up dated by hand in the rules in which they are used this treatment

of historical information is adhoc It could not b e extended to the variety of records and streaks conveyed

in the summaries generated by streak

Trading o conicting goals The great originality of streaks architecture is that it allows the nal

decision of whether to include a complementary fact in the rep ort to b e made opportunistical ly under surface

form constraints This is in contrast with any previous summary generation system It makes p ossible explicit

tradeos b etween the three inherently conicting goals of summarization maximize information

minimize space and maximize readability This typ e of tradeo is illustrated in the rst example run

of section when streak rejects several candidate facts for inclusion in the lead sentence of the rep ort

on the basis of corpus observed limits on sentence complexity A system where all the facts to convey are

decided b efore any of them is linguistically realized cannot make that kind of decision This was the case

for all previous summary generators

Sentence complexity streak generates more complex sentences than any other generation system They

are more complex in that they convey more prop ositions are syntactically deep er and contain more words

The table b elow compares the maximum factual density syntactic depth and lexical length in sentences

from example rep orts generated by three previous summary rep ort generators and streak It shows a clear

dierence in the overall complexity b etween the sentence generated by previous systems and those generated

by streak

gossip fog ana streak

Factual density

Syntactic depth

Lexical length

These gures are estimates that I made by lo oking at the few example rep orts provided in publications

10

ab out these systems For lexical length I counted the total numb er of words b oth closedclass and op en

class For syntactic depth I parsed the deep est sentences by hand For factual density I represented the

content of the most informative sentences as a set of predicate calculus formulas For example the content

11

of the third sentence in Fig can b e represented by the ve predicates b elow and thus conveys facts

useuserccompilermostlyp erio d

useuserlispcompilermostlyp erio d

useuserfortrancompilermostlyp erio d

useuserprologinterpretermostlyp erio d

useuserlispcompilermostlyp erio d

As explained in section it is the draft and revision approach of streak that allows it to pack

in so many facts in such complex sentence structures Planning and realizing the same sentences using the

onepass approach of previous summary rep ort generators would b e problematic This is esp ecially true

9

The mention of the big b oard volume of the preceding day in the rep ort of Fig is not historical information It refers

to the initial volume of the current day which coincides to the nal volume of the preceding business day The scop e of this

rep ort is indeed limited to the current day

10

It is therefore by no mean the result of a rigorous systematic comparison It nonetheless gives a fair idea of the resp ective

complexities of the sentences generated by these systems

11

There are of course many dierent p ossible ways to enco de the content of a sentence in such predicate form For the purp ose

of such comparison the only concern is to remain consistent for sentences generated by dierent systems which I have tried to

do as much as p ossible

for systems based on the MeaningText Theory MTT which are micro co ded and lexicallydriven These

complex sentences contain to o many words to b e assembled in one pass and as noted in section their

toplevel structure is devoided of lexical constraints

Scalability Among previous summary rep ort generators ana remains the one generating the most com

plex and uent sentences It achieves uency for fairly complex sentences by relying on a phrasal lexicon

The entries in this lexicon are phrases comprising up to words simultaneously realizing several facts in

an idiomatic way To illustrate how this approach would generate streaks domain sublanguage consider

the nal draft in the rst example run of streak given in section

Dal las TX Charles Barkley matched his season record with points and Danny Ainge came o the

bench to add Friday night as the Phoenix Suns handed the Dal las Mavericks their franchise worst th

defeat in a row at home

Using a phrasal lexicon like the one of ana such a sentence could b e macroco ded from only stored phrasal

entries In constrast streak microco des this sentence from dierent entries made of indvidual words or

collo cations

The advantage of the macroco ded approach it is that it circumvents the identication of the complex

constraints that inuence the generation of these phrases It has two drawbacks First it cannot exploit

summarization p otential lying b elow the clause rank For example the addition of a complex historical fact

by only one or two words implemented in streak cf the rst example run of section could not b e

done with a phrasal lexicon

Second it makes scaling up the paraphrasing p ower prohibitively costly since it requires the handco ding

of a combinatorially explosive numb er of phrases For example supp ose that a wordbased lexicon contains

31

an average of synonyms p er concept With only entries such a lexicon could generate paraphrases

12

of the sentence ab ove Attaining such paraphrasing p ower with a phrasal lexicon similar to that of ana

316

would require handco ding entries In section I present a quantitative evaluation

of the resp ective extensibility of the macro co ded onepass approach and the micro co ded twopass approach

based on corpus data

This scalability problem of ana also applies to semtex since it also relies on a phrasal lexicon It

do es not however applies to generators based on the MeaningText Theory which build sentences from the

individual word entries of the Encyclopedic Combinatorial Dictionary ECD whose format is an integral

part of the theory With this approach the main problem is not scalability but rather sentence complexity

as explained in the previous section

Paraphrasing p ower The extensive set of revision op erations implemented in its revision rule base com

bined with the wide coverage of its syntactic grammar endows streak with high syntactic paraphrasing

p ower This p ower is illustrated on Run and Run given in app endix C Among previous systems only

ana fo cused on paraphrasing p ower streak improves over ana in this resp ect in two dierent ways First

by covering syntactic constructions not covered in ana such as innitive clauses relative clauses app ositions

and nominalizations cf Kukich p Second and more imp ortantly it is able to convey the same

fact at a variety of linguistic ranks including b elow the clause rank For example the same fact expressed

by a clause in Majerle came o the bench to score points can b e alternatively conveyed by a single

word in Reserve Majerle scored point Such alternative paraphrases cannot generated by a macro co ded

generator where sentence sub jects and sentence predicates are hardwired in the lexicon and realize mutually

exclusive classes of facts This limitation thus applies to b oth ana and semtex

The MeaningText Theory provides a comprehensive framework for handling paraphrasing For example

the alternative b etween fullverb clauses and supp ortverb clauses which is captured in streak by the

revision rules nominalization and adjunctization can b e mo deled in the MTT by the lexical functions

O per agentoriented supp ort verb and S action nominal eg S to decrease decrease and

1 0 0

O per decrease to show However it seems that this high paraphrasing p otential of the MTT has

1

not yet b een fully exploited in the various implementations of the theory Since there is only one realization

12

Ignoring that some of these choices may b e interdep endent and hence somewhat reducing this total numb er

Initial draft

The one dimensional zero based array of n elements FLAG represents the ag There are three array

markers L M and R standing for left standing for middle and standing for right resp ectively L is

initialized to M is initialized to R is initialized to n

Final draft

The ag is represented by a dimensional based array of n elements FLAG There are three array markers

L M and R standing for left middle and right resp ectively L and M are initialized to R is initialized

to n

Figure Example text generated by Yh initial and nal drafts from Gabriel

pattern p er concept combinations in the target meteorological sublanguage of fog cf Polguere p

no paraphrasing p ower is implemented for that system The paraphrasing p ower implemented in gossip

and lfs relies on lexical functions but is limited to alternatives necessary to avoid rep etitions within the

same multiclause sentences cf Carcagno and Iordanska ja p

Related work in revisionbased generation

In itself the idea of revision in generation is not new It has b een prop osed in dierent avors by Mann

Vaughan and McDonald Meteer Yazdani Gabriel Wong and Simmons

Cline and Nutter and Inui et al However only two implemented generation system

emerged from these prop osals Gabriels Yh and Inui et als weiveR

In what follows I rst briey describ e these two systems and compare the typ e of revisions they p erform

to those implemented in streak I then review the nonimplemented prop osals put forward by the other

authors cited ab ove The typ es of revisions describ ed in these prop osals dier from those implemented in

streak in terms of the typ e of representation that is revised at what level to revise and the

goals of the revision why revise I compare revision in streak with previous prop osals along these two

dimensions in turn

Yh

Yh Gabriel explores a numb er of interesting language generation issues on the toy domain of the

Dutch national ag game Yh takes as input an abstract description of a lisp program playing this game

and generates a paragraph describing the co de Since Yh describ es the code itself and not its execution as

in Davey for example and since Yh do es not include a mo dule p erforming domain reasoning Gabriel

p the input to Yh is the same for all its runs Dierent outputs are obtained by playing with the

value of adjustable parameters dening stylistic heuristics used during content organization and realization

Yh is implemented in an ob jectoriented fashion The exp ertise concerning various generation subtasks is

pro cedurally encapsulated in the metho ds of dierent ob jects

13

Yh is related to streak in that it works in two passes and p erforms b oth incremental generation and

revisions However it is during the initial draft building pass that Yh works incrementally In contrast

streak works incrementally during the subsequent draft revision pass Moreover the typ e of revisions that

Yh p erforms during the second pass are contentpreserving streak p erforms contentadding revisions For

example revisions in Yh involve passivization ellipsis and changing two conjoined clauses sharing the same

verb and ob ject but each with a dierent sub ject into a single clause with a conjoined nominal as sub ject

Such revisions do not make the draft more informative but only more concise or coherent An example

initial draft generated by Yh is given in Fig It is followed by the nal draft once revised by Yh

13

It therefore is related to the systems and prop osal surveyed in section

streak works incrementally starting from a draft representation that is already a complex structure

This complex structure is built prescriptively and guides the incremental revision pro cess In contrast Yh

works incrementally from scratch and the utterance structure progressively emerges from lo cally planned

and realized additions Although providing for maximum exibility this typ e of incremental addition from

scratch is underconstrained and may lead to the exploration of a very large search space This exp ensive

search is avoided in Yh by relying on knowledge extremely sp ecic to the tiny domain of the Dutch National

Flag program It could not b e avoided in a more complex domain

weiveR

weiveR Inui et al generates paragraphs that provide information for library users such as the

librarys schedule or the lo cation of a particular b o ok It takes as input a semantic tree whose arcs are

rhetorical relations indicating the high level organization of the text and whose no des are the content units

to convey This semantic tree is thus similar to the SSS layer in streak except that its structure do es

not yet reect the lowlevel organization of content units inside phrasal constituents To build a rst draft

weiveR traverses this semantic tree topdown progressively replacing semantic subtrees by corresp onding

fully lexicalized syntactic trees To p erform this mapping weiveR relies on the sentence level content

organization heuristics describ ed in Scott and Souza and a phrasal lexicon similar to the one describ ed

in Jacobs The nal draft representation consists of a hybrid tree whose toplevels are structured

in terms of rhetorical relations and whose b ottomlevels are structured in terms of syntactic dep endencies

Each substitution of a semantic tree by a syntactic tree during the mapping ab ove corresp onds to one of the

p ossible realization options enco ded in the sentence planning heuristics and phrasal lexicon Each syntactic

tree is annotated with the semantic tree that is realizes and the option realization that it represents After

the full draft is built each syntactic tree corresp onding to a sentence is linearized

The syntactic tree of each sentence is then examined by an evaluation comp onent which computes stylistic

parameters such as the lexical lengths and syntactic depths of various constituents When this comp onent

detects a constituent that is to o long or to o deep weiveR backtracks to cho ose another sentence planning

option resulting in an alternative draft weiveR keeps track of the history of the options considered over

multiple revision cycles weiveR also includes a facility for allowing a human editor to trigger similar back

tracking when she detects an ambiguity in the latest draft generated by weiveR The issue of ambiguity

is crucial in this work b ecause weiveR generates Japanese a language where constituent ordering is very

underconstrained by syntax which leads to numerous semantic ambiguities This asp ect of weiveR under

lines the interesting prosp ect of the revisionbased approach for integrating language generation facilities in

a comprehensive computerassisted do cument pro duction environement

weiveR diers fundamentally from streak in that all the revision op erations it p erforms are content

preserving For example these op erations include adding punctuation reordering constitiuents or replacing

an emb edded clause by a separate sentence None of these op erations make the draft more informative

only less ambiguous and more readable Also weiveR do es not attempt summarize the content it has to

convey and is not constrained by space Finally weiveR relies on a phrasal lexicon whereas streak uses a

wordbased lexicon

At what level to p erform revision

Both Yazdani and Meteer at least in her initial pap er Vaughan and McDonald prop osed p erforming

revision from an actual natural language draft At that level revision is a threestep pro cess rediscovering

the generators input anew by interpreting its output using a textunderstanding system evaluating this

output by matching its interpretation with the communicative goals that were input to the generator and

regeneration The problem with this prop osal is that the development of a text understanding system

remains in itself such a tremendous endeavor that it cannot realistically b e contemplated as a substep in

the development of a revision comp onent for generation at least for generationonly applications like rep ort

generation Using an existing natural language understanding system is not an option either b ecause each

of such systems works only for its very sp ecic domain sublanguage and application Although some cross

domain robustness has b een achieved by some parsing systems their output is merely a partial parse tree

annotated with some standard semantic features which is far from the kind of thorough domain and goal

based interpretation needed to guide revision This is why the other authors prop osed as I do to p erform

revision directly on some representation of the draft internal to the generator This is also p erhaps why in

her thesis Meteer after having dismissed revision of an internal representation as mere optimization

claiming true revision requires analyzing an actual natural language output Meteer then prop oses to

p erform revision on the TextStructure a level of representation internal to her spokesman generation

14

system

However Meteers distinction b etween optimization and true revision is p ertinent b ecause unless an

internal representation has some sp ecial characteristics revising it may end up b eing closer to backtracking

and planelab oration during singlepass generation than it is to the draft reviewing and editing phase in

twopass generation Of course every form of revision is indeed a form of global backtracking while searching

the Ndimensional space of all generation alternatives It is imp ortant though to distinguish draft revision

from local forms of backtracking restricted to a given class of generation choices In my view revising an

internal representation is clearly distinct from backtracking andor plan elab oration if that representation

has the following characteristics it includes a layer surfacey enough to uniquely determine a natural

language utterance and it includes all unretracted intermediate decisions that ultimately led to that

surface layer

The rst characteristic distinguishes draft revision from planelab oration b ecause the latter is p erformed

on a purely conceptual representation in onetomany mapping with various natural language utterances

The second characteristic distinguishes draft revision from lo cal backtracking b ecause the whole spectrum

15

of generation decisions is aected at once by a single revision The multilayered FDs that streak uses

to represent the current draft has b oth these characteristics Its surface element the Deep Grammatical

Sp ecication DGS uniquely determines a natural language output via the defaults provided by surge for

unsp ecied input features Moreover all its unretracted intermediate decisions are compiled in the Deep

Semantic Sp ecication DSS the Surface Semantic Sp ecication SSS layers and the p ointers maintaining

the corresp ondence b etween two contiguous layers These intermediate decisions are also maintained in Yh

inside its ob ject system and in weiveR inside its history of annotated hybrid trees

Apart from distinguishing informationadding revision from planelab oration the presence of a surface

layer in the representation on which revision is p erformed is needed for another reason As already noted in

the previous section summarization inherently involves trading o the conicting goals of informativeness

conciseness and readability By multiplying the numb er of candidate facts to convey dealing with historical

information only exacerbates the tension b etween these conicting goals

But factors like conciseness and readability directly dep end on surface form and monitoring them cannot

b e done by reasoning only at higher layers In that sense Meteers TextStructure remains to o abstract

Although grammatical constituency is already decided at that level many grammatical features and op en

class lexical items with dierent stylistic impacts are not yet sp ecied The level of representation that

unequivo cally determines a natural language utterance in spokesman is the Linguistic Sp ecication input

to mumble Meteer et al Using spokesman levels of representation revising to adjust the trade

o b etween informativity conciseness and readability would require acting up on b oth the TextStructure

level and the Linguistic Sp ecication level

Without b eing sp ecic ab out the characteristics of the representation to revise Cline and Nutter

also stressed the need for a representation to reect b oth content planning and surface realization decisions

Wong and Simmons advo cate p erforming revision using a blackb oard Although they are not sp ecic

ab out what typ e of information is to b e p osted on the blackb oard during revision the inherent exibility of

such a mechanism would certainly allow manipulations to simultaneously aect several layers of decisions

In fact the multilayered FD representing the draft in streak can b e viewed as a blackb oard cf Elhadad

and Robin for a discussion of the sp ecial control mechanisms of fuf that render FDs blackb oardlike

14

Note that Meteers prop osal is for future work SPOKESMAN do es not p erform revision

15

In that sense the hillclimbing phase of KDS Mann and Mo ore is not revision but lo cal backtracking

Revising to satisfy what goals

The revision goals considered by Meteer Gabriel Wong and Simmons and Inui et

al are purely stylistic They are informationpreserving revisions to make the draft more concise

clearer or more coherent In contrast streak p erforms informationadding revisions to make the draft

more informative Cline and Nutter prop ose p erforming b oth varieties of revisions However the typ e

of informationadding revision they prop ose expanding during the revision of some unexpanded no des of the

16

textual schemata used for content planning during the rst pass op erates at the discourselevel addition

of a sentence or a paragraph whereas the revisions implemented in streak op erate at the phraselevel

addition of a clause group or even a single word

streak diers fundamentally from the the revisionbased generators discussed so far in that it views as

a way to gradually and opp ortunistically improve informative content of the rep ort while keeping it short

and concise This view brings together revisionbased generation with another line of research incremental

generation reviewed in the next section

Related work in incremental generation

Incremental generation has b een investigated from three main p ersp ectives the AI p ersp ective of Yh already

describ ed in the previous section the syntactic p ersp ective of Tree Adjoining Grammars Joshi et al

and the cognitive mo delling p ersp ective of ipf De Smedt I review research from these last two

p ersp ectives in the following sections

Incremental generation and Tree Adjoining Grammars

Among the various approaches to incremental generation either prop osed or implemented in recent years a

recurrent theme is the use of Tree Adjoining Grammars TAGs as an underlying syntactic formalism TAGs

have had a long and complex history in two adjacent research elds formal language theory and natural

language parsing b efore p eople started to try to use them for natural language generation In what follows

I therefore rst briey overview the initial motivation of TAGs as well as their evolution from their inception

to their use for generation I then contrast the various TAGbased approaches to incremental generation

with the approach presented in this thesis and implemented in the system streak

Tree Adjoining Grammars

Intro duced by Joshi TAGs Joshi et al originated in the eld of formal language theory as an

alternative to the various string rewrite formalisms such as Context Free Grammars CFGs and Context

Sensitive Grammars CSGs that had b een put forward to study the computational complexity of b oth

articial and natural languages The basic idea was twofold

Switch the building blo cks for characterizing language from at unstructured strings to tree structures

Restrict the p ossible ways of combining these building blo cks to a single op eration called adjoining

With this approach the strings of language are no longer characterized as in CFGs or CSGs by a set of

rewrite rules on a set of elementary strings Instead the strings of the language are indirectly characterized

by a set of trees derivable through adjoining from a set of elementary trees The strings are then available

by the standard lefttoright reading of the tree leaves

The adjoining op eration involves merging two trees an initial tree I containing a no de of category X and

an auxiliary A tree whose ro ot is also of category X Auxiliary trees have the sp ecial prop erty of containing

a leaf no de whose category matches that of their ro ot This distinguished leaf no de is called the foot no de of

16

Or in Hovys term expanding growthpoints Hovy I: S A: S

I1 NPi S NP VP

How many ships NP VP Iraq Aux VP

Iraq V NP had V S*

I2 attacked Ei said

I + A: S

I1 NPi S

How many ships NP VP

Iraq Aux VP

had V S

A said NP VP

Iraq V NP

I2 attacked Ei

Figure Adjoining in TAGs

the auxiliary tree Ajoining A to I consists of excising the subtree I ro oted at the no de X of I replacing

x

it by A and then inserting I at the fo ot no de of A This adjoining op eration is pictorially illustrated in

x

Fig In this example an indirect sp eech clause is adjoined to the original sentence

How many ships Iraq attacked

yielding How many ships Iraq had said that Iraq attacked

The auxiliary tree A is spliced in the initial tree I b etween the two subtrees I and I

1 2

Numerous studies of the formal and computational consequences of these two departures from previous

grammatical formalisms were then carried out It was shown that while b eing closer in terms of computational

complexity to CFGs than to CSGs TAGs can still mo del some of the phenomena of natural language syntax

that are b eyond the grasp of CFGs and previously required the much greater representational p ower of CSGs

This result triggered a wave of enthusiam for TAGs in the natural language parsing community However

researchers who attempted to use TAGs for parsing in the context of practical natural language applications

so on discovered that many of the syntactic phenomena encountered in the texts they attempted to parse

could not b e satisfactorily mo delled by the single op eration of tree adjoining

In resp onse to this problem TAGs then evolved through a long series of extensions enriching the initial

principle with various additions many of them inspired from alternative indep endently develop ed gram

matical formalisms Although the elegant simplicity of the original formalism based only on tree adjoining

and now called pure TAGs has b een lost along the way many extensions motivated on practical grounds

were then studied from the formal p oint of view This resulted in an unprecedented wealth of theoretical I2 = L1 . L2 . L3 : S

NP VP

Iraq V NP

attacked Ei

L1: NP L2: S L3: NP

Iraq NP VP Ei

V NP

attacked

Figure Substitution of minimal trees in lexicalized TAGs

results Since many of these results concern extensions to TAGs inspired from other grammatical formalisms

which had not previously b een the ob ject of such formal study the evolution of TAGs created a synergic

theoretical ground for the systematic comparison of diverse grammatical formalisms Such comparisons bring

much needed insights revealing fundamental and previously obscured dierences and similarities among these

formalisms

The rst notable extension to pure TAGs was the intro duction of MultiComponent TAGs Joshi

that allow the simultaneous adjoining to the same tree of sets of auxiliary trees with coreferring no des across

several memb ers of the set Multicomp onent TAGs improve on pure TAGs in that they can mo del the long

distance dep endencies in sentences such as Which painting did you buy a copy of

The tree for this example sentence results from the simultaneous adjoining to the initial tree for the phrase

a did you buy

with the auxiliary trees for the whphrase w hich painting

i

and the NP a copy of e with e coreferring to w hich

i i i

The second and most dramatic extension of the formalism was the intro duction of Lexicalized TAGs

Shab es et al In order to deal with lexical constraints on surface syntactic structure eg the typ e

of complements a given verb accepts and also to make the formalism more comp ositional lexicalized TAGs

depart from pure TAGs in two ways

The elementary trees are much smaller as they corresp ond to minimal structures representing the

arguments of sp ecic lexical items

The op eration of substitution is intro duced as a second way to combine elementary trees to supplement

adjoining

An auxiliary tree A with a ro ot no de of category X can only b e substituted in trees with a leaf no de of

category X The substitution itself simply consists of replacing this leaf no de by A Examples of substitions

are given in Fig It shows how the tree I which was a subtree of the pure TAG initial tree I in Fig

2

is built from the three minimal lexical trees L L and L by two subtitution op erations This lexicalist

imp ort to TAGs seems very close in spirit to the surface syntax representation of the MeaningText Theory

Melcuk and Pertsov

As a result of these two extensions TAGs could now b e used to comp ositionally derive most syntactic

structures observed in natural languages However only lexical and syntactic constraints could inuence the

course of the derivation the no des of the trees were only lab eled by a word andor a syntactic category

The need to mo del semantic and pragmatic constraints as well led to the intro duction of Functional TAGs

VijayShanker where each tree no de is asso ciated with a feature structure The functional unication

of these feature structures is used to constrain the adjoining and substitution op erations An auxiliary tree

A can b e adjoined or substituted to a no de X of a tree T only when the feature structure at the ro ot

of A unies with the feature structure at X The use of functional unication was inspired by Functional

Unication Grammars FUGs Kay However the resulting functional TAGs are even closer to Lexical

Functional Grammars LFGs Bresnan and Kaplan since they separate on the one hand dep endency

and precedence constraints represented by the trees and on the other hand the other typ es constraints

represented by the feature structures in very much the same way than LFGs In LFGs dep endency and

precedence is captured by the cstructure and the other constraints by the fstructure As explained in

App endix B FUGs represent dep endency and precedence uniformly with all other typ e constraints inside

17

the feature structures resp ectively using the cset and pattern features cf Section B of App endix B

The series of extensions ab ove resulted from the desire to adapt a formalism initially designed as a

theory of natural language comp etence to the needs of implementing grammars for parsing A second

series of extensions resulted from several attempts to adapt the formalism for the needs of implementing

grammars for generation This trend of work again originated from a pap er by McDonald and Pustejovsky

in which they p oint out the similarity b etween the adjoining op eration of TAGs and the attachement

op eration used in their system mumble Meteer et al Like surge used in this thesis and nigel

Mann and Matthiessen developp ed at ISI mumble is a p ortable syntactic pro cessing frontend for

the development of generation applications It accepts as input a skeletal lexicosyntactic tree called a

linguistic speccation and incorp orates a fairly extensive p ortion of the English grammar In constrast to

surge which enco des the syntactic grammar declaratively as an FG mumble enco des it pro cedurally as

18

a set of lisp functions nigel do es a little bit of b oth enco ding the functional asp ects of the grammar

declaratively in a system network and its structural asp ects pro cedurally as lisp functions called realization

statements

The interest of TAGs for generation continued with a pap er by Joshi comparing TAGs and CFGs

and arguing that two prop erties of TAGs extended domain of lo cality and exible wordordering sp eci

cation make TAGs more suitable to generation than CFGs Four things must b e noted concerning this

comparison

It do es not present an implemented TAG grammar for generation but rather contrast the two formalisms

in terms of their general characteristics on illustrative example sentences

CFGs have b een rarely used for the development of natural language generation systems FUGs remains

the most widely used formalism in generation An exception to this general trend may b e the

area of reversible grammars Strzalkowski Researchers in this eld attempt to circumscrib e a

minimal set of mo dications and extensions allowing the computational grammars and formalisms they

develop ed for parsing to b e reused for generation as well The hop e is that for applications such as

questionanswering systems or interlingual machine translation where b oth a parser and a generator

are needed a reversible grammar would avoid duplicating the eort of writing the grammar rules twice

19

once for each mo dule

As p ointed out in the pap er itself Joshi p the extended domain of lo cality of TAGs is also

a prop erty of FUGs The exible wordordering sp ecication prop erty of TAGs p ossible only for TAGs

of the LDLP variety discussed b elow is also shared with FUGs at least in the fuf implementation

17

The particular feature structures of FUGs are called Functional Descriptions FDs

18

With all the wellknown software engineering drawbacks of pro cedual representations

19

This agenda seems to b e strongly motivated by a desire of uniformity and formal elegance From a purely practical

engineering viewp oint the use of available standalone syntactic grammars for generation such as surge or nigel whose

coverage is already extensive and gradually b eing extended seems a costecient alternative for the development of such

systems

of FUGs using paths in the pattern metafeature as explained in Section B of App endix B It

must b e noted however that the limited p ower of TAGs at least of pure TAGs with its the desirable

consequences in terms of computational complexity is not shared by FUGs

Joshi is not explicit concerning what variety of TAGs is compared with CFGs but since he mentions

only adjoining as way of combining elementary trees it seems to describ e pure TAGs rather than

any of the extensions presented ab ove that rendered TAGs practically usable for parsing Without

the op eration of substitution that allows the derivation of syntactic structures from individual words

using TAGs for generation would require the use of a phrasal lexicon such as those of ana Kukich

phred Jacobs or pauline Hovy with the limitations of this approach with resp ect

20

to scalability discussed in section

Considering this last remark it comes as no surprise that all subsequent prop osals to adapt TAGs to the

sp ecic needs of generation used lexicalized TAGs as their starting p oint Within this line of research three

main eorts have b een

LDLP TAGs Harbusch

Synchronous TAGs Shieb er and Shab es

Systemic TAGs Yang et al McCoy et al

LDLP TAGs Lo cal Dep endence Linear Precedence Tree Adjoining Grammars depart from the regu

lar lexicalized TAGs used for parsing in that they separate the expression of dep endency constraints among

constituents from the expression of precedence constraints among these constituents These two typ es of

constraints are implicitly compiled in the syntactic trees that are the building blo cks of regular TAGs For

example the constraints enco ded by the tree L in of Fig can b e alternatively mo dularly enco ded as

follows

dominateSNP

dominateSVP

dominateVPV

dominateVPNP

dominateVattacked

precedeNPVP

precedeVNP

This departure from lexicalized TAGs can b e viewed as the logical continuation of the departure from

pure TAGs to lexicalized TAGs in the evolution of the formalism towards more comp ositionality LDLP

TAGs seems to constitute the nal step in this evolution where the formalism has gone full circle to abandon

its original distinctive characteristic the use of syntactic trees as primitive elements Indeed in the way

they separate the expression of dep endency and precedence LDLP TAGs seem to stand closer to FUGs

where dep endence is expressed using the cset feature and precedence using the pattern feature than to

other TAGs This separation in expression allows LDLP TAGs to decomp ose syntactic pro cessing into two

mo dules one sp ecifying a p ossibly partial syntactic structure and another linearizing this structure into a

linguistic string This decomp osition parallels the division in surge b etween the grammar prop er and the

linearizer cf Section B of App endix B

LDLP TAGs have b een used in the multimedia presentation system wip Andre et al Wahlster

et al This system generates b oth graphics and natural language to provide explanations on how to

op erate an esspresso machine In this typ e of application the emphasis is on the denition of the overall

system architecture and on the task of co ordinating the two media The natural language generator is

essentially used to pro duce lo cative sentences such as The ono switch is located in the upper left corner

of the picture In the comparable system comet McKeown et al which generates co ordinated textual

20

For the example sentence of Fig ab ove the two phrasal patterns S How many ships S Iraq attacked and S

1 2

Iraq had said S where the S in each pattern indicating the no de where adjoining can o ccur would need to b e stored as a

whole in the lexicon S

NPi S q F

How many ships NP VP attack Iraq q

Iraq V NP

attacked Ei

S F

NP VP say Iraq F*

Iraq Aux VP

had V S*

said

Figure Mapping rules in synchronous TAGs

and graphical explanations on how to op erate and repair a military eld radio similar lo cative sentences

were simply generated in oneshot using the fufsurge package

One of the common characteristics of b oth pure TAGs and their parsingmotivated extensions is that

they op erate purely at the surface syntactic level But a fulledged generation system must start pro cessing

from an input that is a semantic as p ossible The latest two extensions of TAGs attacked this deciency of

the formalism for generation

In Shieb er and Shab es the authors prop ose to use for generation the formalism of synchronous

TAGs that they had develop ed for interpretation A generator based on this extended version of TAGs would

accept as input a logical form structured as a semantic tree For example the input for the sentence

How many ships did Iraq said were attacked

would b e LF sayIraqquantityofshipsattackIraqquantityofships

1

Note the nonisomorphism b etween the structure of that logical form and the syntactic structure of the

corresp onding sentence the predicate which represents the main whclause is emb edded in the logical

form and the say predicate which heads the logical form surfaces as a dep endent clause in the sentence

A synchronous TAGs consists of three sets of declarative structures

Elementary lexicalized syntactic trees with adjoining and substitution anchor p oints they represent

minimal linguistic constituents

Elementary semantic trees with adjoining and substitution anchor p oints they represent minimal

logical form constituents

Mapping rules asso ciating a syntactic tree and a semantic tree and sp ecifying the corresp ondence

b etween their resp ective anchor p oints

Input speechact whquestion

whit phenomenon actor

tense past

process think

actor you

phenomenon process hit

tense past

actee john

actor type person

id question

Output Who did you think hit John

Figure Example inputoutput of a systemic TAG generator

Two example mapping rules are given in Fig Each asso ciates a logical form on the right with a

linguistic tree on the left The thick link b etween the two indicates the corresp ondence b etween the anchor

p oints where adjoining can o ccur on each side The idea of synchronous TAGs is that since these mapping

rules are reversible generation of the syntactic structure can b e carried out by parsing the logical form

using the semantic TAG Each time a semantic tree is recognized adjoined or substituted in the logical form

b eing parsed the mapping rule asso ciated with the semantic tree is red and as a side eect a corresp onding

syntactic tree is also chosen adjoined or substituted Generation of the linguistic output thus b ecomes a side

eect of parsing the logical input For example parsing the tree corresp onding to logical form LF ab ove

1

would result in the generation of the syntactic tree I A in Fig Although very ingenious and app ealing

in theory it remains to b e seen whether this approach will turn out to b e practical for the implementation

21

of a generation application

Systemic TAGs fully integrate the formalisms of systemic grammars and tree adjoining grammars A

22

generator based on a systemic TAG accepts as input a feature structure such as the one given in Fig

It is similar to the input accepted by surge in that it is structured in terms of thematic roles inspired

from systemic linguistics It diers in that it also sp ecies the closedclass words that surge cho oses on its

own The constituent structure of the input is traversed topdown starting from the toplevel clause and

recursing on sub constituents Each constituent is then passed through a systemic network enco ding syntactic

alternatives such as the various mapping of thematic roles into syntactic roles dep ending on constraints like

dative or voice The traversal of a subnetwork results in

The selection of an elementary TAG tree realizing the input constituent

The next subnetworks to enter

A set of features to propagate to this next networks

The networks of systemic TAGs are thus exactly similar to those of nigel Mann and Matthiessen

a p ortable syntactic frontend for generation applications based solely on systemic grammars The dierence

b etween the two lies in the way they build the syntactic structures satisfying the contraints accumulated

during the systemic network traversal For this task nigel recourse to lisp pro cedures whereas a systemic

TAG uses the adjoining and substitution op erations of TAGs In that resp ect systemic TAGs presents the

advantage of b eing declarative instead of pro cedural However a generator based on systemic TAGs relies

on an heterogeneous set of three declarative knowledge sources the input feature structure the systemic

network and the TAG trees It thus requires the implementation of three distinct interpreters one for each

21

Given the tricky control issues generally raised by the detailed pro cedural executions of reversible declarative mo dels

22

Adapted to the fuf notation used for feature structures in this thesis

source plus a global control mechanism to control their interaction This is in contrast with the simplicity

and ecomony of a FUGbased generator such as fuf which relies only on the single op eration of topdown

recursive unication to p erform the same task

Contrast b etween STREAK and TAGbased approaches to incremental generation

In the previous section I overviewed TAGs as a grammatical formalism and whenever relevant contrasted

them with FUGs the grammatical formalism used in this thesis I also surveyed the approaches to incremen

tal generation prop osed or implemented by the prop onents of TAGs In this section I fo cus on contrasting

these approaches with the approach to incremental generation implemented in streak It diers from the

TAGbased approaches with resp ect to three fundamental asp ects

Monotonicity of the incremental generation pro cess

Levels of pro cessing incrementally p erformed

Emphasis of the research eort

I elab orate on each of these asp ects in what follows

Monotonicity

The most imp ortant dierence b etween streak and TAGbased approaches to incremental generation is

that the latter are limited to monotonic op erations on the sentence structure under construction Both the

adjoining and substitution op erations of TAGs only further elaborate the sentence structure while preserving

the expression of its original content As opp osed to streak TAGbased generators never genuinely revise

the sentence structure altering the expression of its original content in order to accomo date a new fact

One of the most signicant contribution of this thesis is precisely to show that relying only on elab oration

only do es not allow a generator to pro duce the most condensed linguistic forms needed in summarization

applications such as newswire stories The present thesis extends the scop e of incremental generation research

to encompass work on revisionbased generation

In spite of b eing hailed as esp ecially suited for incremental generation TAGs were delib erately designed

to b e monotonic cf Shieb er and Shab es p They therefore do not present any advantage over

other monotonic grammatical formalisms for the development of a nonmonotic opp ortunistic generator like

streak Implementing streak with TAGs would have required working out yet another extension of the

formalism The grammatical formalism chosen for the development of streak FUGs was also originally

monotonic The extension of the formalism to suit the sp ecial needs of nonmonotonic applications is

describ ed in Section B of App endix B The advantage of FUGs over TAGs for the development of

streak lays primarily in the availability of the fufsurge package It allowed for a declarative and

uniform implementation of all the comp onents and provided an extensive initial syntactic coverage It is the

declarative nature of surge that rendered practical the extensions of its coverage to the target sublanguage

23

of streak In constrast the pro cedural nature of mumble the only available TAGbased generation

24

grammar of English with a signicant initial coverage would have made such extensions hardly p ossible

for an outsider

In terms of sp ecic transformations the adjoining op eration of TAGs corresp onds to the following set

of streak revision op erations

25

The cases of adjoin where a constituent of higher syntactic rank is adjoined to a constituent of

strictly lower rank eg adjoin of relative clause to NP but not adjoin of temp oral NP to clause

The cases of absorb where the syntactic category of the absorb ed and absorbing constituents match

23

And b eyond as describ ed in Section B of App endix B

24

At the time I started the implementation of streak

25

Calling this transformation adjoin when it precisely corresp onds a TAG op eration that is not adjoining is admittedly

confusing However I have to stick to this denomination for the sake of consistency with my previous publications

conjoin and append

The substitution op eration of lexicalized TAGs corresp onds to the remaining cases of adjoin and

absorb No TAG op eration corresp onds to any of the nonmonotonic revision op erations of streak

Levels of incremental pro cessing

streak also dier from TAGbased approaches to incremental generation in that it handles three levels of

pro cessing incrementally the deep semantic level the surface semantic level and the deep syntactic level

The revision to ols of streak simultaneously alter the DSS deep semantic SSS surface semantic and DGS

deep syntax layers of sentence representation Among TAGbased approaches to incremental generation

the only one where transformations also simultaneously aect several layers of sentence representation is

synchronous TAGs Both the logical form which represents the sentence at the surface semantic level and

the TAG tree which represents it at the surface syntactic level are incrementally pro cessed Compared to

streak it is thus more sup ercial and comprises only two layers The three other approaches mumble

systemic TAGs and LDLP TAGs seem to incrementally manipulate only the surface syntactic representation

of the sentence

Research emphasis

A third dierence b etween the research presented in this thesis and related work on TAGbased incremental

generation lies in their resp ective emphasis and orientation Research revolving around TAGs tend to

emphasizes formalism over data the same few example sentences are found meticulously reanalyzed over

mulitiple publications by the prop onents of each new mo dication of the formalism In constrast with its

extensive set of revision op erations and its wide coverage of syntactic forms in surge the research

presented in this thesis emphasizes data over formalism

With it wealth of formal denitions TAGcentered research is also theoretically oriented New prop os

als tend to b e evaluated for the elegance or computational prop erties of their formal characterization In

constrast with its p ervasive reliance on corpus data the present research is empirically oriented It evalu

ates the new concepts it prop oses by implementing a practical generation application summarization and

contextualization of quantitative data and by crossexamining corpus data across domains

The research presented in this thesis is thus b est viewed as bringing insights to the problem of incremental

generation that nicely complement those of TAGcentered research

Incremental generation in IPF

Aside from the TAGbased approaches another line of research in incremental generation is due to De Smedt

and Kemp en and De Smedt It is encapsulated in the system ipf Incremental Parallel Formula

tor ipf and streak investigate incremental generation topics that are almost entirely mutually exclusive

They aim at dierent typ es of output mo del dierent levels of incremental pro cessing and enco de dierent

sets of transformations

ipf is a simulation system whose goal is to provide an exp erimental testb ed for cognitive mo dels ab out

the pro duction of spontaneous speech with sp ecial emphasis on errors and hesitations This goal thus

stands in sharp contrast with that of streak which aims to mo del the pro duction of careful ly planned and

edited written texts This dierence in output medium has consequences in terms of output complexity In

sp ontaneous sp eeches sentences tend to b e short since they are limited by the shortterm memory of the

sp eaker Thus ipf fo cuses on simulating the pro duction of simple sentences such as Otto eats an Apple

or John and Mary seem to b e at the party these are example of errorfree sentences examples simulating

errors are given later on In constrast streak fo cuses on mo delling the planning and editing of the lengthy

written sentences used for summarizing and contextualizing complex events

As input ipf takes the equivalent of a skeletal lexicalized syntactic tree but sp ecied pro cedurally as a

lisp function Except for its pro cedural enco ding this input thus contains the same information as the inputs

to surge or mumble In streak all the incremental pro cessing concerns building such a skeletal lexicalized

syntactic tree Once built this tree is then nonincrementally pro cessed by surge Thus while streak

works incrementally at the deep semantic surface semantic and deep syntactic levels ipf functions at the

surface syntactic level Among systems sp ecializing at that level ipf mumble and surge form a continuum

The sole goal of ipf is cognitive mo delling which can b e done with its small grammar In constrast the

sole goal of surge is p ortability across dierent practical generation applications which requires a wide

syntactic coverage mumble attempts to reconcile b oth goals within a single system Within the grammar

ipf separates the expression of dep endency and precedence constraints and is lexically driven It thus seems

very close to the LDLP lexicalized TAGs used in wip and discussed in the previous section

In De Smedt and Kemp en the authors distinguish b etween six typ es of incremental additions to

26

sentence structure during sp eech pro duction

upward expansion where the added fragment dominates the base fragment eg John and Mary

are

downward expansion where the base fragment dominates the added fragment eg John and Mary

are at the party

insertion where the added fragment splices the base fragment eg John and Mary seem to b e at

the party

coordination where the base and added fragments are at the same structural level eg John Peter

and Mary and Anne

reformulation where part of the base fragment is contradicted by the added fragment eg You

should have sent that letter uh handed it over

lemma substitution where part of the base fragment is sp ecialized by the added fragment eg Do

you really want to buy that record uh compact disc

There are two things to note ab out these incremental additions First the selfcorrections contained in

the last two typ es of additions are semantically nonmonotic while remaining syntactically monotonic Such

a discrep ency is p eculiar to sp ontaneous sp eech and not available for written texts None of the additions

ab ove are both semantically and syntactically nonmonotonic as the most sophisticated revision op erations

implemented in streak

Second additions such as allow b oth the input and output of the addition to b e syntactically incom

plete hence the name fragment This is also excluded for written texts In streakb oth the input and

output of a revision are required to b e complete grammatical sentences

Aside from these imp ortant dierences it is still p ossible to establish some corresp ondence b etween these

additions and the op erations used in streak and TAGbased approaches upward expansion corresp onds to

some absorb cases in streak and to lowering attachment in mumble downward expansion corresp onds

to adjoin in streak and to substitution in lexicalized TAGs insertion corresp onds to some other

absorb cases in streak to splicing attachment in mumble and to adjoining in TAGs Co ordination

corresp ond to coordinative conjoin in streak

In conclusion ipf brings insights to the problem of incremental generation that complement b oth those

of streak and of TAGcentered research

Related work in evaluation for generation

Previous work in evaluation metho ds esp ecially quantitative ones for language generation is very scarce

The reasons for this scarcity were discussed in section In most cases dierent approaches are compared

26

In each example the fragment added by the increment is highlighted in b old and indicates an hesitation

only qualitatively on a few wellchosen examples The output of a generation system is considered satisfying

when it pro duce grammatical sentences that are judged semantically accurate for the restricted sublanguage

of the application Issues such as coverage extensibility and p ortability are discussed only qualitatively

and rarely in much detail The two notable exceptions to this general trend are the dissertations of Kukich

Kukich and Elhadad Elhadad b

Kukich dedicates a full chapter to evaluation She denes and estimates several quantitative parameters

These parameters compare the knowledge structures abstracted during a single round of corpus analysis with

the knowlege structures actually implemented in the generator ana They thus quantitatively measure the

coverage of one particular implementation with resp ect to one sample of the target sublanguage This is in

sharp contrast with the coverage evaluation presented in this thesis which evaluates a generation model the

revisionbased micro co ded approach and estimates the inuence of sublanguage sample size on knowledge

acquisition in general Kukich also discusses in some detail the samedomain extensibility and p ortability

of the onepass macro co ded generation mo del that she prop oses However this discussion remains only

qualitative In contrast I carried out a quantitative evaluation of these two prop erties for the revisionbased

micro co ded mo del that I prop ose

Elhadad briey discusses evaluation in the conclusion of his thesis One of the three main contributions

of his work is the development of the resp onse generator for the questionanswering system advisorI I

The system assists students planning their schedule for a semester One the many imp ortant novelties of

advisorI I is its ability to generate linguistic constructs that express a sub jective evaluation of the ob jective

content also conveyed in the resp onse These constructs that Elhadad calls evaluative expressions include

argumentative connectives eg although judgment determiners eg lots of and scalar adjectives

eg dicult Elhadad quantitatively evaluates the imp ortance of such expressions by measuring their

frequency of o ccurrence in a corpus advising session transcripts Such measurement is similar to the p er

centage of newswire summary sentences containing oating concepts andor historical information given in

Chapter of the present thesis As an evaluation these frequencies concerns only the issues addressed by

the advisorI I not the solution that the system brings to these issues Elhadad evaluates the solution only

in qualitative terms This is in contrast with the present thesis where both the issues addressed and the

solutions prop osed are quantitatively evaluated

Chapter

Conclusion

To conclude this thesis I rst revisit the contributions it makes to several subelds in natural language

generation I then discuss the limitations of the present work and the future directions in which it could b e

extended

Contributions

In Section I intro duced the contributions this thesis makes to the eld of natural language generation

I now revisit these contributions sp ecifying which of the following ve subareas of language generation they

are part of summarization applications system architecture revision and incremental pro cessing evaluation

and p ortable syntactic frontends

Contributions to summary rep ort generation

The ma jor contribution of this thesis to summary rep ort generation is that it is the rst to address to the

three following issues

Providing the historical background for the summarized event

Planning and realizing sentences as complex and informative as human writers

Performing linguistic summarization b elow the sentence rank

streak is the rst summary rep ort generator to provide the historical background of the new facts

it rep orts It thus not only summarizes a particular event but also contextualizes it This constitutes a

signicant step towards bridging the gap b etween computergenerated rep orts and their humangenerated

counterparts Conveying historical information allows for a much broader coverage A generator ignoring

such information could at b est cover of the sentences from the corpus of humanwritten summaries I

analyzed Taking into account the historical context also allows for a smarter choice of the new facts to

rep ort since the relevance of a new fact is largely dep endent on its historical signicance

streak is also the rst generator that delib erately attempts to pack as many facts as p ossible within a

single sentence As a result streak generates sentences up to the maximum complexity level observed in

humanwritten summaries words long levels deep conveying facts Previous work fo cused on the

generation of paragraphs where each sentence was signicantly simpler not over ab out words levels

deep and less informative no more than facts

Finally streak is the rst generator p erforming linguistic summarization at the group and clause ranks

In previous work this issue was investigated only at the sentence rank The combined abilities to generate

more complex sentences and to convey facts by adding only a few words in well chosen constituents allows

streak to pro duce more concise summaries conveying more information in some cases twice as much in

the same limited space

Another contribution of this thesis is the corpusbased approach underlying the development of streak

This stepbystep metho dology can b e followed for the development of rep ort generators in any application

domain It facilitates the four most delicate phases in the development of a generation system

Architecture design as explained in Chapter

Knowledge acquisition as explained in Chapter

Evaluation and testing as explained in Chapter

Denition of stylistic preferences

The last p oint dening stylistic preferences to cho ose among the dierent valid expressions of each

concept combination in a given application domain is one of the most elusive generation problem Without

the guidance of corpus data all grammatically correct forms need to b e considered and solid evidence

supp orting any preference criteria is generally lacking The corpusbased approach oers a very practical

answer to this problem consider only corpus o ccurring forms and cho ose b etween them in a semirandom

fashion mimicking their resp ective frequency of o ccurrence in the corpus

Contributions to generation architecture

The contribution of this thesis to generation architecture is the design of a new text generation mo del where a

rst draft conveying obligatory content with little expressive variation is incrementally revised to incorp orate

supplementary content with much expressive variation This new mo del improves on previously prop osed

ones in that

It makes the generation process more exible by allowing the decisions of which supplementary content

gets included in the generated text as well as where and how they are conveyed to b e made under

a combination of semantic lexical syntactic and stylistic constraints In particular it allows using

linguistic factors to monitor content planning a requirement when concisely expressing the content of a

whole paragraph in a single sentence without letting that sentence grow so complex as to b e unreadable

This added exibility thus allows the generator to express itself with a uency more comparable to

that of human writers

It makes the generation process more compositional in two orthogonal ways by building sentences

incrementally through a set of revision cycles and by separating the syntagmatic asp ect of content

realization ie the organization of content units inside the sentence from its paradigmatic asp ect

ie the lexicalization of the individual units This added comp ositionality facilitates the acquisition

and extension of the generators knowledge sources

It is more cognitively plausible for the generation of complex written sentences as explained in Sec

tion

This new mo del was shown op erational by the implementation of the system streak It was also shown to

b e far more robust than the traditional macro co ded oneshot generation mo del In particular a quantitative

comparison of these two mo dels cf Section revealed that in the average the revisionbased mo del

multiplies coverage by over and improves extensibility by almost a third

Another interesting asp ect of this new generation mo del is that it combines a unique set of prop erties

whose desirability had b een indep endently advo cated in previous work

It distinguishes b etween foreground content to convey obligatorily and background content to convey

opp ortunistically Rubino

It realizes oating facts across dierent linguistic ranks Elhadad b

It can conate the expression of several facts within a single word Elhadad b

It uses declarative knowledge sources Nogier Polguere

It is straticational and mo dular Polguere

It uses a uniform underlying formalism functional descriptions to represent utterances at all levels of

abstraction and all linguistic ranks Simonin Dale Elhadad b

It interleaves planning and realization App elt Hovy

Contributions to revisionbased generation and incremental generation

The ma jor contribution of this thesis to revisionbased generation and incremental generation is the extensive

set of revision rules to incrementally incorp orate quantitative and historical data in a given draft expressing

related information These revision rules constitute a new typ e of linguistic knowledge integrating semantic

lexical syntactic and stylistic constraints It is the rst set of linguistic resources sp ecically geared towards

incremental language generation to b e based on a systematic negrained corpus analysis These revision

rules were classied in a hierarchy of classes and subclasses Their op erationality was demonstrated

by their implementation as a functional unication grammar in streak Their relevance to quantitative

domains other than the sp orts domain in which they were acquired was demonstrated by the fact that

examples of their use were found in the nancial domain for all toplevel classes and for a ma jority of the

b ottomlevel classes

The other contribution that this thesis make to these two lines of research is to bring them together for

the rst time Previous work on revisionbased generation was limited to contentpreserving transformations

and previous work on incremental generation which was limited to monotonic draft elab orations This thesis

shows that in order to concisely accommo date a new content unit onto a draft sentence it is at times necessary

to reformulate the original content of this draft through revisions

Contributions to evaluation in generation

Another ma jor contribution of this thesis is that it addresses the issue of evaluation in generation to a far

greater extent than any previous work In particular the three evaluations describ ed in this thesis constitute

the rst attempt to

Quantify how much of a given sublanguage can b e captured by various knowledge structures used for

language generation and acquired by analyzing a sample of this sublanguage It shows that in the

average almost of a year sample of the sublanguage can b e covered by the revision rules acquired

using a sample from another year

Quantitatively comparing dierent generation mo dels and measuring the gain of mo dularity It shows

that revisionbased generation is ab out times more robust and a third more extensible than oneshot

generation

Quantitatively assessing the degree of domaindep endence of knowledge structures used for generation

and acquired from texts in a single domain It shows that ab out of the revision rules observed in

sp ort rep orts are also used in sto ck market rep orts

Contributions to SURGE

The contributions of this thesis to the p ortable syntactic generation frontend surge is to have extended its

wide coverage at two essential linguistic ranks nominals and complex sentences At the complex sentence

rank I have added semanticrolesyntacticcategory realization pairs to the original pairs of the

adverbial subgrammar This extension resulted b oth from the addition of new semantic roles and from

a b etter coverage of the realization options for the original semantic roles At the nominal rank I have

added realization pairs to the that surge already covered

Since surge already had a wide coverage at the determiner and simple clause ranks surge thus

has a wide coverage at all four main linguistic ranks This extended coverage combined with the ease of use

that it derives from the uniformity bidirectionality and declarativity of the underlying functional unica

tion formalism makes it the b est p ortable syntactic pro cessing frontend for the development of generation

applications available to day Since its development surge has b een used as the syntactic frontend in

three generation applications other than streak

Limitations and future work

In this section I survey the limitations and future agenda of the research presented in this thesis I rst discuss

the limitations of the fufsurge software starting with the implementation related ones which aect the

p erformance of streak and continuing with the more design related ones which aect the versatility of the

package indep endently of its particular usage for the development of streak I then discuss the limitations

of the current implementation of streak not related to its reliance on fuf I then depart from the fo cus of

this thesis and discuss in what directions it could b e extended These directions have essentially two dierent

asp ects addressing the whole range of language generation b eyond the phrase level content planning

and content realization subtasks on which this thesis fo cused and carrying out further evaluations

Limitations linked to the use of FUF

Sp eeding up the revision pro cess

One of the main limitations of the current implementation of streak is that it is quite slow This is esp ecially

true for the generation of sentences approaching the complexity limit observed in the corpus Example run

presented in Section where streak incrementally generates a word sentence of syntactic depth

through ve revision steps to ok for example hr min sec All the run times discussed here were

obtained with a compiled version of the lucid common lisp co de of streak including the fufsurge

package and p erformed on a sparc sun workstation In what follows I briey discuss the main source of

this ineciency and what could b e done ab out it Fortunately it turns out to largely result from an artifact

of the current implementation of fuf which was inconsequential for the monotonic computations for which it

was originally intended but which renders it very inecient for the nonmonotonic computations p erformed

in streak Several pro jects currently under way at Ben Gurion University of the Negev are exploring various

strategies to reimplement fuf more eciently As we shall see they should bring tremendous improvement

to the run time of streak

In example run mentioned ab ove the generation of the initial basic draft b efore revision starts to ok

only sec It is thus clearly the way revision is implemented that slows down the overall generation pro cess

However it is not so much the numb er of revisions involved that is a factor as it is the complexity of

the sentences b eing revised Example run commented in Section C of app endix C involving seven

revisions completed in only min sec The dierence b etween run and run is that in run revisions

where chained to generate an increasingly complex sentence whereas in run each revision was carried out

indep endently on the same basic draft to demonstrate the paraphrasing p ower of the system

streak is thus fast in generating draft sentences yet slow to revise them esp ecially as they b ecome more

complex This observation p oints to the fuf functions that were sp ecially develop ed to p erform the cut and

paste op erations needed for implementing nonmonotonic revisions as the probable source of this ineciency

These functions presented in Section B of App endix B are insertfd which pastes a smaller FD inside

a larger FD as a subFD under a given path and relocate which conversely cuts as a selfcontained smaller

FD a subFD under a given path inside a larger FD I measured the resp ective prop ortion of total run

time sp ent in these two functions on example run The amount of time sp ent in calls to insertfd was

negligible In contrast the amount of time sp ent in calls to relocate was tremendous it accounted for

of the total run time Without these calls this total run time could b e brought down under min

When and why is relocate called during the course of the incremental revision pro cess and could it b e

FD b a

d e a c

a c

FD b c

d e b c

a b

FD b a

a c

d e b c

Figure Canonic and noncanonic list representations of the same FD

avoided To answer these questions it is necessary to recall that the building blo cks of the revision pro cess

are the following revision actions intro duced in Section

addfd which pastes using insertfd an additional feature whose value is not a path inside the

revision workspace

addpath which pastes using fu an additional equation b etween two features inside the revision

workspace

delfd which removes using topgdpp a feature inside the revision workspace

cpfd which pastes in the revision workspace a copy of another feature in that workspace the copy is

obtained using relocate and the pasting is done using insertfd

mapfd which calls the phrase planner or lexicalizer on a copy of a given feature in the revision

workspace and then pastes the result of the call to another lo cation in the workspace the copy is

obtained using relocate the phrase planner or lexicalizer maps this copy using unifd and the

mapping result is pasted using insertfd

It is also necessary to recall that there are several p ossible list representations for a given FD For example

the three lists in Fig represent the same FD In contrast an FD is uniquely represented by a graph The

graph corresp onding to the lists of Fig is given in Fig

FD in Fig is the canonic list representation all its paths are direct and its nonpath values are

0

lo cated under the shortest or in case of a tie the rst in alphab etical order path among the equated

features So for example the value is placed under a c rather than under d e that also p oints to it

in the graph FD is noncanonic b ecause physical representatives are not all under the canonic path For

1

example is placed under b c instead of a c violating the alphab etical convention FD is noncanonic

2

b ecause it contains an indirect path b c where the value of b is itself a p ointer a

The version of fuf that I used to implement streak represents FDs as lists The revision actions

listed ab ove app ear in the Right Hand Side of streaks revision rules Before the revision rule interpreter

applies any of these actions it is imp erative that the list L representing the FD which enco des the current

d

draft b e in canonic form Only such form allows predicting by lo oking at the features in a path whether the

value down this path in L is itself a path or not Without such knowledge it would b e imp ossible during the

d

writing of a revision rule to determine whether the appropriate action to apply under a given path should

b e an addfd or an addpath This diculty is illustrated by the set of revision actions to build the canonic

FD from scratch

0

addfd c a

addpath a b

addpath a c d e

These have nothing in common with the corresp onding the set of revision actions to build the noncanonic

FD representing the same FD

1

addpath b a

addfd c b

addpath b c d e

Though writing revision rules requires assuming that they will b e applied to a canonic form FD the

version of fuf in which streak was implemented the unication op eration do es not preserve the

canonic FDs b eing unied This unfortunate prop erty is illustrated by the following example run

setf FD fu FD f a c

b a

d e a c

a c

f a c

setf FD canonize FD

b a

d e f

a c f

f

In FD the physical represent of the class of equivalent paths d e a c f is not lo cated

under the shortest path of the class f but instead under the longer a c FD is thus noncanonic

The canonic list representation of the FD resulting from this unication op eration is FD Since several

revision actions such as addpath and mapfd rely on unication after each such action the draft must b e

recanonized b efore the application of the next action This canonization is done by calling relocate which

is guaranteed to pro duce a canonic FD as explained in Section B of App endix B on the draft FD

with a empty ie relo cation path

During the application of a revision rule relocate is thus called in two dierent typ es of circumstances

To canonize the onionlayer FD representing the whole draft after the application of an action that

may have altered its canonicity

a b d

c e

1

Figure Unique graph corresp onding to the lists of Fig

To cut from the whole draft structure a substructure which constitutes the scop e of a cpfd or mapfd

revision action

In example run of the overall run time is sp ent in relo cations called for canonization purp oses and

in relo cations called for the purp ose of cutting subFDs Skaliar Skaliar describ es an ongoing

eort to implement a new version of fuf which internally represents FDs directly as quotient sets as opp osed

to lists as in the current version Such graphbased version would prevent the need for recanonization thus

already more than halving the worstcase run time of streak Furthermore as explained in Section B

of App endix B relocate consists of three internal steps

Convert the larger FD to quotient set form

Compute the quotient set of the subFD under the cutting scop e

Convert back the subFD quotient set into list form

With a graphbased version since all FDs are already in quotient set form only the second step is

required Since this second step is of negligible duration compared to the two others the remaining run time

sp ent for relo cation with the current listbased version would also virtually disapp ear Using such a version

the run time for the most complex sentences would thus fall down to under min

This last run time would b e for an interpreted graphbased version Two other pro jects currently at their

inception involve the development of two compilers for fuf programs one whose ob ject co de is compiled

commonlisp co de and another whose ob ject co de is compiled C co de The compiled lisp version is exp ected

cf Elhadad p ersonal communication to reduce the average run time of a typical fuf program by and

the compiled C version by With these exp ected optimizations the run time of streak to generate the

most complex sentences observed in the corpus of humanwritten summaries would fall to the order of the

Note that for the generation of written rep orts much slower run time is acceptable in the context of

practical applications for which the architecture of streak was designed Written rep orts can b e generated

oline Even though there is some time pressure involved in generating texts such as newswire stories it

is in no way comparable to the pressure involved in other generation applications such as the pro duction of

resp onses in an interactive dialog system

Negation and variable attributes

The revision rules of streak are as they stand already very generic This genericalness is achieved by two

prop erties of streaks architecture The rst prop erty is the topdown traversal of the draft constituent

structure built in the revision rule interpreter This traversal allows the revision rules to b e lo cal to a

given typ e of constituent indep endently of where this constituent app ears in the particular draft structure

at hand The second prop erty is the mapfd action that allows revision rules to sp ecify only where and

how to attach the new constituent expressing the content unit added to the draft by the revision leaving

the resp onsibility of sp ecifying the internal structure and linguistic form of this new constituent to phrase

planning and lexicalization rules

Two asp ects of fuf as the underlying programming language of streak prevent the expression of even

more generic revision rules The rst asp ect is the absence of an elegant way to express negative constraints

in fuf and consequently in the LHS of streaks revision rules The only way to enco de such constraints

in the current implementation is to rely on the pro cedural metaattribute control as in the example LHS

b elow

Additional game statistic by a different player

than the one with which it is coordinated

lhs CONTROL not eq getadsscarrier n

getconjoinedstatcarrier

Had fuf a metaattribute to express negation the constraint ab ove could b e expressed declaratively by

a path inequality instead of pro cedurally via the calls to the getadsscarrier and

getconjoinedstatcarrier functions

Another asp ect of fuf that limits the genericalness of streaks revision rules is the lack of either wild

cards in paths or of a facility to dene a hierarchy over attributes similar to the definefeaturetype

construct for the denition of a hierarchy over atomic values

Consider the following example disjunction

alt lhs bls partic range GIVEN

rhs cpfd bls partic range predmodif instrument np

lhs bls partic created GIVEN

rhs cpfd bls partic created predmodif instrument np

lhs bls partic affected GIVEN

rhs cpfd bls partic affected predmodif instrument np

which enco des one action of the various sp ecializations of the revision rule adjunctization of object as

instrument in terms of the thematic role lled by the ob ject b eing adjunctized The same constraint could

b e more generically expressed without this alternation by dening the roles range created and affected

as sp ecialization of a generic object role for example as follows

defineattributetype object range created affected

lhs bls partic object GIVEN

rhs cpfd bls partic object predmodif instrument np

Unfortunately the denition of an attribute hierarchy would require fundamentally rethinking the seman

tics of the unication op eration For example given a hierarchy dened as

defineattributetype a a a

what should b e the result of the unication of a a with a

Should it b e a a a fail something else A careful examination of such fundamental

semantic questions should b e carried out b efore any attempt to implement this particular extension of the

formalism It thus raises dicult design issues In contrast the addition to fuf of such metaattribute for

negation seems a simple implementation matter

Limitations prop er to STREAK

Extending the ontology of quantitative domains

The sp orts domain ontology covered by streak is restricted to three main classes of information

Box score statistics data summarizing the various asp ects of the overall p erformance of a player or a

team during the rep orted game eg Dennis Ro dman grabb ed reb ounds

Historical records the prop erty of one such statistic to constitute a record over some p erio d of time

extending at least across several games eg Shaquille ONeal scored a seasonhigh p oints

Historical streaks the prop erty of a p erformance to extend or interrupt a series of similar p erformances

by the same team or player eg the Golden State Warriors defeated the Charlotte Hornets

for their seventh straight win

As explained in Section these three classes were chosen for two reasons First they are not p eculiar

to sp orts Any quantitative domain has content classes which closely parallels them For example nal

game statistics in the sp orts domain directly corresp ond to nal sto ck index values in the nancial domain

and streaks of victorydefeat for a sp orts team directly corresp ond to streaks of raisedecline for a nancial

index Second they are the most frequently rep orted facts in the corpus of humanwritten summaries that

underly the development of streak in of the corpus sentences b ox score statistics were the only

typ e of nonhistorical information and in of them records and streaks were the only typ e of historical

information

It would b e interesting to extend the ontology of streak to less frequent typ es of b oth historical and non

1

historical information which also have the prop erty of b eing p ervasive in multiple quantitative domains

Notably the extended ontology could cover the following additional classes

Nonhistorical records similar to historical records but within to the temp oral scop e of the rep orted

event eg Hakeem Ola juwon scored a game high p oints

Nonhistorical streaks similar to historical streaks but within to the temp oral scop e of the rep orted

event eg scored all nal p oints for Houston

Near records statistic whose value comes close to equalling a record eg Ola juwon added assists

one shy of his career b est

Near streaks set of similar events who quasitotality are of the same typ e the San Antonio Spurs

defeated the Utah Jazz for their victories in the last games

Historical counts absolute numb er of p erformance ab ove a given statistical threshold eg Scottie

Pipp en recorded his eighth career tripledouble with p oints reb ounds and assists

Developing global backtracking facilities

streak consists of four indep endent mo dules the phrase planner the lexicalizer the syntactic grammar

and the reviser In the rst three the internal control mechanism is the topdown recursive functional

unication built in fuf The control mechanism of the reviser is more complex First a revision cycle is

divided into two stages the search for an applicable rule followed by the application of the found rule when

the search was successful The outer control lo op of the rst stage consists of a topdown traversal of the

draft structure The inner control lo op for each sub constituent in this structure consists of the nonrecursive

functional unication also built in fuf of the revision rule base with the pair

current draft additional content unit

The control mechanism of the second stage is the simple traversal of the revision action list asso ciated

with the rule found during the rst stage Some elements in this list can b e a mapfd action which triggers

a call to either the phrase planner or the lexicalizer in the current revision context The overall control

mechanism for the generation of a sentence is quite complex it consists of an initial drafting pass where

phrase planner and lexicalizer are called in turn followed by a revision pass made of a numb er of revision

cycles

The uexhaust fuf function can b e used to trigger systematic automatic backtracking separately in the

phrase planner lexicalizer and syntactic grammar However the current implementation do es not include a

facility for systematic automatic backtracking neither for the reviser nor for the overall generation system

During a given revision cycle streak always applies the rst applicable rule it nds and moves on to the

next cycle A useful extension of the implementation would b e to develop facilities to backtrack

Within a single revision cycle

Over a whole revision pass made of several revision cycles

Over the whole generation pro cess encompassing b oth the draft and revision passes from a given pair

Fixed facts DSS stack of oating fact DSSs

A facility to backtrack over one revision cycle would allow exhaustively demonstrating and testing the

revisionbased paraphrasing p ower enco ded in the system It would allow for example generating the com

plete list of revised drafts p ossible from a given pair additional content unit current draft Maintaining

1

Though p ossibly necessary for a realworld application extending the ontology to less frequent content classes p eculiar to

sp orts would not from a purely research standp oint present any particular interest

the current p oint in the search space explored by such a backtracking facility is far from trivial This is

b ecause there are four distinct sources of nondeterminism at play during a revision cycle

The various draft sub constituents onto which the new constituent can b e attached

The various revision rules that can b e used to p erform this attachment on a given sub constituent

The various phrase planning rules that can b e used to dene the internal structure of this new con

stituent

The various lexicalization rules that can b e used to realize each element in that internal structure

A slightly mo died version of the uexhaust fuf function could serve as the building blo ck for the

implementation of such a backtracking facility in streak

A facility to backtrack over a whole revision pass encompassing several revision cycles would ensure that

the maximum numb er of oating facts incorp orable within a given draft gets included in the nal summary

With the current implementation this is not guaranteed When several revision rules are applicable during

revision cycle N the rst one found by streak may not b e the one whose application would result in the

most concise revised draft It is nonetheless applied It then b ecomes p ossible that during cycle NM

only one revision rule is applicable but it would result in a revised draft b eyond the maximum complexity

observed in the corpus of humanwritten summaries In this case streak will not b e able to incorp orate

this last oating fact due to lack of space However it is conceivable that had an alternative revision rule

resulting in a more concise form b e applied at cycle N the complexity threshold would not b e reached at

cycle NM With a backtracking facility across several revision cycles it would b e p ossible to undo a choice

of a revision rule at a given cycle and resume the revision pro cess from this p oint in an attempt to include

more oating facts

Finally a systemwide backtracking facility covering b oth the drafting and the revision passes would

allow exhaustively demonstrating and testing the overall paraphrasing p ower enco ded in the system This

p ower is very high due to the multiplicity of paraphrasing sources at work during the generation of a given

sentence There are two such sources at play at draft time alternative phrase planning rules and alternative

lexicalization rules And there are four sources at play at each revision cycle alternative draft constituents

on which to p erform the revision alternative revision rules and again alternative phrase planning rules

and lexicalization rules for realizing the oating fact in the context of the revision For example supp ose

that for a given pair Fixed facts DSS Stack of oating fact DSSs there are only two alternative

phrase planning rules and lexicalization rules available b oth at draft time and at each revision cycle Further

supp ose that at each cycle there are also only two alternative draft constituent on which to apply only

two alternative revision rules Even in that case a sentence generable in revision cycles would nonetheless

5

have paraphrases That so many forms can b e generated from the

same semantic representation with only rules attests to the extreme

comp ositionality of the generation approach implemented in streak

Implementing fact generation and discourse planning

As explained in Section the overall text generation task comprises the following subtasks

Content pro duction retrieving all the facts to p otentially rep ort

Content preselection cho osing a restricted set of candidate facts

Content nal selection picking the candidate facts to b e included in the summary

Discourse level content organization assigning facts to sentential slots in the summary

Phrase level content organization assigning facts to sentence constituents

Lexicalization cho osing the op enclass words of each sentence

Semantic grammaticalization cho osing the syntactic structure of each sentence

Morphosyntactic grammaticalization applying grammar rules inecting op enclass words and cho os

ing closedclass words

Linearization sp elling out the inected words in order

In chapter I prop osed in Section a complete generation architecture for a system p erforming

all these tasks However in the rest of the thesis I fo cused on tasks and to The current version of

streak implements only these tasks In the following subsections I briey discuss the interesting issues

raised by extending this implementation to tasks and as well

Starting from the raw numb ers

Currently streak accepts as input two sets of conceptual networks resp ectively representing the xed facts

to convey obligatorily and the oating facts to convey opp ortunistically Transforming streak from the

research prototyp e that it currently is into a complete generator summarizing and contextualizing basketball

games directly from raw quantitative data would require implementing an additional mo dule in the architec

ture prop osed in Section the fact generator The input to this fact generator is a b oxscore containing

the nal statistics of the game In addition this mo dule has to have access to a historical database ab out

basketball The generation tasks to b e implemented by this mo dule are content pro duction and content pre

selection It would p erform conceptual summarization complementing the linguistic summarization already

implemented in the current version

It would probably b e b est decomp osed into sp ecialized comp onents resp ectively resp onsible for

Reading the input table and pro ducing a record for each statistic

Querying the database for related historical statistics

Reformatting each statistic in the symb olic form that the phrase planner and reviser need as input

Discriminating b etween xed and oating facts

Computing the relevance grade of each oating statistic in the input table and each historical fact

related to it

The task of the rst three is fairly straightforward and would probably b e b est implemented as C pro

grams In contrast the task of the last two involve enco ding domain exp ertise and would b e b est implemented

declaratively p ossibly as fuf programs

From a research p ersp ective the implementation of these comp onents do es not seem to raise any esp ecially

challenging issue except p erhaps the assignment of relevance grades However the availability of such mo dule

in streak would allow the carry out an ambitious systematic evaluation of the overall implemented system

by running it on the b oxscores of the new games played everyday

Multisentential revisionbased generation

Currently streak summarizes its input data by a single complex sentence It generates the typ e of lead

sentence observed in the corpus of humanwritten newswire rep orts As explained in Section such

sentences were chosen as the fo cus of this thesis b ecause they themselves summarize the rest of the rep orts

The most intriguing research direction to follow within the new revisionbased framework put forward in this

thesis is to aim at generating whole newswire rep orts made of multiple sentences Extending streak to carry

out such a task would require implementing the last mo dule in the architecture prop osed in Section

the discourse planner

In itself the implementation of such a discourse planner should not p ose any fundamental diculty Since

within the architecture prop osed in this thesis the discourse planner needs to handle only the organization

of xed facts oating facts b eing handled by the reviser the standard technique of textual schemas

McKeown would b e very appropriate The limitation of this technique is precisely its diculty to

cop e with oating facts

Instead the diculty of moving from sentence generation to multisentential text generation with a draft

and revision approach lies in how this change aects the task of the reviser It raises a set of new issues

that I now review

Adequately revising a multisentential draft would involve identifying and enco ding discursive constraints

on the maximum sentence complexity allowed for each sentential slot in the rep ort structure Even though

it would allow for maximal conciseness the strategy of generating a rep ort consisting exclusively of very

complex sentences would b e stylistically inappropriate In humanwritten summaries complex sentences

alternate with simpler ones probably to avoid taxing the readers concentration excessively When they

conict what is the b est tradeo b etween informativeness and stylistic appropriateness

In a multisentential setting the control mechanism of the revision pro cess needs to traverse two lists

instead of one the list of draft sentences and the list of oating facts Which of these two traversals should

b e the outer lo op of the reviser

Though I have not veried in the corpus whether it is indeed the case or not in a multisentential setting

it b ecomes p ossible that some obligatory facts are oating ie they are found in every corpus rep ort but

in dierent sentential slots in dierent rep orts Would these facts b e b est handled by the reviser or by the

discourse planner

It may also b e the case that some sentential slots in the rep ort structure have no xed facts asso ciated

with them What should constitute the anchor p oint for initiating the use of such sentential slots empty

in the rst draft to convey a oating fact

Given the much larger search space of alternatives provided by a multisentential setting how much and

what typ e of backtracking should b e allowed without rendering the overall generation pro cess inecient

Currently the only goal of the reviser is to make the draft more informative by incorp orating new

oating facts As explained in Section this is b est done at revision time b ecause constraints on such

incorp orations are b oth nonlo cal and heterogeneous semantic syntactic stylistic and discursive The non

lo cality is not a problem for the reviser since it can search the entire draft structure for a constituent on which

to apply a revision rule The heterogeneity is also not a problem since all typ es of constraints are captured

by the multilayered draft representation that the reviser manipulates Constraints on pronominalization

and abridged forms of subsequent references are also nonlo cal and heterogeneous They could also b e b est

handled by the reviser In other words default forms of reference could b e systematically generated at

draft time and then carefully abridged at revision time This new agenda for the reviser raises the following

question should the resp ective application of reference abridging and information adding revision rules b e

separated or interleaved and how

Further evaluations

One last interesting direction in which the research presented in this thesis could b e pursued is to carry out

further evaluations

The evaluations presented in this thesis fo cused on quantitatively evaluating b oth the new generation

model and the new linguistic know ledge structures onto which the streak system is based However the

implementation itself was not directly quantitatively evaluated Because in its current version streak

accepts as input handco ded symb olic inputs as opp osed to the raw numb er tables available online and

do es not include a facility to trigger systemwide backtracking it has only b een tested on a representative

yet limited set of inputs and not all p ossible outputs have b een generated from each of them

While streaks accuracy is largely insured by the fact that streak is based exclusively on constraints

individually observed in the corpus some constraint combinations which have not b een prob ed by the current

test set could p ossibly yield to p o or wording or style This is the unavoidable downside of the robustness

gained through the use of declarative partial constraints comp ositionally combined through functional uni

cation

Interactions b etween lexical and stylistic factors may b e esp ecially tricky to unearth as shown by the

following constraint discovered with the existing test set for the nominal expression of a streak

extended their winning streak to straight game

extended their winning streak to straight

extended their winning streak to consecutive game

extended their winning streak to consecutive

Even though in the four sentences ab ove the adjectives straight and consecutive are synonymous and

o ccupy the same head noun premo difying syntactic function straight can b e used in conjunction with

ellipsis of the head noun but consecutive cannot Since unfelicitous forms such as sentence ab ove never

app ear in the corpus foreseeing the idiosyncratic constraint interaction from which they result b efore testing

is very dicult Fortunately while the declarative and comp ositional framework of functional grammars may

lead to some overgeneration problems due to such unforeseen constraint interactions it also makes their

correction easy once detected In the example ab ove all that was needed was a single line of additional co de

testing the premo difying adjective when deciding whether to elide the head noun

As noted earlier a prerequisite to systematic implementation testing is the development of the fact

generator that would make streak a complete generation system going all the way from the raw statistics

to the natural language text summarizing and contextualizing them The same evaluation metho dology

based on acquisition vs test corp ora used in this thesis at the more abstract level of knowledge structures

could b e used at the more detailed level of actual system runs For example it would b e interesting to

systematically compare over a whole season of new rep orts the output that the system pro duces from the

b oxscore for each game with the corresp onding rep orts pro duced by humanwriters

Finally consider again the p ortability evaluation of the revision rules used by streak presented in

Section It measured the degree of domain dep endence of these rules It would also b e interesting to

evaluate their degree of language dep endence This would require developing a multilingual version of the

system It is therefore a rather long term goal if only b ecause of the overhead involved in developing a

syntactic grammar equivalent to surge in at least one other language than English It could nonetheless

bring imp ortant insights in terms of exactly what typ e of constraints should b e enco ded at each layer of the

utterance representation internally used by the generator

App endix A

A Complete Inventory of Revision

Op erations

The corpus analysis presented in Chapter resulted in a hierarchy of the revision op erations The top of this

hierarchy shown again in Fig A distinguishes b etween monotonic revision tools which consist of attaching

a new constituent to the draft while preserving the surface form of its original content and nonmonotonic

revision tools which involve mo difying the expression of the original content in order to accommo date the

additional content in the draft syntactic structure The b ottom distinctions in this hierarchy sub divide

each class of revision tools into a set of revision operations A revision op eration is a revision to ol p ossibly

accompanied by a side transformation which corrects whatever information redundancy ambiguity or invalid

lexical collo cation the revision to ol application may have intro duced

In Chapter only one class of monotonic revision to ol one class of nonmonotonic revision to ol and

one side transformation were discussed in detail In this app endix I give a detailed presentation of all

the revision to ols and side transformations Moreover I present a more complete version of the revison

op eration hierarchy The hierarchy outline in Section results from the original corpus analysis covering

only a single season of basketball rep orts The hierarchy presented in this app endix also includes the new

revision op erations discovered during the analysis of two additional seasons of basketball rep orts which was

p erformed for coverage evaluation purp oses as explained in Section Finally while the revision examples

in Chapter come only from the basketball domain this app endix also gives example from the sto ck market

domain which as used for the p ortability evaluation presented in Section

For each class of revision to ol in g A I provide the following

Its general revision schema

How it sp ecializes into sub classes down the revision op eration hierarchy the numb er of corpus o ccur

rences of the most sp ecic classes are given b elow them

Monotonic Revisions Non−monotonic Revisions

AdjoinAbsorb Conjoin Append RecastAdjunctization NominalizationDemotion Promotion

Figure A Revision op eration hierarchy toplevels

An example of draft and revised phrases illustrating the transformations involved in the revision

For each sub class a surface decrement pair from the basketball corpus ie two semantically and

syntactically minimally diering phrases from which the revision op eration has b een identied

When the to ol is p ortable a corresp onding surface decrement pair from the sto ck market domain

The last three items are also provided for each side transformation

A revision schema is a gure with two generic syntactic trees The left tree shows the draft substructure

common to all applications of the revision to ol before revision it is called the base structure and the right

tree shows the same substructure after revision it is called the revised structure The following pictorial

conventions are used in these revision schemas

Constituents are circles or ovals

Structural relationships are lines

1

The lines corresp onding to role relations are lab eled

The Paratactic relation b etween two constituents is represented by grouping them under a pseudo

2

head tagged by an sign

The elements added by the revision either constituents or relations are b oldfaced

Similarly font conventions are used in the source and target phrase examples from the corpus

The added constituents are in b old

The deleted constituents are in italics in the source phrase and where relevant their lo cation is marked

by the symb ol in the target phrase

The displaced constituents are in smallcaps except for numeric constituents which are under

lined

are underlined The constituents which swapp ed syntactic category

When surrounding context is necessary the phrase sub ject to the revision is bracketed

Co ordinated elements are bracketed

The lo cation of implicit constituents elided controlled etc is marked the symb ol

A Monotonic revisions

Pro cedurally a revision to ol is dened by a set of transformations to change the draft structure into the

revised structure A monotonic revision to ol consists only of an introductory transformation which attaches

to the draft the new constituent realizing the additional content while preserving the draft content realization

As shown in g A I identied four main typ es of monotonic revisions Adjoin Absorb Conjoin

and Append They dier from each other in terms of either the typ e of the base structure on which they

can b e applied or the typ e of revised structure they pro duce I present each of them in turn in the following

subsections

A Adjoin

Adjoin was already presented in Section It applies only to hypotactical ly structured bases and

consists of the intro duction of an additional optional constituent A A is adjoined to the base constituent

c c

under the base head B The revision schema of an Adjoin is shown in Fig A

h

1

ie hyp otactic relations either argument or adjunct

2

In co ordinations this pseudohead is realized by a conjunction in app ositions simply by a comma or a similar punctuation

mark Base Structure Revised Structure

Bh Bh

Adjunct

Bc1 . . . Bcn Bc1 . . . Bcn Ac

Figure A Adjoin schema

The pair of phrases b elow illustrates how Adjoin can b e used to add a record information to a game

statistic nominal

source Armon Gilliam scored p oints

target Armon Gilliam scored a franchise record p oints

The noun comp ound franchise record corresp onding to A in the schema is simply added as a pre

c

mo difying classier of the p oints nominal corresp onding to B in the schema This is a case of

c

Adjoin of Classifier

Adjoin is a versatile to ol The variety of Adjoin revision op erations is shown in Fig A The analyzed

corp ora contained cases where the new constituent was added to a nominal abbreviated NP in the revi

sion hierarchy and others where it was added to a clause abbreviated S in the revision hierarchy When

adjoined to a nominal the new constituent could ll the following syntactic functions partitive classier

describ er and qualier For the qualier syntactic function the added constituent came in two syntactic

forms nonnite clause and relative clause Finally a relative clause could express a given typ e of additional

information equally well when adjoined to dierent draft sub constituents of the same syntactic category

nominal but emb edded at dierent levels in the draft structure For example to add streak information

to the draft phrase

to p ower the Golden State Warriors to a triumph over the Los Angeles Clipp ers

the same revision to ol Adjoin Relative Clause to Nominal can b e applied either to the emb edded nom

inal referring to the losing team underlined and yielding

who lost for to p ower the Golden State Warriors to a triumph over the Los Angeles Clipp ers

the ninth straight time

or alternatively to the toplevel nominal conveying the game result underlined as a whole and yielding

to p ower the Golden State Warriors to a triumph over Los Angeles that extended the Clip

p ers losing streak to nine games

The rst revision is thus called Adjoin of Relative Clause to BottomLevel Nominal and the second

Adjoin of Relative Clause to TopLevel Nominal

When adjoined to a clause the new constituent could ll the following syntactic functions frequency

result time and coevent with only a single syntactic category for the adjoined constituent in each case

Adjoin was used to add b oth typ es of historical information streaks and records as well as nonhistorical

information It was accompanied by three typ es of side transformations reference deletion reference abridg

ing and constituent reordering see Section A for a discussion of these side transformations

A surface decrement pair for each sub class of Adjoin encountered in the corp ora follows

Adjoin of Classier NP−Adjoin S−Adjoin

partitive classifier describer qualifier origin frequency result time co−event 1 25 5

relative−S Non−finite−S PP PP Non−finite−S Finite−S Non−finite−S 1 4 1

top bottom

abridged full deleted +reorder full abridged full abridged abridged full deleted ref ref ref ref ref ref ref ref ref ref

20 1 3 9 2 2 10 13 3 12 2 1

Figure A Adjoin revision op eration hierarchy

source sports Patrick Ewing scored p oints

target sports Armon Gilliam scored a franchise record p oints

source finance Volume climb ed to million shares

target finance Volume climb ed to a record billion shares

Adjoin of Describ er

source sports the Boston Celtics

target sports the hotsho oting Boston Celtics

source finance the Hong Kong market

target finance the buoyant Hong Kong market

Adjoin of Relative Clause to TopLevel Nominal

source sports to lead the New York Knicks to a victory over the Charlotte Hornets

target sports to pace the Houston Ro ckets to a triumph over the Chicago Bulls that

completed a twogame sweep

source finance the market got supp ort from overseas markets

target finance the market got an initial b o ost from overseas markets which proved

strong for the second straight day

Adjoin of Relative Clause to TopLevel Nominal with Abridged Reference

source sports to lead the New York Knicks to a victory over the Charlotte Hornets

target sports to lead the Cleveland Cavaliers to a win over Detroit that gave the

Pistons their third straight defeat

source finance sto ck prices eased on some prot taking on the Thailand sto ck exchange

target finance sto cks eased on some prot taking following the markets recent rally which

lifted prices to ve consecutive record close

Adjoin of Relative Clause to TopLevel Nominal with Deleted Reference

source sports to lead the New York Knicks to a victory over the Charlotte Hornets

target sports to pace the to a triumph that sent the Min

nesota Timb erwolves to their sixth straight defeat

Adjoin of Relative Clause to BottomLevel Nominal

source sports to lead the New York Knicks to a victory over the Charlotte Hornets

target sports the Gold Index lost p oints to

source finance to give the Philadelphia ers a victory over the Miami Heat who lost

their sixth straight game

target finance the closely watched German Sto ck Index which gained p oints Mon

day to set its fth straight record high lost p oints to

Adjoin of Relative Clause to BottomLevel Nominal with constituent reordering

source sports and the Jazz routed the Minnesota Timb erwolves

over the Miami Heat who have target sports and the Boston Celtics triumphed

lost consecutive road games

Adjoin of Nonnite Clause to Nominal

source sports to lead the New York Knicks to a victory over the Charlotte Hornets

target sports to lift the Chicago Bulls to a victory over the Milwaukee Bucks

running their winning streak to six games

Adjoin of Nonnite Clause to Nominal with Abridged Reference

source sports leading the New York Knicks to a victory over the Charlotte Hornets

target sports lifting New York a triumph over the Los Angeles Lakers snapping

the Knicks two game losing skid

Adjoin of partitive to Nominal

source sports Bill Laimb eer scored p oints

target sports Derrick Coleman had of an NBA team record blo cked shots

Adjoin of Frequency PP to Clause

source sports as the New York Knicks routed the Philadelphia ers

target sports as the Pho enix coasted past the Washington Bullets for its fth

straight victory

source finance while the Gold Index added p oints to

target finance while the Amex Market Value Index inched up to for its sixth

straight advance

Adjoin of Origin PP to Clause

source sports Bill Laimb eer scored p oints

target sports contributed p oints o the b ench

Adjoin of Frequency PP to Clause with Abridged Reference

source sports to lead the New York Knicks to a victory over the Charlotte Hornets

target sports to lift New Jersey to a triumph over the Jazz for the Nets rst

win at Utah since

source finance consumer sp ending edged up p ercent

target finance consumer sp ending rose p ercent in the indexs sixth consecutive rise

Adjoin of CoEvent Clause to Clause

source sports helping the Indiana Pacers cruise past the Washington Bullets

target sports helping the Los Angeles Clipp ers defeat the Knicks snapping a

game losing streak

source finance the key Straits Times Industrials Index soared p oints to

p oints

target finance the blue chip Hang Seng Index soared p oints to snapping

its three session losing streak

Adjoin of CoEvent Clause to Clause with Abridged Reference

source sports to lead the New York Knicks to a victory over the Charlotte Hornets

target sports helping the Milwaukee Bucks to a victory over San Antonio snapping

the Spurs eight game winning streak

Adjoin of CoEvent Clause to Clause with Deleted Reference

source sports and the Boston Celtics triumphed over the Miami Heat

target sports and the Golden State Warriors triumphed snapping the Boston

Celtics game home winning streak

Adjoin of Nonnite Result Clause to Clause

source sports to spark the Chicago Bulls past the Seattle Sup ersonic

target sports leading the Los Angeles Clipp ers past the Indiana Pacers to snap a

seven game losing streak

source finance sto cks continued to suer from b outs of prot taking

target finance the over the counter market caved in to b outs of prot taking to snap its

six session winning streak

Adjoin of Time Clause to Clause with double abridged reference

source sports to lead the New York Knicks to a victory over the Charlotte Hornets

target sports to lead Utah to a trouncing of Denver as the Jazz defeated the

Nuggets for the th straight time at home

source finance Trading volume was a busy millions shares

target finance Volume amounted to a solid million shares as advances outpaced

declines to

A Absorb

The general revision schema of Absorb is shown in Fig A Absorb consists of replacing the base constituent

B by a new hypotactic structure in which B app ears as the jth sub constituent B is thus absorbed by this

c c c

new hyp otactic structure which o ccupies whatever p osition B o ccupied in the base structure Absorb diers

c

from the others monotonic revision to ols in that it is the only one which results in a revised cluster with an

hyp otactic relation b etween the base constituent and the added constituent with the added constituent as

head

The pair of phrases b elow illustrates how Absorb can b e used to add record information to a game statis

tic nominal

source Larry Bird scored p oints Monday night including seven pointers

target Larry Bird scored p oints Monday night including matching his own club record with

seven pointers

The nominal seven p ointers corresp onding to B in the schema is absorb ed by the added clauses

c

matching his own club record and attached to it via the prep osition with as an instrument adjunct The Base Structure Revised Structure

Ah

Bc Ar1. . . Arj . . . . . Arm

Ac1 . . . . . Bc . . . . . Acm

Figure A Absorb schema

NP−Absorb S−Absorb

in−NP in−S in−S

as−qualifier as−instrument as−affected−apposition as−mean as−co−event

12 1 3 1

Figure A Absorb revision op eration hierarchy

clause resulting from this attachment then gets itself attached under the toplevel clause Larry Bird scored

p oints Monday night including and replaces in this clause the original absorb ed nominal seven

p ointers This is thus a case of Absorb of Nominal into Clause as Instrument Adjunct

The variety of Absorb revision op erations are given in Fig A Like Adjoin Absorb is a versatile to ol

The analyzed corp ora contained cases where the absorb ed constituent was a nominal and others where it

was a clause A nominal could b e absorb ed either by another nominal or by a clause When absorb ed

by another nominal it is always lling the qualier syntactic function ie the absorbing and absorb ed

nominals were linked by a prep osition When absorb ed in a clause the nominal could app ear either as an

instrument adjunct or as an element of an app ositive aected argument In this latter case absorption is

thus indirect via an intermediate paratactic complex the app osition In the corpus all the absorb ed clauses

were absorb ed in another clause app earing either as a mean adjunct or as a coevent adjunct Finally only

one absorb sub class was sometimes accompanied by a side transformation abridged reference Absorb of

Nominal in Clause as Instrument Adjunct

A surface decrement pair for each sub class of Absorb encountered in the corp ora follows

Absorb Nominal in Nominal as Qualier ie as PP complement

source sports Patrick Ewing scored points

target sports Ricky Pierce scored a p ersonal season high of points

source finance The bluechip Financial Times sto ck index p oints to

target finance The bluechip Financial Times sto ck index climb ed another p oints

to a new high of

Absorb of Nominal in Clause as element of app ositive aected argument

source sports the Los Angeles Clipp ers recorded a franchise record reb ounds in a

victory over the Denver Nuggets

target sports the Houston Ro ckets held Charles Barkley to in winning their fourth

straight game a triumph over the Philadelphia ers

Absorb of Nominal in Clause as an Instrument Complement

source sports scored p oints Saturday night including a for

performance from the line

target sports Larry Bird scored p oints Monday night including matching his own

club record with seven pointers

source finance The Korean Comp osite Index p osted a point rebound

target finance Prices plummeted another p oints on Saturday but b egan showing

signs of recovery on Monday with a steep point rebound

Absorb of Clause in Clause as CoEvent Adjunct with Agent Control

source sports as the New York Knicks routed the Philadelphia ers

target sports as the Boston Celtics won their fourth straight game downing the

Houston Rockets

source finance the Dow transportation average fell points to

target finance the Dow transp ortation average retreated for the second straight ses

sion falling another points to

Absorb of Clause in Clause as Mean Adjunct with Agent Control

source sports helping Detroit beat Indiana

target sports to help the Charlotte Hornets break a vegame losing streak by

holding off the Cleveland Cavaliers

source finance the market rose to points Tuesday

target finance the market tantalized punters by rising above

A Conjoin

Conjoin applies to almost any base structure The only restriction on its application is that the base structure

has to b elong to a syntactic category that can participate in either a co ordination or an app osition The

general revision schema of Conjoin is given in Fig A Conjoin consists of replacing the base constituent

B by a new paratactic structure grouping this base constituent with the additional constituent A B and

c c c

A are thus conjoined in this new paratactic structure Conjoin diers from the two preceding to ols Adjoin

c

and Absorb in that it pro duces a paratactic structure Base Structure Revised Structure

&

Bc

Bc Ac

Figure A Conjoin schema

The pair of phrases b elow illustrates how Conjoin can b e used to add a streak information to a game

result clause

C to enable the Philadelphia ers to defeat the Atlanta Hawks

0

to enable the Philadelphia ers to defeat the Atlanta Hawks and win their sixth C

1

a

straight game

The sub ordinate clause to defeat the Atlanta Hawks corresp onding to B in the schema

c

b ecomes the rst element of a co ordination of two sub ordinate clauses replacing the original sub ordinate

clause in the main clause The second clause win their sixth straight game corresp onding to A in the

c

schema in the co ordination conveys the additional streak information This is thus a case of Coordinative

Conjoin of Bottom Level Clause

The variety of Conjoin revision op erations is given in Fig A Both nominals and clauses can b e

conjoined The English grammar provides two distinct ways of building a paratactic complex co ordination

3

and app osition In a co ordination the elements are connected by a co ordination conjunction and they can

b e semantically related in various and p otentially lo ose ways In an app osition the elements are simply

juxtap osed one after the other and separated only by some punctuation mark The app osited elements need

to b e semantic coreferents The analyzed corp ora sublanguage did not contain cases of clausal app ositions

As for Adjoin it is p ossible to add the same information to the draft by conjoining the new constituent

with draft constituents lo cated at dierent depths in the draft structure For example consider again adding

streak information to the draft phrase C ab ove Instead of conjoining it with subordinate clause to defeat

0

ab ove it is also p ossible to conjoin it with the embedding clause the Atlanta Hawks yielding C

1

a

yielding

to enable the Philadelphia ers to defeat the Atlanta Hawks and record their sixth C

1

b

straight victory

This is a case of Coordinative Conjoin of Top Level Clause Similarly the same information can b e

attached by using app osition on toplevel nominals ie those app earing directly as a clause argument or

adjunct or on b ottom nominals ie those emb edded as nominal mo diers

Like Adjoin and Absorb some sub classes of Conjoin are p ossibly accompanied by reference abridging or

deletion as side transformations Two other side transformations scope marking and verb adjustment are

prop er to Conjoin see Section A for a discussion of these side transformations In the analyzed corp ora

Conjoin was used to add b oth historical and nonhistorical information

A surface decrement pair for each sub class of Conjoin encountered in the corp ora follows

TopNominal App ositive Conjoin

source sports to lead the New York Knicks to a victory over the Charlotte Hornets

target sports leading the Denver Nuggets to their rst win o the season a

victory over the Minnesota Timb erwolves

source finance the Amex Market Value Index climb ed to

3

e and or or but NP−Conjoin S−Conjoin

appos coord coord

top bottom top bottom

abridged full deleted +verb +scope full abridged full abridged ref ref ref adjust mark ref ref ref ref

15 5 4 1 11 2 2 5 1 1

Figure A Conjoin revision op eration hierarchy

target finance the Amex Market Value Index inched up to its ninth con

secutive advance

TopNominal App ositive Conjoin with Abridged Reference

source sports to lead the New York Knicks to a victory over the Charlotte Hornets

target sports lifting the Golden State Warriors to a victory over Milwaukee the

Bucks seventh straight road loss

TopNominal App ositive Conjoin with Deleted Reference

source sports to lead the New York Knicks to a victory over the Charlotte Hornets

target sports to lead the Detroit Pistons to their th straight victory over the

Sacramento Kings a decision

BottomNominal App ositive Conjoin

source sports to lead the New York Knicks to a victory over the Charlotte Hornets

target sports to p ower the Golden State Warriors to a triumph over the Los

Angeles Clipp ers losers of nine straight games

source finance the market is overcoming the weakness in IBM

target finance weakness in IBM a market leader and key dow comp onent

Nominal Co ordinative Conjoin

source sports a victory over Philadelphia their rst ever win against the ers

target sports a victory over Milwaukee their rst ever over the Buck and

Milwaukees th straight road loss

source finance durable go o ds orders slid to a billion in July the largest decrease

since orders fell in Decemb er

target finance durable go o ds orders rose to billion the indexs rst gain since

February and the indicators largest increase since a spurt last De

cemb er

Nominal Co ordinative Conjoin with Scop e Marker

source sports leading the Portland Trail Blazers to a triumph over the New York

Knicks their seventh straight victory

target sports leading the Utah Jazz to a victory over New Jersey their third

straight win overall and th straight over the Nets

source finance Alcoa p osted a prot of cents a share

target finance the bus company rep orted a fourth quarter prot of a share and

overall earnings of million or a share

Nominal Co ordinative Conjoin with Emb edding Verb Adjustment

source sports Patrick Ewing scored p oints

target sports Armon Gilliam had p oints and reb ounds

TopClause Co ordinative Conjoin

source sports to lead the New York Knicks to a victory over the Charlotte Hornets

target sports to lead the Cleveland Cavaliers to a victory over the Miami Heat and

break a three game losing streak

source finance prot taking sent prices lower on the Thailand Sto ck Exchange

target finance a sharp decline in b ond prices sent sto cks lower and halted the Dows

three day record setting march towards the barrier

TopClause Co ordinative Conjoin with Abridged Reference

source sports to lead the New York Knicks to a victory over the Charlotte Hornets

target sports to lead the Miami Heat to a victory over New Jersey and snap the

Nets three game winning streak Base Structure Revised Structure

& &

Bc1 . . . Bcn Bc1 . . . Bcn Ac

Figure A App end schema

BottomClause Co ordinative Conjoin

source sports helping Detroit b eat Indiana

target sports to enable the Philadelphia ers to defeat the Atlanta Hawks and

win their sixth straight game

source finance as signs of an improving economy forced b ound prices to give up initial

gains

target finance as computerguided sell programs and prottaking teamed up again to wip e

out initial gains and deal the Dow its fourth straight loss

BottomClause Co ordinative Conjoin with Abridged Reference

source sports helping the Chicago Bulls defeat the Seattle Supersonics

target sports to help the Denver Nuggets defeat Dallas and deal the Maver

icks their th straight loss

A App end

Append applies only to draft constituents that are already paratactically structured As shown in its revision

schema Fig A Append simply consists in adding a new element A inside such a structure

c

The pair of phrases b elow illustrates how Append can b e used to add a third game statistic to a nominal

conjunction already conveying two statistics by the same player

contributed p oints and reb ounds

Benoit Benjamin contributed p oints reb ounds and six blo cked shots

The nominal six blo cked shots corresp onding to A in the schema is app ended to the nominals

c

p oints and reb ounds In the corp ora analyzed Append was used with b oth nominals and clauses

However it o ccurred only for co ordinations and not for app ositions This is probably due to the general rarity

of app ositive constructions with more than two elements in English In some cases Append to Co ordinate

Clauses was accompanied by an Ellipsis sidetransformation see Section A for a discussion of these cases

The subhierarchy of Append revision op erations is thus trivial and is shown in Fig A together with

the subhierarchy of Nominalization revision op erations In the analyzed corp ora Append was used only

to add nonhistorical prop ositions

A surface decrement pair for each sub class of App end encountered in the corp ora follows

App end to Co ordinate Clauses

source sports David Robinson scored p oints and pulled down reb ounds

target sports Akeem Ola juwon scored p oints grabb ed reb ounds and blo cked

four shots

App end to Co ordinate Clauses with Ellipsis

source sports Bill Laimb eer scored p oints and Mark Aguirre added

target sports Willie Anderson scored p oints and David Robin

son

App end to Co ordinate Nominals

source sports Armon Gilliam had p oints and reb ounds

target sports Benoit Benjamin contributed p oints reb ounds and six blo cked

shots

A Nonmonotonic revisions

In nonmonotonic revisions a single transformation cannot account for the structural dierences b etween

the base pattern and the revised pattern The intro ductory transformation attaching the new constituent

onto the base pattern is accompanied by restructuring transformations necessary to accommo date the new

constituent in the revised structure

As shown in g A I identied ve main typ es of nonmonotonic revisions Recast Adjunctization

Nominalization Demotion and Promotion Each typ e is characterized by a dierent set of restructuring

transformations which involve displacing base constituents altering the base argument structure or changing

the base lexical head I present each typ e of nonmonotonic revision in turn in the following subsections

A Recast

A Recast is any typ e of transformation that involves displacing a base sub constituent to accommo date the

new content while preserving b oth the argument structure and the lexical head of the base The revision

schema of Recast is shown in Fig A An additional constituent A takes the place of a base constituent

c

lling the role R in the draft structure To accommo date A without losing the content conveyed by B

n c c

n

the latter is then moved to ll a new role R in the draft structure B

n+1 c

n

The pair of phrases b elow illustrates how Recast can b e used to add a streak information to a game

result clause

source Charlotte to ok a victory over the Atlanta Hawks

target Charlotte to ok its third straight win in a victory over the Atlanta Hawks

The additional nominal constituent its third straight win realizing the streak information corresp ond

ing to A in the schema is incorp orated as the range argument of the verb to take To accommo date this

c

new constituent in this slot its original o ccupant the game result nominal a victory over the Atlanta Base Structure Revised Structure

Bh Bh

R1. . . . . Rn R1. . . Rn Rn+1

. . . . . Bc1 Bcn Bc1 . . . . . Ac Bcn

Additional

Constituent

Figure A Recast schema

4

Hawks corresp onding to B in the schema is displaced as a time adjunct in the target phrase This is thus

c

a case of Recast of Range Argument into Location Adjunct This example shows that while involving

displacement of a base constituent a victory over the Atlanta Hawks Recast preserves b oth the

processagentrange argument structure of the source phrase and its head verb to take

The subhierarchy of Recast revision op erations is shown in Fig A Recast o ccurs inside b oth

nominals and clauses Each sub class of Recast is characterized by the source role prexed by a in

Fig A the recast constituent o ccupied in the base structure as well as the target role p ostxed by a

in Fig A it o ccupies in the revised structure In all nominal cases observed in the analyzed corp ora the

source role was the classier and the target role the qualier In clausal cases there were two source roles

the lo cation and range arguments and two target roles the instrument and time adjuncts In all cases

Recast was used to add historical information either a record or a streak and was not accompanied by any

side transformation

A surface decrement pair for each sub class of Recast encountered in the corp ora follows

Recast in Nominal from Classier to Qualier

source sports a victory over the Charlotte Hornets

4

In this case the time adjunct is intro duced by the lo cative prep osition in a common case of temp oral information conveyed

by way of a lo cative metaphor

NP−Recast S−Recast

classifier> location> range>

>qualifier >instrument >time >instrument

10 9 1 1

Figure A Recast revision op eration hierarchy

target sports their most lo opsided victory ever over the Denver Nuggets

gain source finance a

target finance its biggest one day gain of

Recast in Clause from Lo cation Argument to Instrument Adjunct

source sports to lead the New York Knicks to a victory over the Charlotte

Hornets

target sports leading the Chicago Bulls to their rd straight homecourt victory with

a blowout of the Minnesota Timberwolves

Recast in Clause from Range Argument to Time Adjunct

source sports as Utah to ok a victory over the San Antonio Spurs

target sports as Charlotte to ok its third straight win in a victory over the

Atlanta Hawks

Recast in Clause from Range Argument to Instrument Adjunct with Deleted Reference

source sports to help the Charlotte Hornets p ost a victory over the Orlando

Magic

target sports helping the Sacramento Kings p ost their rst win over the Los Angeles

Lakers with a victory

A Adjunctization

Adjunctization was already presented in Section It applies only to clausal bases headed by a

supp ort verb ie a verb that do es not convey in itself any content in the context of the clause but merely

serves as a syntactic supp ort for its meaningb earing ob ject The Adjunctization schema is given in

Fig A The additional content is realized by a combination of two new constituents a fullverb V ie a

f

verb that b ears meaning on its own and its new ob ject A Deprived of the head verb that supp orted it in

c

migrates to an adjunct p osition in the revised clause It has thus b een the base clause original ob ject B

c

n

adjunctized Adjunctization is similar to Recast in that it forces the base constituent to migrate from one

slot to another in the clause structure It diers from Recast in that

It is do es not preserve the base argument structure since a supp ortverb clause is replaced by a fullverb

clause

It do es not preserve the base lexical head since part of the new content is intro duced by a new fullverb

It is much more sp ecialized since it applies to nominals nor to clauses headed by fullverbs Base Structure Revised Structure

Vs Vf

Arg1 Argn Arg1' Argn' Adjunct

Bc1 . . . . . Bcn Bc1 . . . . . Ac Bcn

Figure A Adjunctization schema

The pair of phrases b elow illustrates how Adjunctization can b e used to add a streak information to a

game result clause

source the Denver Nuggets claimed a victory over the Dallas Mavericks

target the Denver Nuggets ended their three game losing streak with a victory over

the Dallas Mavericks

The streak information is intro duced by a new fullverb to end corresp onding to V in the schema

f

and a new ob ject nominal their three game losing streak corresp onding to A in the schema Deprived

c

of its supp ort verb to claim corresp onding to V in the schema the original ob ject nominal a

s

victory over the Dallas Mavericks corresp onding to B in the schema conveying the game result migrates

c

to an instrument adjunct Since the thematic role of the original ob ject in the source phrase was range this

is a case of Adjunctization of Range Argument into Instrument

The other typ es of Adjunctization are shown in Fig A Cases of Adjunctization are rst charac

terized by the target adjunct role of the displaced constituent In the analyzed corp ora there were three

such target roles instrument opp osition and destination Constituents that moved to destination role all

originally lled the created role role Those that moved to opp osition role originally lled the aected role

Those that moved to instrument role came from either created range or lo cation role p ositions Some cases of

Adjunctization were accompanied by side transformations of typ e reference abridging or reference deletion

The example b elow shows an interesting case where Adjunctization is coupled with another revision

to ol Demotion describ ed on its own in Section A

source to help the Spurs post a victory over the Orlando Magic

target to help end the Spurs ve game losing streak with a victory over the Los Angeles

Lakers

In addition to the Adjunctization of the game result nominal a victory over the Los Angeles

Lakers from range argument in the source phrase to instrument in the target phrase the nominal the

Spurs is also demoted from agent argument in the source phrase to genitive determiner of the aected

argument in the target phrase In this target phrase the sub ordinate ob ject clause of to help no longer

p ossesses an agent argument of its own like the source phrase Its agent is an implicit controlled one that

is shared with the emb edding help clause This example is one of the rare cases of fusion b etween two

classes of revision op erations It could have b een equally well classied as a case of Demotion

In the analyzed corp ora Adjunctization was used only to add historical information either records or

streaks It was also the most widely used nonmonotonic revision op eration

A surface decrement pair for each sub class of Adjunctization encountered in the corp ora follows

Adjunctization of Created Argument into Instrument Adjunct Adjunctization

>destination >opposition >instrument

created> affected> range> created> location> 1 1 +agent demotion

abridged full deleted full abridged ref ref ref ref ref

727 3 1 14 5 4

Figure A Adjunctization revision op eration hierarchy

source sports Patrick Ewing scored points

target sports Kevin Edwards tied a career high with points

Adjunctization of Range Argument into Instrument Adjunct

source sports to help the Charlotte Hornets post a victory over the Orlando

Magic

target sports to help the Milwaukee Bucks snap a losing streak at ve games with a

victory over the Detroit Pistons

source finance The Korean Comp osite Index posted a point rebound

target finance the market b egan showing signs of recovery with a point rebound

Adjunctization of Range Argument into Instrument Adjunct with Abridged Reference

source sports to help the Charlotte Hornets post a victory over the Orlando

Magic

target sports to help Chicago remain unb eaten against Minnesota alltime with a

victory over the Timberwolves

Adjunctization of Range Argument into Instrument Adjunct with Abridged Reference

source sports to help the Charlotte Hornets post a victory over the Orlando

Magic

target sports to help the Indiana Pacers snap the Seattle Sup ersonics game win

ning streak with a triumph

Adjunctization of Range Argument into Instrument Adjunct with Demotion of Agent

Argument Into Determiner of Aected Argument

source sports to help the Charlotte Hornets post a victory over the Orlando

Magic

target sports to help end the Spurs ve game losing streak with a victory

over the Los Angeles Lakers

Adjunctization of Lo cation Argument into Instrument Adjunct

source sports and the Chicago Bulls rol led to a victory over the New Jersey

Nets

target sports and the Denver Nuggets ended their three game losing streak with a

victory over the Dallas Mavericks

Adjunctization of Lo cation Argument into Instrument Adjunct with Abridged Reference

source sports and the Chicago Bulls rol led to a victory over the New Jersey

Nets

target sports and the San Antonio Spurs continued their mastery of Dallas with a

victory over the Mavericks

Adjunctization of Aected Argument into Opp osition Adjunct

source sports Detroit beat Indiana

target sports the Chicago Bulls won their fourth straight over the Atlanta

Hawks

Adjunctization of Created Argument into Destination Adjunct

source sports Patrick Ewing scored points

target sports Travis Mays made of free throws on his way to points Base Structure Revised Structure

Vf Vs

Arg1 Arg2 Arg1' Arg2' Arg3'

Bc1 Bc2 Bc1 Bc2 Nf

mod

Semantic Element

Figure A Nominalization schema

A Nominalization

Nominalization is the converse of Adjunctization Adjunctization presented in the previous section

applies only to clausal bases headed by a supp ort verb which it replaces by a full verb Nominalization

applies only to clausal bases headed by a full verb V which it replaces by a supp ort verb V and a new

f s

ob ject constituent This new ob ject is an NP headed by a nominal synonym N of the original full verb

f

V The revision schema of Nominalization is shown in Fig A The argument structure of the clause is

f

expanded by the intro duction of N This new head noun then serves as the anchor p oint for the attachment

f

of the constituent A conveying the additional content The nominalization in N of the content originally

c f

conveyed by V allows incorp orating the new content as a nominal instead of clausal mo dier The main

f

motivation underlying this revision is that nominal mo dications are in general more concise that clausal

ones

The pair of phrases b elow illustrates how Nominalization can b e used to add a streak information to a

game result clause

source the Seattle Sup ersonics defeated the Sacramento Kings

target the Seattle Sup ersonics handed the Sacramento Kings their rd straight defeat

The full verb to defeat corresp onding to V in the schema is replaced by a verbob ject collo cation to

f

hand a defeat to hand and defeat resp ectively corresp ond to V and N in the schema The streak

s f

information is then attached as a complex ordinal rd straight premo difying the head noun defeat

Note that the nominal the Sacramento Kings has also migrated form direct to indirect ob ject p osition

In this example the nominalization is strict the target noun defeat is the precise nominal form of the

source verb to defeat This is not necessarily the case The target noun can b e any noun whose meaning

in the context of the revised phrase corresp onds to the meaning of the source verb in the context of the base

phrase Hence the verb to b eat expressing the game result in the pattern WINNER b eat LOSER

can b e nominalized by the noun loss in the pattern WINNER hand LOSER a loss

The revision op eration of Nominalization present here is thus a more general concept than the lexical

transformation of nominalization usually describ ed in the linguistic literature

The subhierarchy of Nominalization revision op eration is shown together with that of Append in

Fig A Each Nominalization sub class is characterized by the typ e of nominal mo diers used to express

the additional content In the analyzed corp ora an ordinal mo dier was always present It was either alone

accompanied by a classier or accompanied by a qualier Nominalizations were not accompanied by any

side transformation

A surface decrement pair for each sub class of Nominalization encountered in the corp ora follows Append Nominalization

NP−Append S−Append +ordinal +ordinal +ordinal 1 +classifier +qualifier 1 2 1 simple w/ Ellipsis

1 1

Figure A App end and Nominalization revision op eration hierarchy

Nominalization with Ordinal Adjoin

source sports the New York Knicks routed the Philadelphia ers

target sports the Milwaukee Bucks handed the Denver Nuggets their fourth

straight loss

source finance the Tokyo sto ck market plunged

target finance the Tokyo sto ck market p osted its third consecutive decline

Nominalization with Ordinal and Classier Adjoin

the Philadelphia ers source sports the New York Knicks routed

target sports the Seattle Sup ersonics handed the Sacramento Kings their rd

straight road loss

source finance the All Industrials Index lost p oints

target finance the Hong Kong Sto ck Exchange suered its third straight big loss

Nominalization with Ordinal and Qualier Adjoin

source sports the New York Knicks routed the Philadelphia ers

target sports the Pho enix Suns handed the Portland Trail Blazers their rst loss

of the season

A Demotion

A Demotion is any typ e of transformation that involves displacing a toplevel argument of the base structure

to an emb edded p osition inside the constituent realizing the additional content in the revised structure The

originally a toplevel argument revision schema of Demotion is shown in Fig A The constituent B

c

n

of the new constituent in the base structure gets emb edded in the revised structure under the head A

c

h Base Structure Revised Structure

Bh Rh

Bc1 . . . . . Bcn−1 Bcn Demoted Bc1 . . . . . Bcn−1 Ach Constituent

Bcn

Figure A Demotion schema

realizing the additional content In Absorb previously presented a base constituent also b ecomes emb edded

inside the constituent conveying the new content What distinguishes Demotion from Absorb is that in the

case of Demotion

The new constituent under which the displaced base constituent gets emb edded do es not necessarily

ll the same role in the revised structure as the one the displaced constituent o ccupied in the base

structure

The revised head R may dier from the head base B

h h

The pair of phrases b elow illustrates how Demotion can b e used to add a record up date information to

a game statistic clause

source Kevin Johnson scored points

target Kevin Johnson matched his career high of points

The record up date information is conveyed by a combination of a new head verb to match corresp onding

in the to R in the schema and a new ob ject his career high whose head high corresp onds to A

h c

h

in the schema is demoted as a schema The original ob ject nominal p oints corresp onding to B

c

n

qualier inside the new ob ject constituent Since the original ob ject lled a Created role in the source phrase

and the new one lls an Aected role in the target phrase this is a case of Demotion of Created Argument

into Qualifier of Affected Argument

As shown in Fig A the analyzed corp ora also contained cases where an Aected argument was

demoted as the Determiner of a new Aected argument and cases where a Score phrase was demoted from

b eing an adjunct of the main clause to b eing an adjunct of a new sub ordinate CoEvent Adjunct clause

Cases of Demotion were not accompanied by any side transformations and were used only to add historical

information either a record or a streak

A surface decrement pair for each sub class of Demotion encountered in the corp ora follows

Demotion of Created Argument into Qualier of Aected Argument

source sports Patrick Ewing scored points

target sports Kevin Johnson matched his career high of points

source finance the All Singap ore Index closed at Demotion Coordination Promotion

score> created> affected> simple w/ Adjunctization 1 1 >score(co−event) >qualifier(affected) >determiner(affected)

1 2 1

Figure A Demotion and Promotion revision op eration hierarchy

target finance the key index had set a record high of

Demotion of Aected Argument into Determiner of Aected Argument

source sports helping Detroit beat Indiana

target sports helping the Pho enix Suns end New Jersey s six game winning streak

source finance that drove the market down

target finance that snapp ed the Dow s three day record setting streak

Demotion of Score Phrase from Main Clause to CoEvent Sub ordinate Clause

source sports the Jazz routed the Minnesota Timb erwolves

target sports the Orlando Magic shut down the Washington Bullets for the second time

in a week winning

source finance home construction climb ed p ercent

target finance home construction climb ed a second consecutive month during

Septemb er rising p ercent

A Co ordination Promotion

Coordination promotion applies only to hyp otactic base structures with a co ordinated argument It undo es

the meaning factoring achieved by such structures and transforms it into a toplevel co ordinated structure

of hyp otactic elements The emb edded co ordination in the base structure is thus promoted to the toplevel

in the revised structure Accommo dating the new content requires undoing the meaning factoring achieved

by the emb edded co ordination when only one of its elements is semantically related to the new content The

revision schema of a coordination promotion is shown in Fig A

The pair of phrases b elow illustrates how Coordination Promotion can b e used to add the record

equalling nature of a game statistic whose expression in the draft is co ordinated with the expression of

another game statistic Base Structure Revised Structure

Bh &

Arg1 Arg2

Bc1 & Bh Ah

Arg1 Arg2 Arg2

Bc2 Bc3 Bc1 Bc2 Bc3

Arg1

Ac

Figure A Promotion schema

source Larry Smith added p oints and reb ounds

target Larry Smith added p oints and matched a season high reb ounds

In each phrase the scop e of the conjunction is indicated with brackets In the source phrase the verb

to add corresp onding to B in the schema is factored out over the nominal conjunction of p oints

h

in the schema Without in the schema and reb ounds corresp onding to B corresp onding to B

c c

3 2

such factoring the same content would b e expressed by a conjunction of clauses eg

Larry Smith added p oints and grabb ed reb ounds

Because the additional record equalling prop erty concerns only the reb ounding statistic and not the scoring

one accommo dating this new content requires undoing this factorization and return to the clause co or

dination form A second verb to tie Larry Smith corresp onding to A in the schema can then b e

h

added together with a classier season high corresp onding to A in the schema to express this additional

c

content Since the agent of to tie is shared with that of the conjoined verb to add it can b e elided

The subhierarchy of Coordination Promotion revision op eration was shown together with that of

Demotion in Fig A There is only one variant from the simple case illustrated by the example ab ove

In this simple case after the the intro duction of the second verb p ercolating the co ordination up to the top

level the remaining additional content is incorp orated by adjoining a classier to the ob ject of this second

verb In the complex variant this additional content is instead incorp orated by adjunctizing the ob ject of

this second verb as an instrument On the example source phrase ab ove this variant yields the following

target phrase

Larry Smith added p oints and matched a season high with reb ounds

Coordination promotion is an example of revision op eration that is very sp ecialized and involves very

complex surface form transformations

A surface decrement pair for each sub class of Promotion encountered in the corp ora follows

Promotion of Co ordination

source sports contributed p oints and b oards

target sports Larry Smith added p oints and matched a season high re

b ounds

source finance the Nikkei traded in range b etween and

target finance the Nikkei op ened at a low of and rose to its high of

Promotion of Co ordination with Adjunctization of Co ordinated Element into Instrument Adjunct

source sports Benoit Benjamin contributed p oints reb ounds and six blocked

shots

target sports Shaquille ONeal added p oints reb ounds and tied a record

with seven blocked shots

A Side transformations

The base pattern transformations presented in the two previous sections meet one of the two following goals

Attach the additional constituents realizing the new content intro ductory transformation

Alter the surface form of the existing content to accommo date this new constituent into the structure

under revision restructuring transformation

These transformations are the core transformations involved in informationadding revisions However

they are sometimes accompanied by side transformations whose goal is to correct whatever verb ose rep eti

tions ambiguous forms or invalid word collo cations that the core transformation may have intro duced in the

draft

I identied six main typ es of side transformations reference adjustment argument control ellipsis

scope marking ordering adjustment and lexical adjustment They dier from each other in terms of

the goal they satisfy the asp ect of the draft they alter and the typ es of revision to ols that they accompany

I present the six typ es of side transformations in turn in the following subsections

A Reference adjustment

Reference adjustment the most widely used side transformation in the analyzed corp ora was already pre

sented in Section Its goal is to insure conciseness by suppressing rep etitions intro duced by the

attachment of the additional prop osition In the analyzed corp ora initial references use a set of default

prop erties asso ciated with the class of the referred entity in the domain ontology For example in the sp orts

domain initial references to teams use b oth their home and name eg the New York Knicks However

when a revision to ol intro duces a second reference this set of prop erties is then distributed b etween these two

references The most common distribution strategy is to use half the prop erties in the added reference and

5

then abridge the existing initial reference so that is uses only the remaining half This is called Reference

Abridging The pair of phrases b elow illustrates how it used as a side transformation to an Adjoin of

Relative Clause to Top Level Nominal cf Section A for a discussion of this revision to ol

source S to pace the Atlanta Hawks to a triumph over the Minnesota Timberwolves

1

target T to pace the Atlanta Hawks to a triumph over Minnesota that sent the Timb er

1

wolves to their sixth straight defeat

After the Adjoin revision intro duced a second reference by name only to the losing team the initial

5

Note that the use of this transformation is based on the assumption that the rep ort targeted audience knows ab out every

prop erty in the default set since otherwise it could not establish the coreference link in the revised pattern

draft reference is abridged from the name home form to the home only form In the case of teams in

the sp orts domain each of the two prop erties is indierently used in b oth the initial draft reference and the

reference added by the revision Thus a converse target phrase is

target T to pace the Atlanta Hawks to a triumph over the Timb erwolves that sent Min

2

nesota to its sixth straight defeat

where the prop erty deleted from the initial draft reference is the home instead of the name

Without such reference abridging side transformation the Adjoin revision would yield the rep etitious

forms b elow rep etitions are underlined

Minnesota Timb erwolves that target T to pace the Atlanta Hawks to a triumph over the

3

sent the Timb erwolves to their sixth straight defeat

target T to pace the Atlanta Hawks to a triumph over the Minnesota Timb erwolves that

4

sent Minnesota to its sixth straight defeat

that target T to pace the Atlanta Hawks to a triumph over the Minnesota Timb erwolves

5

sent the Minnesota Timb erwolves to their sixth straight defeat

An alternative strategy to reference abridging is Reference Deletion In this case the whole set of

default referring prop erties is used in the added reference and to avoid rep etition the entire initial draft ref

erence is deleted On the example source phrase S ab ove reference deletion yields the following alternative

1

target phrase

target T to pace the Atlanta Hawks to a triumph that sent the Minnesota Timb erwolves

6

to their sixth straight defeat

Reference deletion is less widely applicable than reference abridging b ecause it requires the semantic link

b etween the draft constituent from which the initial reference is deleted and the constituent added by the

revision to b e strong enough so that the deleted referent can b e inferred from the added context In the

example ab ove even though the PP referring to the losing team in the game result nominal was deleted the

fact that Minnesota was the loser of the game is still inferable from the added relative clause expressing the

losing streak extension

Reference abridging and deletion accompanied several classes of Adjoin Conjoin and Adjunctization

revisions Corp ora examples for each sub class were given in the previous sections of this app endix

A Argument control

Argument control is a sp ecialized side transformation that systematically accompanies the sp ecic case of

Absorb where the added absorbing constituent shares one argument with the absorb ed draft constituent In

such cases the common argument in the draft constituent is deleted an is control led by the corresp onding

argument in the added emb edding constituent This side transformation avoids what would constitute a

stylistically inappropriate rep etition and opp ortunistically takes advantage of this commonality b etween

draft and additional content to improve conciseness

The pair of phrase b elow illustrates this phenomenon

source S Minnesota rolled to a victory over the Charlotte Hornets

1

target T Minnesota snapp ed a three game losing streak rolling to a victory over

1

the Charlotte Hornets

In the target phrase the toplevel reference to Minnesota which lls the agent argument of the added em

b edding clause controls the omitted agent argument marked by the symb ol of the absorb ed emb edded

clause

Without such a side transformation the Absorb revision would yield the following rep etitious target

phrase

snapp ed a three game losing streak Minnesota rolling to a target T Minnesota

2

victory over the Charlotte Hornets

A Ellipsis

Ellipsis is another example of sp ecialized side transformation it accompanies only App end revisions It

opp ortunistically improves conciseness by exploiting the fact that the more elements that are present in a

co ordination the more fully each subsequent element can b e elided Therefore when revision is used to

app end a new element to a co ordinated constituent of the draft part of the co ordinated elements can b e

deleted to b ecome implicit The pair of phrase b elow illustrates this phenomenon

source S Willie Anderson scored p oints and Terry Cummings added target T Willie

1 1

Anderson scored p oints Terry Cummings and David Robinson

In the second element of the source phrase only the head of its ob ject nominal realizing the unit of the

game statistic is elided After the app ending of the third element the verb of the second element b ecomes

implicit as well In a draft and revision generation framework there is a signicant dierence b etween the

verb ellipsis in the second and third element of this target sentence only the former concerning the existing

draft material requires a side transformation the latter concerns to realization of the added constituent in the

revision context The side transformation Ellipsis presented here is thus a distinct and less general concept

than the ellipsis traditionally discussed in the linguistic literature see the system plandoc McKeown et

al for the generation of this more general typ e of ellipsis In particular it is applicable only to cases

where the new constituent shares more than one semantic elements with the draft co ordination to which it

is app ended

A Scop e marking

The goal of the three side transformations already presented is to achieve conciseness by deleting elements

in the draft that are redundant with the constituent added by the revision In some cases where the draft

content and the added content share information the most concise form results from an alternative strategy

This strategy consists of applying the revision to ol at the deep est p ossible level in the draft However this

may sometimes result in ambiguous phrases Scope marking is a sp ecialized side transformation correcting

the ambiguities generated by applying Coordinative Conjoin at the deep est p ossible level

The example phrases b elow illustrates the use of Scop e Marking with a Coordinative Conjoin to add

a second game statistic by a dierent player but of equal value and unit than the rst one

source S scored p oints

1

target T Magic Johnson scored p oints and Byron Scott added

1

target T Magic Johnson and Byron Scott scored p oints

2

target T Magic Johnson and Byron Scott scored p oints apiece

3

T resulting from the application of Conjoin at the clause level fails to maximally factors out the shared

1

semantic element b etween the draft and added content leading to the rep etition of p oints T resulting

2

from the application of Conjoin at the the clause agent level is more concise However it is also ambiguous

in that the clause can then b e interpreted b oth distributively ie each scored p oints and collectively

ie they together scored p oints In T this ambiguity is removed by the scop e marker apiece forcing

3

the distributive reading Note that the side transformation o ccurs at a higher level in the draft structure than

the revision it accompanies A dierent scop e marker total can b e similarly used to force the collective

reading This collective reading can also b e forced by adjusting the lexical head of the clause to a verb with

intrinsically collective meaning such as to combine for This alternative lexical typ e of side transformation

is presented in Section A

A Ordering adjustment

As noted in the previous section a new constituent added to draft by a revision to ol can intro duce ambigui

ties These ambiguities can sometimes b e circumvented by simply mo difying the precedence relations among

the draft constituents without altering their dep endence relations This is esp ecially true for adjuncts whose

linear p osition in the clause is largely underconstrained by syntactic dep endencies

The example phrases b elow illustrate how reordering constituents can disambiguate a revised phrase

resulting from the application of Adjoin of Relative Clause to Bottom Nominal

source S the Boston Celtics triumphed over the Miami Heat

1

target T the Boston Celtics triumphed over the Miami Heat who have lost thee consecutive road

1

games

target T the Boston Celtics triumphed over the Miami Heat who have lost three consecu

2

tive road games

In T a relative clause is added to the losing team reference to convey a streak information This ad

1

ditional clause makes the attachment of the score adjunct ambiguous is it the score of only the

last game or of each and every game in the streak In T the rst reading is forced by moving the score

2

adjunct in front of the nominal onto which the relative clause was added

A Lexical adjustment

The addition of a new constituent to the draft by a revision to ol may intro duces ambiguities It may also

violate interlexical constraints Lexical adjustment consists in changing a draft wording not to convey any

additional content but rather to avoid either of these two problems

The pair of phrases b elow illustrates the use of lexical adjustment with a Nominal Coordinative

Conjoin to add a reb ounding statistic to a scoring statistic by the same player

source Armon Gilliam scored p oints

target Armon Gilliam had p oints and reb ounds

Keeping the verb to score after this Conjoin would result in the invalid verbob ject collo cation

to score a reb ound

The emb edding head verb is thus adjusted to the versatile verb to have which can form a valid verbob ject

collo cation with any noun expressing a game statistic whether p oint reb ound assist etc Note that

as in the case of scop e marking this lexical adjustment o ccurs in a higher level draft constituent than the

revision to ols it accompanies An example of disambiguating lexical adjustment was given in Section A

It is interesting to note that four out of six side transformations accompany the two revision to ols Conjoin

and Append b oth concerned with co ordination It underlies the easily overlo oked complexity of this linguistic

phenomenon

App endix B

Implementation environment

the FUFSURGE package

In this app endix I provide background reference material concerning the implementation of the language

generator streak Surface Text Reviser Expressing Additional Knowledge streak summarizes a bas

ketball game by packing all the essential facts ab out the game as well as their historical signicance in a

single complex lead sentence It is based on the draft and revision language generation mo del put forward in

Chapter of this thesis The present app endix describ es the generation to ols that underlied the implemen

tation of this mo del It presents b oth these to ols as they existed b efore the development streak and the

particular extensions to their functionality that this development prompted The implementation of streak

itself using these to ols was presented in Chapter of this thesis

streak was implemented using the fufsurge package for developing language generation application

This package has b een develop ed by Elhadad over the past years and is still the ob ject of constant

improvement and extension In addition to streak it has b een used as the underlying environment for

the development of a wide variety of generation applications at Columbia University including multimedia

explanation McKeown et al sto ck market rep orts Smadja and McKeown and interactive

online student advising Elhadad b It has b een distributed to over research sites worldwide and is

currently b eing used for the development of pro jects at seven of these sites making it the most widely

used dedicated package for the development of language generation applications

fuf Elhadad b Elhadad a is a sp ecialpurp ose programming language for language generation

based on functional unication It comes as a package with surge a grammar of English implemented in

fuf and usable as a p ortable frontend for syntactic pro cessing fuf is the formalism part of the package a

language in which to enco de the various knowledge sources needed by a generator surge is the data part of

the package one already enco ded knowledge source usable by any generator Using the fufsurge package

implementing a generation system thus consists of decomp osing nonsyntactic pro cessing into subpro cesses

and enco ding in fuf the knowledge sources for each of these subpro cesses In the case of streak the

decomp osition into subpro cesses was presented in Chapter of this thesis In addition to surge streak

relies on three nonsyntactic knowledge sources the phrase planning rule base the lexicalization rule base

and the revision rule base The enco ding of these three knowledge sources in fuf was presented in Chapter

Both fuf and surge had to b e extended during the development of streak to meet some of its sp ecial

needs First the nonmonotonicity of streak required the implementation of new fuf op erators to cut

and paste functional descriptions aside from unifying them These extensions to fuf were implemented by

1

it only Elhadad Second while surge already covered a wide variety of syntactic forms for simple clauses

covered a very few for complex sentences which aggregate several such clauses It also did not cover the

sp ecialized nominals of quantitative domains I implemented sizeable extensions to surge to attain wide

coverage for complex sentences and quantitative nominals This set of extensions go es far b eyond what was

1

In fact as wide as any other available syntactic pro cessing frontends for language generation

needed for the sole needs of streak and constitutes in itself a signicant though not central contribution of

this thesis It was not simply an implementation matter but required to crossexamine the descriptive work of

several noncomputational linguists and integrate their resp ective analysis within the unifying computational

framework of surge For quantitative nominals some constructs I observed in newswire rep ort corp ora were

not mentioned in the linguistic literature and I had to come up with my own analysis The version of surge

resulting from these extensions has since b een used for other generation applications in addition to streak

automated do cumentation for the activity of telephone network planning engineers Kukich et al

verbal descriptions of visual scenes Ab ella and generation of business letters

In what follows I rst give an overview of fuf Functional Unication Formalism as a programming

language including the extensions for nonmonotonic pro cessing prompted by the development of streak

I then describ e the initial version of surge Systemic Unication Realization Grammar of English that

forms the core of streaks syntactic comp onent I then present the remaining of this comp onent ie the

extensions of surge to complex sentences and quantitative nominals two typ es of constructs which were

p ervasive in the corp ora of newswire rep orts that I analyzed for this thesis Finally I give the input set used

for systematically testing these extensions This test input set illustrates by example all the new features of

the extended coverage of surge

B The FUF programming language

fuf Elhadad b Elhadad a is a sp ecialpurp ose programming language for developing language

generation applications It is an extension cf Elhadad Elhadad and Robin of the initial

functional grammar formalism initiated by Kay Originally functional grammars where used to

implement only two generation subtasks morphosyntactic grammaticalization and linearization This was

the case for example in the systems text McKeown and tailor Paris The extensions to

the formalism included in fuf have b een in part motivated by its use for a ever wider range of generation

subtasks including lexicalization in several systems phrase planning in advisorI I and even nonlinguistic

tasks like media co ordination in comet and ontological deduction in plandoc The reviser of streak

constitutes the rst use of fuf for implementing a nonmonotonic task The advantage of using fuf to

implement deep er asp ects of the generation pro cess than just syntax is threefold First these deep er asp ects

can also b eneciate from the strong p oints of the formalism a fuf program works with partial information

and is declarative uniform and very compact since based on a few p owerful concepts such as functional

descriptions and unication Second using the same formalism to implement all the tasks precludes the

need for interfaces reformatting the same information from one formalism used by a given comp onent to

a dierent formalism used by another comp onent Third it allows exp erimenting with dierent ways to

distribute subtask among comp onents at implementation time for example assigning constituent gaping to

the lexicalizer vs to the syntactic grammar The limitations of fuf were discussed in Section

In what follows I present only the minimum set of fuf features necessary to provide the reader with the

ability to grasp the implementationrelated issues of this research as well as to understand the small co de

excerpts provided as examples in Chapter of this thesis For an indepth presentation of fuf see Elhadad

a and Elhadad b

B Functional Descriptions

The basic data structure used in fuf is a Functional Description FD An FD describ es a set of entities

satisfying a set of prop erties It consists of a set of pairs a v called features where a in an attribute

and v is the value of this attribute for the set of entities describ ed by the FD v can b e either an atom or

recursively an FD Allowing recursion makes FDs structured representations

An FD can have any numb er of features This variable arity of an FD makes it an inherently partial

description or approximation of a particular entity As more features as added to the FD the approximation

b ecomes more accurate This is what makes FDs so attractive for NLP applications natural language de

scriptions of a domain entity are also inherently partial The FD with no feature is noted nil An attribute nil

x ((a 1)) ((a x)) ((b 2))

1 2 ((a x) (b 2))

((a 1) (b 2))

fail

Figure B Space of FDs

with a nil value is useless in an FD since it do es not provide any information for the attribute eg

shape round size nil shape round

An attribute can o ccur only once in a given FD If it o ccurred twice with the same value the second

o ccurrence would bring no additional information and would b e useless eg

shape round shape round shape round

If it o ccurred twice with dierent values the second o ccurrence would bring contradictory information and

b e illegal eg

shape round shape square is inconsistent

The basic op eration on FDs is unication noted u thereafter Syntactically it reduces to union for FDs

with only atomic values eg

u a b a b

but involves structure mapping with emb edded FD values eg

u a b a b

c e

d c

a e d

Semantically u FD FD denotes the intersection of the resp ective denotations of FD and FD

1 2 1 2

Unication is thus used to gradually provide a description of a particular entity by identifying it as a

memb er of a gradually more restricted set It denes a partial order among set descriptions The FD nil is

the most general description and denotes the universal set of all entities The FD fail is the most sp ecic

description It is equivalent to all inconsistent FDs and denotes the empty set

This partial order on the space of structured values is paralleled with a partial order on the space of

atomic values dened by using the definefeaturetype construct eg after a

definefeaturetype x

u a x b a a b

Fig B gives an example of FD space where the ab ove definefeaturetype declaration holds The

p oints shown range from the most general FD nil to the most sp ecic fail All the direct sp ecialization

relations are shown as links b etween b oxes with those derived from the definefeaturetype are b oldfaced

The other links are derived from unication The arrow on the left side shows the direction of increasing

FD structural representation as sublists

a b d

e

c

b e a b

f b e

h i

FD equation representation equivalent to FD

a b d

a b e

a c

b e a b

f b e

h i

Figure B FD as equation set

information among these FDs Typ es allows the unication pro cess to handle hierarchically organized value

sets

B Paths

In addition to an atom or recursively a structured FD the value of a feature in an FD can also b e a path to

another feature in that FD A path is a list of attributes surrounded by curly braces Because each attribute

can app ear only once at a given depth inside an FD a path unambiguously identies a unique feature of

the FD So for example in the FD of Fig B the path value a b d unambiguously leads to the value

despite the present of a b attribute at the toplevel

A path p ointing to a feature whose value is an atom or a structured FD is direct A path p ointing to a

feature whose value is recursively a path is indirect A path p ointing to a feature whose value is nil either

explicitly or by the absence of the p ointed feature from the FD is uninstantiated So for example in the FD

of Fig B the path value of the feature under the path b e is direct that of the feature f is indirect and

its equivalent direct value is a b and that of h uninstantiated since there is no attribute i in the FD

In addition to path values fuf also allows path attributes These path attributes allow the representation

of an FD as a set of equation The structural and equative representation of the same FD are compared in

Fig B

Paths enco de equality constraints in fuf This increase of expressibility do es not come for free graph

2

ically paths transforms FDs from trees to general directed graphs The graphic representation of the FD

whose structural and equative representations were given Fig B is shown in Fig B of section B

B Functional Grammars

AFunctional Grammar FG is a metaFD sp ecifying a set of FDs An FG represents the set of FDs that

it unies with The unication of an FD with an FG noted unifd thereafter is a distinct and more

complex op eration that the unication of two FDs noted u There is no unication op eration for two

FGs unifd uses u as a subroutine It is more complex than u b ecause it handles the metaattributes alt

cset and control the metavalue given and also relative paths all of which are prop er to FGs These

2

Potentially cyclic

Initial FD a b

FG alt a alt c

d

c

Successive values of the FD

fail next branch in current disjunction

a c b c

a c b c d

fail backtrack to last choice point

a b undo intermediate enrichments

a b d

a b c d

Figure B Simple backtracking example

metakeywords resp ectively intro duce nondeterminism constituent recursion unidirectional constraints

pro cedural constraints and lo cal constraints to the unication pro cess Each subsequent paragraph discusses

one of these additions to the expressive p ower of streak

An FG is a general data structure A natural language grammar is only one the many typ es of knowledge

that it can conveniently enco de In streak it is used to enco de a lexicon a phrase planning rule base and

a revision rule base in addition to a grammar of English To avoid confusion b etween the data structure

sense and the linguistic sense of the word grammar I use FG for the former and syntactic grammar

for the latter

B Disjunction and nondeterminism

An FD is a conjunction of constraints each feature dening one constraint In an FG it is also p ossible to

dene disjunctions of constraints using the metaattribute alt meaning ALTernation The value of an alt

3

is a list of FDs Each element in the list corresp onds to a branch in the disjunction Disjunctions intro duce

nondeterminism in the unication pro cess

When an alt attribute is encountered in an FG during its unication with an FD fuf picks one branch

in the disjunction intro ducing a choice p oint and attempts to unify the FD with that branch If it suc

ceeds unication pro ceeds with the next attribute in the FG temp orarily ignoring the remaining branches

However if a failure later o ccurs when an attribute is found b oth in the FD and the FG with a dierent

value fuf backtracks to the last choice p oint and picks another branch in the corresp onding disjunction A

simple example of disjunction intro ducing backtracking in given in Fig B fuf rst take the rst branch

a in the disjunction sp ecifying the p ossible values of a With this choice unication with the FD fails

since a is incompatible with a fuf then backtracks to its last choice p oint in this case the choice

just made and picks the next branch in the disjunction a c fuf then pro ceeds enriching the FD with

c and then d When it reaches c however unication fails again since it is incompatible with

the current value c fuf thus backtracks to its last choice p oint the choice of the branch a c in

the a disjunction Before trying the next value it undo es all the enrichments b etween the failure and the

last choice p oint It then picks the third value pro ceeds to enrich the FD with d and then c and

successfully returns

Branches of disjunctions intro duced by the metaattribute alt are tried in their written order Another

3

Only metaattributes can have lists as values Regular features only atoms and FDs as values

FD x

a b x

FG alt x FG alt x

c c

cset a b x

x d

d with catattribute x

After toplevel unification

FD x

C

a b x

CSET A B

After recursive unification

FD x

c

a b x

D

cset a b

Figure B Constituent recursion

metaattribute ralt meaning Random ALTernation allows the denition of disjunctions in which branches

are tried in random order A third metaattribute opt meaning OPTional is used as an abbreviation for

the simplest typ es of disjunctions opt fd alt FD nil

fufs default systematic chronological backtracking strategy can b e overwritten by placing control anno

tations in disjunctions allowing fuf to use a more ecient dep endencydirecte d backtracking strategy cf

Elhadad and Robin The toplevel of an FG is required to b e a disjunction

B Constituency and recursion

The metaattribute cset meaning Constituent SET intro duces recursion on constituents The value of

this cset attribute is a list of paths Each path p oints to a subFD called a constituent After the initial

unication of the toplevel FD has completed fuf checks whether it has b een enriched by a cset feature

If it has each subFD sp ecied in this feature as a constituent is then recursively unied with the FG Each

subFD enriched by this recursion is then replaces the original subFD in the toplevel FD An example of

recursive unication with one constituent is given in Fig B At each step the features added by the last

recursion are upp ercased

This recursion mechanism is esp ecially helpful for building a linguistic output by reaccessing the FG for

each linguistic constituent cset allows sp ecifying the constituents in an FD explicitly Alternatively the

4

constituents can b e implicitly sp ecied by declaring a given attribute to b e the catattribute In this case

fuf will recurse on each subFD containing this attribute The default catattribute is cat An implicit

equivalent of the explicit FG example of Fig B is given next to it on the same gure

4

Thus raising the regular attribute to metastatus

B Relative paths and lo cality

In addition to the absolute paths presented in section B which p oint to a no de in an FD viewed as a

graph by listing the arcs leading to it from the toplevel node an FG can also contains relative paths which

p oint to a no de rst indicating how many levels to go up in the graph from the current no de ie the

no de representing the feature whose value is the relative path and then listing the arcs leading down to the

p ointed no de from that level A relative path starts with the caret sign meaning go up concatenated

with an integer indicating how many levels to go up the structure It is followed by a list of attributes similar

to an absolute path So for example the relative path a b means go up two levels and from there

go down following a then b The example FD of Fig B is repro duced here at the top of Fig B with its

corresp onding relative path version at the b ottom

The combination of constituent recursion and relative paths provides a p owerful abstraction to express

lo cal constraints generically For example in a syntactic grammar numb er agreement b etween the determiner

and the head of an NP is elegantly enco ded in surge by the single line determiner number head

number

in the branch of the grammar handling NPs This prohibits phrase like

a victories or

several victory

During the generation of a typical simple clause whose sub ject and ob ject are b oth NPs this branch will

b e accessed twice rst to p erform numb er agreement of the sub ject NP and then to p erform the numb er

agreement of the ob ject NP Without relative paths a distinct line would b e needed to enco de this agreement

for each semantic or syntactic role that an NP can ll in a clause or any other constituent

determiner number syntroles subject head number

determiner number syntroles object head number

determiner number syntroles indirectobject head number

determiner number syntroles subject qualifier complement head number

etc

Since each syntactic category follows many lo cal constraints and can ll a large numb er of roles relative

paths makes the grammar considerably more compact

There is a ip side to the expressiveness of relative paths they can b e ambiguous This is the case

when they form a Y conguration with other paths in the graph of an FD A simple Y conguration is

b oldfaced in Fig B showing the graphic form of the FD of Fig B It makes the relative path in the

feature f b e ambiguous The attribute f leads to the no de lab eled Y in the gure At this

no de there are three p ossible ways to go up one level b efore going back down following b and then e

Going up f which leads to the toplevel and then back to the same Y no de following b then e This

yields the value d e

Taking b which leads to the Z no de and then following b and e to the atomic value

Taking e which leads to the X no de and then to nowhere since there is no arc lab eled b down that

no de This yields a nil value

When an FG contains such a Y conguration with a relative path fuf disambiguates it by following

up the attribute next to which the relative path app ears in the text of the FG In the example ab ove this

convention means that the rst of the three p ossibilities is chosen

B Presence tests and unidirectionality

In general unication is bidirectional Dep ending whether an FG feature is already present in the input FD

with which it is unied the FG can either test part of the input or provide part of the output Consider for

example the FGs and FDs of Fig B

The feature a of FG tests part of the input when unied with FD but enriches the output when

1 1

FD contains only absolute paths

a b d

e

c

b e a b

f b e

h i

FD equivalent to FD with relative paths

a b d

e

c

b e a b

f b e

h i

Figure B Relative vs absolute paths

h i b a

nil X Z e f b c

Y 3 d e

1 2

Figure B The FD of gures and viewed as a graph

FG alt a b

FG alt a under b

FD a c

FD b c

u FD FG a b c

u FD FG a b c

u FD FG a b c

u FD FG fail

Figure B Unidirectional vs bidirectional unication

unied with FD The metavalues given and under render unication lo cally unidirectional For example

2

in contrast to a the feature a under works only as a test and thus fails with FD which

2

contains no a feature The metavalue given also allows to test for the presence of an attribute in the

input but without sp ecifying any value for it Therefore

5

a under v a given a v

B Pro cedural tests and negation

An FG can also contain the metaattribute control which takes as value a call to an arbitrary lisp predicate

The lo cal context of the FG is accessible to this call by way of sp ecial relative path parameters prexed by the

macrocharacter These sp ecial path parameters are interpreted as p ointers inside the subFD currently

unied with the FG The lisp predicate is called on the values that these parameters p oint to inside the

FD So for example in Fig B the a parameter p oints to when unied with FD and to when

1

unied with FD and FD

2 3

Like under control is an unidirectional attribute serving only as a test and never as enrichment It

allows mixing the declarative programming paradigm of fuf with the pro cedural programming paradigm

of lisp It is esp ecially useful to enco de inequalities in fuf as shown in Fig B The FG in that gure

enriches the input FD with the feature c if the resp ective values of a and b in that FD are dierent

Otherwise it enriches the FD with the feature c

B Mo dular FGs

The constructs defgrammar defalt and defconj allow the mo dular denition and reuse of FGs dis

junctions of FDs and conjunction of features ie FDs Once dened the disjunctions are referred to by

the metaattribute and conjunctions by the metaattribute

Figure B shows a small FG dened in one chunk on the right with its equivalent mo dular denition

on the left

5

Note that an attribute can app ear several times in an FG provided that all its o ccurrences but one are metaattributes or

paths

FG alt control not eq a b

c

c

FD a b

FD a b a

FD a b

u FD FG a b c

u FD FG a b a c

u FD FG a b c

Figure B Using control to enco de negation

setf g defgrammar g

alt a b alt a b

c alt e f c alt

g h conj

i j

e f defalt alt conj i j

g h defconj conj e f g h

Figure B Mo dular denition of an FG

B Monotonicity insertfd relo cate and canonicity

A central prop erty of the functional unication mechanism onto which fuf is based is that it is monotonic

Unication with an FG only enriches the input FD with additional features It do es not retract anything

all the features of the input FD are also present in the output FD

As explained in section many revision op erations I observed in humanwritten summaries are in

contrast not monotonic In order to concisely accommo date a new fact the syntactic structure and lexical

material expressing the original content of the draft is altered Due to their nonmonotonic nature these

revisions cannot b e implemented in streak by relying only on the unication op eration Instead the FD

representation of the draft must b e altered by cut and paste op erations

Prior to the development of streak the only userlevel functions of the fuf package were the two

unication op erations

fu to unify two FDs

unifd to unify an FD and an FG and which recurses on constituents

In order to extend the usability of fuf to applications such as streak which uses FDs and FGs as

underlying representations but which requires manipulating them nonmonotonically two additional user

level functions were added insertfd which pastes a smaller FD as a subFD inside a larger one under a

given path and the converse relocate which cuts a smaller FD app earing inside a larger one as the subFD

6

under a given path

The task p erformed by these two functions is far from b eing trivial due to the p otential presence of paths

in the smaller FD Relo cating a subFD containing nonlocal paths is esp ecially complex The diculties in

maintaining path semantics while cutting and pasting a subFD with these two functions are illustrated on

the simple examples of gures B and B

In Fig B the FD smaller is pasted inside the FD larger under the path x y The FD smaller

contains two uninstantiated paths and one instantiated path whereas larger contains only one path which

7

is uninstantiated Recall that paths read from the toplevel of FD and their semantics is that of an equation

b etween two features For example the feature f g a e in smaller stands for the equation a

f g a e

Once smaller gets emb edded inside larger under x y the path values inside smaller must b e prexed

by x y in order for them to still equate the same feature pairs For example in the new global context

provided by larger equation a ab ove b ecomes b x y f g x y a e

Two systemlevel fuf functions are used by the unication functions to retrieve the subFD under a

given path in a larger FD

topgdp larger path which stands for Go Down Path from TOP returns the value under path in

larger eg topgdp a b c a b c

topgdpp larger path which stands for Go Down Path Pair from TOP returns the attributevalue

pair under path in larger eg topgdp a b c a b b c

In the example run of Fig B the function calls following the call to insertfd shows that the pasting

resulting from that call could not have b e done by simply using these two functions in co ordination with fu

This would fail to change the paths inside smaller which would then b e erroneously interpreted in the new

global context For example equation a ab ove would b ecome the quite distinct c x y f g a e

This why a distinct insertfd function is needed

Note on this example how a pasting op eration can instantiates uninstantiated path in b oth FDs in

merged the values of attributes i b and h now all p oints directly or indirectly to the atomic value

6

Although I participated to their design and testing these functions were implemented by Elhadad

7

At least absolute paths as in the examples at hand

smaller

a b c d uninstantiated path

e

f g a e instantiated path

h i uninstantiated path

larger

x y c d

i x y a b uninstantiated path

setf merged insertfd smaller larger x y

x y c d

i x y a b gets instantiated by the pasting of smaller

a b x y c d x y prefix puts path in right new global context

in which it gets instantiated

e

f g x y a e x y prefix puts path in right new global context

h x y i x y prefix puts path in right new global context

in which it gets instantiated

setf second topgdpp larger x y fu topgdp larger x y smaller

a b c d

e

f g a e

h i

c d

i x y a b

larger

x y a b c d points to nowhere in new global context

e

f g a e points to nowhere in new global context

h i points to nowhere in new global context

c d

i x y a b indirectly points to nowhere

Figure B Insertfd pasting a smaller FD into a larger FD

larger

x y a b x y c d direct path local under x y

e

f g x y a e direct path local under x y

h u v i indirect path indirectly local under x y

c d

u v i x y a b indirect path nonlocal under u v

setf smaller relocate larger x y

a b canonical new phys rep of class Cb

e canonical original phys rep of class Cb

f g a e updated to point to canonical phys rep in new local context

h a b ex indirect nonlocal path becomes

direct path to canonical phys rep in new local context

c d a b noncanonical ex phys rep becomes

path to canonical one in new local context

setf smaller relocate larger u v

i

topgdp larger x y

a b x y c d points to nowhere in new local context

e

f g x y a e points to nowhere in new local context

h u v i points to nowhere in new local context

c d noncanonical phys rep

Figure B Relo cate cutting a smaller FD from a larger FD

Figure B illustrates the cutting of subFDs under two distinct paths inside the same FD This FD

larger is similar to larger except that the feature i x y b app ears under u v instead of x y

Its equation set representation is the following

a x y a e

a x y c d

a x y a b x y c d

a x y f g x y a e

a u v i x y a b

a x y f h u v i

All the paths in larger are instantiated In a nonmonotonic setting where the value of an attribute can

b e up dated attributes that happ en to share the same value and attributes that are equated by a path have

dierent semantics For example in a monotonic setting the two FDs b elow enco de equivalent information

FD a b

FD a b a

But this is no longer the case in a nonmonotonic setting If for instance the value of a is up dated to

the value of b in FD remains while in FD it b ecomes When cutting a smaller FD inside a larger FD

it is thus necessary to preserve the instantiation status of path values

In that resp ect for cutting the subFD under x y the equations aa do not p ose any diculty

The path values in these equations are local to the cutting scop e x y Cutting can thus b e done by simply

removing the x y prex from all the paths involved in these equations All the paths are still instantiated

in the result shown b elow

b a e

b c d

b a b c d

b f g a e

Now consider instead cutting the subFD under u v Equation a p oses a diculty The paths on

each side do not share a common prex The value of attribute i is an indirect path ultimately p ointing

via x y a b to the atomic value The diculty lies in the fact that this atomic value is not lo cated

in the cutting scop e u v The path value of i is thus nonlocal In such a case this path value cannot

b e preserved in the subFD that is b eing cut since this would means changing what was an instantiated

path into an uninstantiated one put it dierently it would mean changing the value of i from to nil

This path value must thus b e replaced by the atomic value to which it ultimately p oints to Because of

this need to fetch values that are out of the cutting scop e cutting a subFD cannot b e p erformed lo cally by

considering only the subFD The whole larger FD needs to b e insp ected

Coming back to cutting the subFD under x y equation a illustrates a further diculty that indirect

paths can trigger As in the case of a the paths on each side of a do not share a common prex The

path value of h thus at rst app ears to b e nonlocal like that of i in the previous example This path

value is u v i The value of i is itself a path x y a b In turn the value of b is a path x y c d

This last path leads to the atomic value The imp ortant p oint here is that this chain of path indirections

leads back to a lo cation within the cutting scop e x y Thus in the global context of larger the path

value of h really expresses an equality relation b etween h and d Since d is under the cutting scop e x y

this equality relation is really local to this scop e and must b e preserved in the cut FD Its lo cality is only

obscured by indirections rst out of but then back in this cutting scop e Thus the value of h in the subFD

cut from larger under x y must b e the path c d instead of the atom Path values which do not

share a common prex with the cutting path can therefore not b e uniformly handled by replacing them by

the nonpath value to which they ultimately p oint

The two examples ab ove show that two typ es of information lo cality and instantiation status which are

hidden by the list or equation set representations of an FD are crucial to the task of cutting a subFD inside

smaller f g noncanonical phys rep of class Cb

h

a e f g direct path to

b f h direct path to

c d a b indirect path to

smaller f g noncanonical phys rep of class Cb

h c d direct path to

a e a g direct path to

b c d direct path to

c d

Figure B Alternative list representations of the same FD

a larger FD while preserving the semantics of all paths This suggests that the rst stage in the cutting

pro cedure should b e a change of representation that explicitly reveals this information Such representation

is the quotient set Each element in a quotient set is a pair whose rst element is a set of equivalent paths

and whose second element is the nonpath value that is commonly p ointed to by these equivalent paths For

classes of equivalent uninstantiated paths this second element is nil The quotient set representation of

larger is given b elow

Ca x y f gx y a e

Ca x y f hu v ix y a bx y c d

The equality b etween attributes h and d that is obscured through indirections in the list representation

of smaller in Fig B is immediately clear in this quotient set they b elong to equivalent paths On the

list representation cutting the subFD under x y would involve following path chains while monitoring

moves in and out the cutting scop e On this quotient set representation it simply involves discarding

from each class the paths without a prex matching the cutting scop e and removing this prex from the

other paths The result for larger is shown b elow

Cb f ga e

Cb f ha bc d

The function relocate relies on this quotient set representation to cut a subFD inside a larger FD It

thus op erates in three steps

Convert the larger FD to quotient set form

Compute the quotient set of the subFD under the cutting scop e

Convert back the subFD quotient set into list form

The third step is nontrivial due to the fact that in most cases there are several p ossible list representa

tions for each quotient set representation For example the quotient set ab ove representing the subFD cut

under x y inside larger can b e alternatively represented by smaller and smaller in Fig B

in addition to smaller in Fig B

Comparing these three equivalent list representations reveals the sources of multiplicity in the mapping

from quotient set representations to list representations

The use of indirect paths

The lo cation in the list structure of the nonpath value asso ciated with each class of equivalent paths

The particular element in a class of equivalent paths that hold the nonpath value of the class is called

the physical representative of the class

Allowing relocate to pick a unique list representation for the subFD that it cut from the larger FD

using the quotient set representation involves

Using only direct paths

Dening a canonical physical representative for a path class

The choice of canonical physical representative can b e left to the particular nonmonotonic application

for which relo cate is needed A userprovided predicate cho osing which of two paths is the most canonical

can b e passed to relocate as an optional parameter The streak implementation uses this facility to dene

canonicity in a way that reects the realization relations which hold b etween the three layers of the draft

representation the reviser manipulates For example this predicate sp ecies that the features at the dss

layer are more canonical than those at the sss layer which are in turn more canonical than those at the dgs

layer

In the absence of such a userprovided sp ecication of which paths in a class should b e its physical

representative relocate uses a criteria that optimizes the cut subFD for subsequent pro cessing When

unifying two FDs the fuf interpreter typically sp ends most of its time following paths Eliminating indirect

paths is already a go o d source of optimization Another is to cho ose the shortest path as the physical

representative For paths of equal length relocate uses the alphab etical order as tiebreaker This is

why the physical representative Cb in smaller is a e and that of Ca is a b The example calls to

relocate in Fig B illustrates the fact this function simultaneously p erforms several tasks

Fetching the values of the genuinely nonlo cal paths eg i in smaller

Preserving the equality b etween lo cal attributes that are linked through paths meandering out and

back in the cutting scop e in the larger FD eg h a b in smaller

Moving the nonpath value of the equivalent path class to its canonical physical representant eg b

and d a b in smaller

Changing all the direct or indirect lo cal paths to the physical representative in the larger FD by direct

paths to the canonical physical representativeeg h a b in smaller

relocate outputs a list that is a canonical representation of the corresp onding quotient set This canon

ical representation is the most ecient among the many p ossible representations which corresp ond to the

same quotient set For example consider unifying the following FD c d with smaller and

smaller resp ectively Checking for the compatibility b etween the value of d in this FD and the corre

sp onding value in smaller involves visiting only two no des c and d Doing the same with smaller

requires visiting the no des c d a b f and d An imp ortant side eect of this default denition for the

physical representative is that relocate called with an empty path parameter can b e used to prepro cess an

FD by putting in the canonical form that is the most eciently pro cessed by the fuf interpreter Contrast

the output of relocate with that of topgdp with the same FD and path parameters and b oth given in

Fig B the function used by fuf to retrieve the value of a subFD in the monotonic context of unication

It do es not put the features of the subFD in their new lo cal context nor do es it p erforms any optimization

on its path values The relation b etween canonicity and eciency its further discussed in section

B The SURGE unication grammar of English

In this section I present the initial version of the syntactic grammar surge This was the version available

when I started the implementation of streak The extensions I made to surge yield version which is

presented in the next section surge is a p ortable frontend for the development of generation application

distributed as part of the fufsurge package fuf is the formalism part of the package and can b e used to

implement various comp onents of a generation system in the case of streak all four comp onents surge

is the data part of the package one particular know ledge source written in fuf implementing one reusable

comp onent

surge was implemented by MElhadad and represents his own synthesis within a single working

system of the descriptive work of several linguists most of which come from the systemic linguistic scho ol

and all but one of which noncomputational His main sources of inspiration were Fawcett and Lyons

for the semantic asp ects of the transitivity system Melcuk and Pertsov for its lexical asp ects

Halliday and Winograd for the other clause systems the nominal systems and the overall

organization of the grammar Pollard and Sag for the treatment of longdistance dep endencies and

last but not least Quirk et al for the many linguistic phenomena not mentioned in other works

yet encountered in many generation application domains In its incomparable comprehensiveness attention

to detail and wealth of examples Quirk et al exemplies the all to o rare typ e of linguistic work

that constitutes a genuine treasurehouse for the NLP practitioner In addition by b eing largely neutral

with resp ect to formalism and yet semantically and functionally oriented it is esp ecially wellsuited to a

generation p ersp ective Many of the extensions I made to surge were inspired by Quirk The goal of

this section is twofold

To lay ground for the presentation of this set of extensions in the next section by dening the starting

p oint

To provide an implementationlevel description of the dgs layer of internal utterance representation

that was abstractly describ ed in section

I will therefore not discuss the signicant p ortion of surge that was neither the ob ject of an

extension nor used in the particular sublanguage of streaks sp orts rep orting domain See Elhadad b

for a discussion of these asp ects In what follows I rst review the generation subtasks carried out by surge

I then discuss in some detail the treatment of the clause in general and the transitivity system in particular

I conclude by surveying the treatment of nominals and paratactic complexes

B The task of a syntactic grammar

An input to surge is a skeletal lexicosyntactic tree that sp ecies a sentence to generate by recursively

indicating the thematic role organization op enclass lexical items syntactic categories and asso ciated syn

tactic features of each constituent Internally surge is divided into two comp onents the syntactic grammar

prop er and the linearizer The grammar prop er is unied with the skeletal sentence sp ecication given in

the input to pro duce a corresp onding enriched sentence sp ecication that is guaranteed to contain all the

information needed by the linearizer to pro duce a grammatical string of prop erly inected English words

Both the grammar prop er and the skeletal input are FDs and so is the enriched sp ecication resulting from

their unication

The nature and scop e of this enrichment pro cess is now illustrated on a example sentence of maximum

simplicity They are winning The skeletal input accepted by surge for that sentence is shown b elow

cat clause

process type material effective no lex win tense presentprogressive

partic agent cat personalpronoun number plural

8

The corresp onding enriched description passed to the linearizer is shown in Fig B

8

This enriched sp ecication was manually p ostedited to show only the most essential features The real output contains

cat simpleclause

genericcat clause

process type material

lex win

effective no

tense presentprogressive

agentive yes

voice active

subcat oblique

cat simpleverbgroup

genericcat verbgroup

dativemove no

insistence no

polarity positive

modality none

person syntroles subject person

number syntroles subject number

event cat verb lex process lex ending presentparticiple

be tense present

person process person

number process number

ending process ending

lex be

cat verb

pattern dots be dots event dots

partic agent cat personalpronoun

number plural

genericcat np

person third

syntfunction subject

case subjective

syntax case subjective

animate yes

number plural

person third

definite yes

semantics index animate yes number plural person third

reference type specific possessive no interrogative no quantitative no

pattern determiner head dots

head cat pronoun

number plural

animate yes

pronountype personal

person third

case subjective

lex they

determiner cat articledet

genericcat det

headcat pronoun

det cat article lex

partitive no

oblique partic agent

syntroles subject oblique

verb process

focus subject

mood declarative

pattern dots start syntroles subject dots verb dots

Figure B Enriched sp ecication for the sentence They are winning

The overall task of a syntactic grammar is morphosyntactic grammaticalization and covers the following

subtasks

map the thematic structure onto the syntactic structure

p erform agreement

provide default values for the required syntactic features

cho ose closedclass words

compute precedence constraints

enco de grammatical paraphrasing

prevent overgeneration

Using the enriched sp ecication of Fig B I now illustrate what each of these subtasks involves

The roles of the thematic structures are semantic such as agent process affected etc Dep ending on

a variety of other factors such as voice these thematic roles are mapp ed onto dierent roles of the syntactic

structure such as subject verb object etc For the simple sentence of Fig B this mapping is shown in

9

lines the thematic structure of this sentence contains only two elements the process describ ed by

the sentence and its agent They are resp ectively mapp ed onto the two syntactic roles verb and subject

The pro cess to verb mapping is direct while the agent to sub ject mapping is indirect through an intermediate

oblique role for reasons explained in the overview of surges voice system section B As most of

the pro cessing p erformed by a syntactic grammar directly acts up on the syntactic roles thematic structure

mapping must b e the rst task carried out by a syntactic grammar It is largely indep endent from the other

semantic grammaticalization subtasks and could alternatively b e implemented as a separate comp onent

This comp onent would interface a lexical cho oser and a syntactic grammar accepting descriptions in terms

of syntactic roles But since thematic role mapping is domain indep endent it was integrated into surge

with the desire to provide a frontend accepting as highlevel as p ossible an input representation while

remaining reusable However b ecause it is implemented in fuf and thus works with partial information

through unication surge accept input descriptions at various levels It can thus also handle input already

sp ecied in terms of syntactic roles or even as a mix of thematic and syntactic roles A fufsurge user can

thus adjust the level at which surge takes over in the generation pro cess to suit the needs of a particular

application

A syntactic grammar must p erform agreement among the sentence constituents In surge this is done

by paths conating features whose value must coincide for a sentence to b e grammatical An agreement

resulting from a series of constraint propagations is enco ded as a chain of indirect paths For example the

verbsub ject agreement in the simple sentence of Fig B is enco ded in lines and The verb

p oints to the pro cess whose p erson and numb er in turn resp ectively p oint to the corresp onding features in

the sub ject

10

A syntactic grammar must provide a default value in general the most common one for every feature

required by the linearizer but not sp ecied in the input For example the numb er p erson and tense are

required for a verb for the linearizer to b e able to prop erly inect it In the example of Fig B numb er is

propagated from the agent numb er sp ecied in the input but p erson is simply added with its default value

third These defaults allow for concise input containing only default overriding features For example the

feature input shown ab ove includes the numb er feature to override its default singular value with plural

A syntactic grammar must cho ose closedclass words such as auxiliaries pronouns articles etc In the

three word sentence example at hand only the main verb to win was given in the input The two others

the auxiliary verb to b e and the sub ject pronoun they were added by surge cf lines and in

Fig B While some syntactic features are required to allow the linearizer to cho ose the right inection of

ab out features

9

Used here in its systemic linguistic sense to refer to any kind of situation involving participants abbreviated partic whether

an event an activity a state or a relation and thus not carrying any restrictive connotation with resp ect to asp ect

10

The notion of most common remains intuitive in surge since it it is not based on o ccurrence counts on large corp ora

an op enclass word others are required to allow the syntactic grammar prop er to cho ose the right function

words

A syntactic grammar must accept and compute precedence constraints among constituents In surge

this is done using the fuf metaattribute pattern sp ecically designed for this very purp ose Examples of

patterns are given on lines and of Fig B The value of this feature is a list whose elements can

b e either

The name of an attribute lo cated at the same level as the pattern feature in the FD structure eg

verb on line

A path leading to a feature lo cated at another level than the pattern feature in the FD structure eg

syntroles subject on line

The wildcard dots allowing the order sp ecied by the pattern to b e partial there are three of them

on line

Dummy features not corresp onding to any linguistic constituent but present as landmarks for the

expression of partial ordering constraint eg start on line

A sp ecial unication pro cedure is used for these patterns allowing progressive renement the order of

constituents inside the sentence as constraints on them b ecome available In Fig B

Line enco des precedence among the toplevel sentence constituents sub ject rst then verb

Line precedence inside the verb group auxiliary b e rst followed by main verb event

Line precedence inside the sub ject determiner rst empty in that particular case cf line

the head the pronoun they in that particular case

When unication nishes the nal complete pattern has b een computed It b ears a sp ecial meaning to the

linearizer It indicates which among all features of the enriched FD corresp ond to a syntactic constituent to

output in the sentence

Another task of a syntactic grammar is to enco de grammatical paraphrasing This paraphrasing involves

regular syntactic transformations such as passive or dative moves For example surge can generate the

following set of sentences from the same input description in terms of thematic roles

a Orlando handed a defeat to Toronto

a Orlando handed Toronto a defeat dative move

a A defeat was handed to Toronto by Orlando passive

A syntactic grammar must enco de the syntactic constraints on such paraphrasing to avoid overgeneration

For example supp ose that in the input description of the example ab ove the sp ecication for the a defeat

NP was changed to simply cat personalpronoun Then surge would blo ck the p ossible choice of the

dative move to avoid generating the ungrammatical second form b elow

b Orlando handed it to Toronto

b Orlando handed Toronto it dative move

d It was handed to Toronto by Orlando passive

B The clause subgrammar

The clause grammar is divided in three main systems

The transitivity system which handles the mapping from the thematic structure onto a default syntactic

structure for main assertive clauses

The voice system which handles departures from the default syntactic structure such as passive dative

moves clefting dislo cation etc

The mo o d system which handles interrogative imp erative sub ordinate and emb edded forms

I overview the surge implementation of each of these systems in turn in the following paragraphs

B The transitivity system

The thematic roles accepted by surge in the input description of a clause are divided into two broad classes

nuclear roles called participants and abbreviated partic and more p eripheral ones called circumstantials

abbreviated circum Intuitively participants answer the questions whowhat was involved ab out the

pro cess describ ed by the clause whereas circumstantials answer the questions whenwherewhyhow did

11

it happ ened Semantically pro cesses are classied in a hierarchy of pro cess typ es and each typ e is

asso ciated with its sp ecialized set of participant roles In contrast circumstantial roles are versatile and

can attach to pro cesses of virtually any typ e Syntactically three tests can b e used to decide whether a

particular sentence constituent is a participant or a circumstantial Is it movable Is it obligatory Can it

become subject In general

Participants cannot b e moved around in the clause without aecting the other elements Circumstan

tials can

Participants cannot b e omitted from the clause while preserving its grammaticality Circumstantials

can

Nonsub ject participants can b ecome sub ject via transformations such as passivation Circumstantials

cannot

12

Several criteria are needed b ecause none of them is always strictly applicable However taken together

they allow distinguishing the participants from the circumstantials of almost any clause Consider for ex

ample the sentence b elow

Orlando defeated Toronto yesterday Its pro cess is expressed by the verb to defeat and it has three

thematic roles resp ectively expressed by Orlando Toronto and yesterday Yesterday can b e moved

13

upfront without interfering with the rest of the clause whereas Orlando and Toronto cannot

a Yesterday Orlando defeated Toronto

b Defeated Toronto yesterday Orlando

c Toronto Orlando defeated yesterday

Yesterday can b e omitted from whereas Toronto and Orlando cannot

a Orlando defeated Toronto

b Defeated Toronto yesterday

c Toronto defeated yesterday

Finally the passive transformation allows Toronto to b ecome the sub ject of the verb to defeat but

there is no such transformation to do the same with yesterday

a Toronto was defeated by Orlando yesterday

b Yesterday was defeated Toronto by Orlando

Therefore in ab ove b oth Toronto and Orlando are participants whereas yesterday is a cir

cumstantial

I now present the set of participants and circumstantials dening the vo cabulary of thematic structures

accepted as input by surge Participants are asso ciated with sp ecic typ es of pro cesses classied in a

11

The particular hierarchy of pro cess typ es enco ded in surge is presented in detail later on in this section

12

ie there are exceptions for each these rules

13

Although c is strictly sp eaking grammatical it is appropriate only in very restricted textual contexts in contrast to

a which is appropriate in as many textual contexts as

semantic hierarchy At the top of this hierarchy surge distinguishes b etween simple and composite pro cesses

Simple pro cesses corresp onds to clauses with to participants and no causative or resultative meaning

Comp osite pro cesses corresp ond to causative or resultative clauses with or participants In English a

clause can have at most participants

Figure B shows the subhierarchy of simple pro cess typ es in surge illustrating each participant

structure lo cated at the leaf of this hierarchy by an example sentence Simple pro cesses are rst divided into

two broad classes events whether punctual or durative and relations whether stative or dynamic

Events are then further divided into material events involving entities of the physical realm either literally

or metaphorically mental events involving entities of the cognitive and emotional realms and verbal events

involving entities of the communicative realm Op erating in orthogonal semantic realms these three typ es of

events have entirely disjoint sets of participants There are only material events in the semantic subdomain

of streak

A material event can b e either agentive or eective or b oth It is agentive if the clause presents the event

as resulting from the action intentional or not of some agent It is eective if the clause sp ecies the eect

of that action This eect can b e either dispositive if it aects a preexisting entity or creative if it creates

a new one All nonagentive material events have a single participant Aected or Created However there

is a typ e of material noneective events involving two participants The rst is the Agent and the second

is called the Range b ecause it generally sp ecies some domain for the activity undergone by the agent For

example in Tony climb ed the mountain the mountain is neither aected nor created by Tony activity

but only sp ecies the range of Tonys climbing In general given a clause of the form

Subject Verb Object the test to decide whether Object is a Range is the paraphrase

What Subject did to bject was to Verb it

This paraphrase is valid for Aected and Created ob jects but not for Range ob jects as shown by the non

sensicality of

What Tony did to the mountain was to climb it

When used with very general verbs eg to make to take to get to have the Range participant

sp ecies the action itself in a nominal form In that case the verb and the noun heading the NP lling the

Range participant are verbob ject collo cations as in Michael takes a shower

Relations are divided into attributive relations linking an entity to one of its prop erty and equative rela

tions linking two entities The former ascrib es an attribute participant the prop erty to a carrier participant

the entity The latter identies one entity the identied participant by relating it to another entity the

identier In general equative relations are reversible either lexically or through passive transformations

whereas attributive relations are not For example Bo owns this car can b e paraphrased by

This car b elongs to Bo or by This car is owned by Bo

whereas Bo has long hair cannot b e paraphrased neither by

Long hair b elongs to Bo nor by

Long hair is had by Bo

The example ab ove is a case of possessive relation This typ e of relation is distinguished b ecause a

large class of relational meanings involve ownership either literally or metaphorically An even larger class

of relational meanings involve a lo cation also either literally or metaphorically and are distinguished as

locative relations The remaining typ es of relations are called ascriptive and they include copulative clauses

Note that this distinction among ascriptive possessive and locative relations is orthogonal to the distinction

among attributive and equative relations

Following a lo calist approach cf Lyons pp surge views existential clauses as expressing

a lo cative relation lacking a lo cation participant For example There is a catch is analyzed as a syntactic

shortcut to the semantically equivalent There is a catch here or There is a catch somewhere surge also

classies as lo cative the zero participant clauses describing natural phenomena For example It rains is

analyzed as a syntactic shortcut to the semantically equivalent There is rain falling here Accompaniment

and temp oral relations are also analyzed as lo cative metaphors Relations that are literally lo cative are

called spatial

event material Agentive simple Michael squats

noneective Agent

with range Michael takes a shower

Agent Range

agentive creative Michael makes falafels

eective Agent Created

disp ositi ve Michael eats falafels

Agent Aected

eective Falafels co ok

nonagentive Aected

creative Windows p op

Created

mental Francisco thinks

Processor

Francisco liked how he won

Processor Phenomenon

verbal Michael sp eaks

Sayer

Michael talked to Cathie

Sayer Addressee

Michael said strange stu

Sayer Verbalization

relation ascriptive attributive Michael is very busy

Carrier Attribute

equative The hunter is the hunted

Identied Identier

p ossessive Francisco owns a big b oat

Possessor Possessed

Identied Identier

attributive Francisco has long hair

Possessor Possessed

Carrier Attribute

lo cative natural It rains

;

existential There is a catch

; Located

accompaniment Michael is with his wife

Located Accompaniment

temp oral The end is so on

Located Time

spatial attributive Michael is far away

Located Location

Carrier Attribute

equative The tree reaches the ro of

Located Location

Identied Identier

Figure B Simple pro cess hierarchy in surge

Figure B shows the subhierarchy of comp osite pro cesses typ es in surge again with each participant

structure lo cated at the leaf of this hierarchy illustrated by an example A comp osite pro cesses results

from the sup erp osition within a single syntactic structure of two semantic structures an event participant

structure and a relation participant structure One syntactic constituent merges these two structures by

simultaneously realizing two participants one from the event structure and one from the relation structure

This meaning sup erp osition is p ossible b ecause of an underlying causeeect relationship b etween the event

and relation comp onents of the clause This comp ositional and unifying view based on Fawcett allows

the analysis of explicitly causative clauses eg Mary made him a go o d man together with the other typ es

of clauses with three participants eg Mary gave John a b o ok Mary put the b o ok on the table It

14

neatly eliminates the need for a beneciary participant which simply b ecomes b oth aected by the agents

action and the carrier of the relation resulting from that action

There are thus three main criteria to classify comp osite pro cesses

The pro cess typ e of its event comp onent in columns and of Fig B

The pro cess typ e of its relation comp onent in columns and of Fig B

Which participants of the event and relation structure are merged into one syntactic constituent in

dicated by alignment b elow each example sentence in Fig B

The choice of the merging anchor b etween the two participant structures elegantly accounts for the se

mantic dierence b etween sentences like

He made the Knicks a go o d team where the ob ject is b oth Aected and Carrier

vs sentences like

He made the Knicks a go o d coach where the sub ject is b oth Agent and Carrier

In the current implementation of surge only material events take part in comp osite pro cesses A similar

comp ositional approach involving mental and verbal events is p ossible as well It would signicantly extend

the coverage of ditransitive and complex transitive clauses However not all such clauses can b e analyzed

within this framework This is not a limitation of this particular approach but rather of any approach based

on a general participant set No matter how comprehensive comp ositional and welldened such a general

set may b e it is always p ossible to come up with a particular verb whose argument structure do es not neatly

ts into it The immense semantic variety of verbal arguments led to lexicalist approaches eg Melcuk

and Polguere where no attempt is made to make any generalization across verbs Each verb is treated

as idiosyncratic with participants not semantically lab eled and instead given an arbitrary numb er However

from a knowledge engineering p ersp ective it is always b est to capture as many generalizations as p ossible

To do this surge relies on general participant structures of gures B and B for most verbs and falls

back on the lexicalist approach for those verbs whose argument structure do es not t any of these participant

structures Since only one verb to help did not corresp ond to any of surges general participant sets in

the target sublanguage of streak I will not discuss surges lexical processes here See Elhadad b

The set of circumstantials accepted in input by surge is given is Fig B These roles can app ear

in the input description of a clause indep endently of its pro cess typ e The Reason Purpose Instrument

Manner and Accompaniment circumstantials are based on Winograd while Behalf comes from Halliday

The lo cative and temp oral circumstantials are somewhat adho c For most of these circumstantials

surge can generate only prep ositional realizations In Fig B NP is indicated in parenthesis next to

PP b ecause these prep ositional circumstantials are sp ecied in the input as NPs It is surge that maps

them onto a PP headed by the corresp onding default prep osition

The participant sets of surge allows the generation of a wide variety of b oth the simple sentences

made of a single clause and of the more complex sentences containing emb edded clauses in nominal function

The set of circumstantials ab ove also allows the generation of a few cases of the even more complex sentences

containing emb edded clauses in adverbial function The variety of semantic relation and syntactic realization

of adverbial clauses in English is far greater than what is covered in Fig B I extensively come back to

the limitations of this set of circumstantials in section B

14

Whose denition is always problematic

Event Typ e Relation Typ e Example

agentive noneective ascriptive attributive Johnson b ecame rich

Agent

Carrier Attribute

equative The Bulls b ecame the Champs

Agent

Identied Identier

p ossessive attributive Orlando picked Shaquill e

Agent

Carrier Attribute

lo cative Seikaly went to Miami

Agent

Carrier Located

disp ositive ascriptive Riley made New York a go o d team

Agent Aected

Carrier Attribute

Riley made New York a go o d coach

Agent Aected

Carrier Attribute

equative The Nets made Coleman the richest

Agent Aected

Identied Identier

p ossessive attributive The Nets gave Coleman more money

Agent Aected

Possessor Possessed

lo cative Price threw the ball out of b ounds

Agent Aected

Located Location

creative ascriptive Peter brews his b eer very strong

Agent Created

Carrier Attribute

lo cative Michael op ened windows on the screen

Agent Created

Located Location

nonagentive disp ositive ascriptive The game turned wide op en

Aected

Carrier Attribute

p ossessive Coleman received an oer

Aected

Carrier Attribute

lo cative Coleman fell on the o or

Aected

Located Location

creative The windows p opp ed on the screen

Created

Located Location

Figure B Comp osite pro cesses in surge

semantic syntactic Example

class role category

Lo cative OnLo c PP NP Bo kissed her on the platform a

InLo c PP NP In Kansas City Bo called Gwen b

AtLo c PP NP At Kansas City Bo called Gwen c

FromLo c PP NP From Kansas City Bo called Gwen d

ToLo c PP NP Bo sent the letter to Kansas City e

Temp oral Time Adverb Yesterday Bo triumphed f

PP NP On Monday Bo triumphed g

Background nite S As he received the ball Bo smiled h

ing S After receiving the ball Bo smiled i

ed S Once injured Bo grimaced j

Causative Reason PP NP Because of injury Bo did not play k

nite S Because he was injured Bo did not play l

Purp ose PP NP For enough money Bo would play anywhere m

innitive S To gain freeagency Bo held out n

Behalf PP NP For the Raiders Bo scored twice o

Determinative Accompaniment PP NP With her Bo would go anywhere p

Pro cess Instrument PP NP Bo pushed him with b oth hands q

Manner Adverb Tenderly Bo kissed her r

Figure B Circumstantials in surge

B The voice system

The tables of gures B and B show the diversity of participants asso ciated with the dierent semantic

pro cess typ es that surge accepts as input descriptions However for all these pro cess typ es each participant

ultimately surfaces up as one of the only seven p ossible syntactic roles allowed by the syntax of English for

15

realizing participants given here in their partial precedence order

subject indirectobject directobject dativeobject passiveobject subjectcomplement objectcomplement

For each participant set listed in the previous section surge must thus cho ose a mapping onto a subset

of these syntactic roles This mapping is manytomany and dep ends not only on the participants seman

tics but also up on a variety of other factors such as voice dative moves clefting and dislo cation surge

decomp oses this complex manytomany mapping into two stages

A manytoone mapping from participants to oblique roles taking into account only the participants

semantics

A onetomany mapping from oblique to syntactic roles taking into account only the clause structure

variations ie voice dative etc

Oblique roles are simply lab eled by integers from to reecting the roles degree of centrality to the

clause For a ditransitive clause in the default active nondative form oblique is sub ject oblique is

indirectob ject and oblique is directob ject This twostage mapping is illustrated on the example sentences

b elow

a Orlando defeated Toronto

Agent Aected

Oblique Oblique

Subject DirectObject

15

The order is partial b ecause except for the verb none of these elements is obligatory in every clause

nite declarative The Knicks won the game

interrogative yesno Did the Knicks win the game

wh Who won

b ound nominal That the Knicks won the game was exp ected

relative simple the game that the Knicks won

emb edded the team against which the Knicks won the game

nonnite imp erative Win that game

participle present The Knicks celebrated after winning the game

past Once the game won the Knicks celebrated

innitive to The Knicks are able to win the game

forto Ewing must dominate for the Knicks to win the game

Figure B Mo o d in surge

b Toronto was defeated by Orlando

Aected Agent

Oblique Oblique

Subject PassiveObject

a Michael said something

Sayer Verbalization

Oblique Oblique

Subject DirectObject

b Something was said by Michael

Verbalization Sayer

Oblique Oblique

Subject PassiveObject

The intermediate oblique layer allows to capture the syntactic similarity of the passive transformation

from a to b and a to b resp ectively despite the semantic dierence b etween the participant sets

in these two sentence pairs

B The mo o d system

The mo o d system deals with the variations in clause form that dep ends on the typ e of sp eech act a clause

is to p erform ie whether it is an assertion a request for action a request for information etc as well

as to the p osition and syntactic function that the clause o ccupies in the overall sentence structure As for

pro cess typ es clausal mo o ds are organized in a hierarchy Figure B shows the mo o d hierarchy covered

by surge with an example sentence provided for each leaf element

The toplevel mo o d distinction is b etween nite clauses whose verb is conjugated and agrees in p erson

16

and numb er with the sub ject and nonnite clauses whose verb is invariant

Finite clause are further decomp osed into interrogative clauses mostly used for requesting information

and declarative bound and relative clauses mostly used for asserting facts The dierence b etween the latter

three lays in their resp ective p osition in the sentence structure namely

Main ie the toplevel sentence for declarative mo o d

Sub ordinated ie mo difying another clause for b ound mo o d

16

When the sub ject is not explicitly present but instead controlled by an emb edding clause the emb edded clause verb must

agree with the sub ject of the emb edding clause

Emb edded ie mo difying an NP for relative mo o d

Only one typ e of nonnite clause can b e a main clause imperative clauses mostly used for requests The

two other typ es of nonnite clauses participle and innitive clauses are used for assertions and are either

sub ordinated or emb edded

The target sublanguage of streak contained only clauses at mo o ds used for assertions Some of these

mo o ds though were not covered by surge I discuss the extensions I made to surges mo o d system in

section B

B The nominal subgrammar

The fundamental fact separating nominals from clauses is that the former are considerably more semantically

versatile than the latter There is a nominal paraphrase for virtually any piece of content expressed by any

other syntactic category whereas the availability of clausal paraphrases for content expressed by other

categories is the exception For example nominalizations allow the transformation of the clausal description

of an event into a nominal one eg

Orlando defeated Toronto can b ecome

The defeat of Toronto at the hand of Orlando or

Orlandos victory against Toronto etc

But there is no clausal form available to simply refer to an entity indep endently of their taking part in

any event or relation as the nominals Orlando and Toronto do Beyond nominalization NPs whose

head denote a very broad category like fact event prop erty place etc allow the content range of

nominals to include the content range of the other main syntactic category as illustrated by the examples

b elow

on top of the tree PP the place on top of the tree NP

Michael is busy attributive clause the fact that Michael is busy NP

knowledge intensive adjectival phrase the prop erty of knowledge intensiveness NP

Because nominals cover such an unrestricted range of meaning no semantic classication of the entity

typ es they convey has b een worked out to the date that matches in comprehensiveness and could b e

integrated with the systemic linguistic semantic classication of the pro cess typ es that underlies the clause

grammar of surge Some interesting investigations in that direction have b een carried out by some authors

such as Vendler or Levi but each discuss only a narrow range of nominal meanings and more

studies fo cusing on other ranges are needed b efore the synthetic integration of these indep endent eorts

within a unifying framework can even b e contemplated

Consequently at the nominal rank surge accepts descriptions only in terms of syntactic roles instead

of either as thematic or syntactic roles at the clause rank In a generation application it thus b ecomes

the resp onsibility of the lexicalizer to map in the applications restricted range of nominal meanings the

thematic role substructure it receives as input onto the syntactic role structure to pass to surge How this

task is p erformed in streak is discussed in section The sp ecic syntactic roles accepted by surge for

nominal description are given in their partial order of precedence

Determinersequence Describer Classier Head Qualier

The determinersequence is itself decomp osed into the following elements

Predeterminer Determiner Ordinal Cardinal Quantier

The syntactic category accepted as ller for each of these roles by surge together with an illustrative

example nominal is given in Fig B The constituent lling the syntactic role describ ed at each row is

b oldfaced in the example

The order of precedence ab ove is only partial b ecause only the determiner and the head are obligatory

All the others are optional mo diers They can all coo ccur together with one exception the quantier is

grammatical syntactic example

function category

PreDeterminer all of his rst ten p oints

Determiner article a victory

demonstrative Pronoun this victory

question Pronoun what victory

p ossessive Pronoun their victory

NP New Yorks victory

Ordinal simple Numeral the third victory

Cardinal simple Numeral seven victories

Quantier Twice as many p oints

Describ er Adjective an easy victory

present Participle a smashing victory

past Participle a hard fought victory

Classier Noun a road victory

Adjective a presidential victory

present Participle a winning streak

Head common Noun a victory

Qualier PP a victory over the Nets

relative S a victory that will b e rememb ered

toinnitive S a victory to b e rememb ered

fortoinnitive S a victory for them to grab

Figure B NominalInternal grammatical functions and llers in surge

exclusive with the ordinal andor cardinal The three example nominals b elow illustrate the simplest and

most complex patterns of syntactic roles

victories

Determiner Head

half of their rst four hard fought overtime victories on the road

Predeterminer Determiner Ordinal Cardinal Describer Classier Head Qualier

half of their many hard fought overtime victories on the road

Predeterminer Determiner Quantier Describer Classier Head Qualier

The same semantic element can b e conveyed by dierent syntactic roles For example a p ossessive relation

can b e expressed by mapping the p ossessed entity onto the head and the p ossessor onto either the determiner

genitive case as in

the teams title

Determiner Head

Possessor Possessed

or the qualier as in

the title of the team

Head Qualier

Possessed Possessor

For the expression of such a p ossessive relation this nominalrank alternation parallels the verb valency

alternation at the clause rank allowing fronting and thus fo cusing dierent elements of the relation

The team owns the title

Subject Verb DirectObject

Possessor Possessed

The title b elongs to the team

Subject Verb DirectObject

Possessed Possessor

What syntactic role a given sub constituent lls within the overall nominal structure can in most cases

b e straightforwardly determined from its syntactic category and its relative p osition to the head andor other

mo diers The nontrivial case arises where a single adjectival premo dier o ccurs b etween the determiner

sequence and the head Is this adjective a describ er or a classier The test to take that decision is to

attempt to use the premo dier in a synonymous attributive clause If it is also valid in such attributive

usage it is a describ er otherwise it is a classier So for example the validity of

the victory was easy tells us that easy in an easy victory is a describ er

while the invalidity of

the victory was presidential tells us that presidential in a presidential victory is a classier

From a communicative goal p ersp ective the classier tends to bring information assumed known to

the hearer and allowing referent identication while the describ er tends to bring new information ab out a

referent already identied by the hearer

Note that in Fig B the only p ossible ller for the head is a common noun This is b ecause surge

treats proper nominals as an entirely dierent syntactic category from the common nominals describ ed so

far These prop er nominals can only consist of an unmo died prop er noun eg Orlando More complex

prop er nominals with mo diers eg streaking Orlando can therefore not b e generated by surge I

come back to this limitation in section B

B Rest of the syntactic grammar

The only asp ect of surge relevant here outside the clause and nominal subgrammars is the system

dealing with paratactic complexes This system is orthogonal to the dierent subgrammars dealing with

the sp ecics of each syntactic category It thus treats paratactic complexes generically They are three typ es

of such complexes in surge conjunctions app ositions and lists

Conjunctions and app ositions satisfy the constraint that each element in the complex needs to b e of the

17

same syntactic category whereas the third is for syntactically heterogeneous complexes Syntactically

conjunctions dier from app ositions in that a co ordination conjunction links the p enultimate element to the

last one In an app osition only a punctuation mark separate these elements Semantically conjunction is a

versatile construction that can express a vast array of dierent relations among its elements including logical

causal temp oral and spatial ones The precise relation can only b e inferred from the context App osition

requires its elements rarely more than two to b e coreferent

Figure B contains an example input description for each typ es of paratactic complex It also contains

several features of the fufsurge package worthwhile noticing at this p oint since they app ear again in the

streak domain examples shown later in this chapter

The presence of a complex attribute indicates that the constituent in question is a paratactic complex and

its value sp ecies which typ e of complex The attribute distinct contains the list of elements aggregated in

18

the complex Since in fuf the value of an attribute can only b e an atom or recursively a list of attribute

value pairs lists are internally represented as carcdr pairs The following distinct is a fuf

macro that renders FDs with list values more legible For example the FD a

will internally expand as a car cdr car cdr none

Note how in the app osition example in the center of Fig B the toplevel category np is sp ecialized into

17

Or more precisely of categories related by subsumption since syntactic categories are organized in a hierarchy in surge

18

Except for metaattributes like cset pattern etc

Orlando nipped Toronto and trashed Milwaukee

cat clause

complex conjunction

distinct process type material lex nip

partic agent cat proper lex Orlando

affected cat proper lex Toronto

process type material lex trash

partic agent cat proper lex Orlando

affected cat proper lex Milwaukee

Orlando the division leader

cat np

complex apposition

distinct cat proper lex Orlando

cat common classifier lex division head lex leader

a victory over Toronto that extended Orlandos streak to five games

cat np

definite no

head lex victory

qualifier cat list

distinct cat pp

prep lex over

np cat proper lex Toronto

cat clause

tense past

process type composite relationtype locative lex extend

mood relative

scope partic agent

partic affected cat common

determiner cat proper lex Orlando

head lex streak

located affected

location cat pp

prep lex to

np cat common

cardinal value

head lex game

Figure B Example input descriptions of paratactic complexes in surge

proper and common in each element of the app osition When an element do es not contain a cat feature it

inherits the corresp onding feature from the toplevel of the complex This is the case in the clause conjunction

example at the top of the gure

The relative clause of the list example at the b ottom of the gure contains the feature scope In a

relative clause one of the semantic participants is the entity that is referred to by the very nominal that the

relative clause qualies The scope feature indicates which participant this is Without this feature this

input would not unify with surge b ecause it lacks the Agent participant which is required for comp osite

pro cesses of typ e materiallo cative The fact that the list complex lls a nominalinternal syntactic role is

not fortuitous One of the main usage of cat list in surge is to allow several describ ers classiers or

qualiers that all directly mo dify the NP head instead of each other

B SURGE extensions to complex sentences and quantita

tive nominals

In this section I present the extensions I made to the initial version of surge that was presented in the

previous section These extensions were originally motivated by the fact that most sentences observed in the

corpus of humanwritten sp orts rep orts contained syntactic constructs b eyond the coverage of surge

These constructs fall essentially into two main classes adverbial relations that link several clauses together

inside complex sentences and sp ecial nominals that refer to quantities and their historical signicance These

gaps in surges coverage were thus not sp ecic to the particular sp orts domain used as a testb ed for

the system implementation asp ect of this research They more generally concerned any application in which

either a sizeable variety of complex sentences or comparative references to quantities need to b e generated

In extending surge I did not consider only the set of constructs strictly necessary for the particular

application of streak I instead considered the two sets of missing constructs adverbial complements

and quantitative nominals in a comprehensive manner For adverbial complements I synthesized several

descriptive noncomputational linguistic works on these issues within the unifying computational framework

of surge For quantitative nominals faced with the dearth of linguistic work mentioning them I essentially

came up with my own treatment after analyzing their usage in two dierent domains sp orts and the sto ck

19

market

The systematic review of these two sets of constructs insured the robustness of the resulting extended

version of surge It represents a signicant contribution to the growth and maturing of the fufsurge

package as a p ortable syntactic pro cessing frontend for generation expanding its widecoverage from the

simple clause rank b oth down to the nominal rank and up to the complex sentence rank The extended

surge has since b een used for the development of several others generation applications at Columbia

University including automated do cumentation Kukich et al verbal descriptions of visual scenes

Ab ella and business letters The extensive coverage of adverbial complements provided by surge

proved extremely useful in these three other domains

In what follows I start by illustrating the various gaps in the coverage of surge that were the ob ject

of an extension using an example sentence from the corpus of humanwritten sp orts summaries I then

review in turn the various extensions done to the clause and nominal subgrammars of surge

B Starting p oint

20

Consider the following lead sentence extracted from the corpus of humanwritten sp orts summaries

East Rutherford NJ Charles Barkley scored a season high points and made of

three point shots to add Friday night helping the surging Suns pul l out a victory over the New

Jersey Nets losing their third home game in a row against Phoenix

This sentence contains several syntactic constructs not covered by surge First consider the adverbial

innitive clause helping in a row forming the entire second half of the sentence It cannot b e generated

by surge for two dierent reasons

The semantic relation that this attachment realizes The only typ es of sub ordinate adverbial present

participle clauses covered by surge are those whose function is to indicate the time of the actions

conveyed by the matrix The helping in a row dep endent clause do es not match this typ e of

clause neither semantically since this time is indicated by the NP Friday night just preceding this

dep endent clause nor syntactically since it do es not start by a temp oral sub ordinating conjunction

eg after before as required of temp oral background clauses

19

Numerous discussions with MElhadad and BPassoneau ab out these constructions have tremendously help ed form the

analysis I prop ose here

20

Strictly sp eaking this sentence did not actually o ccur in the corpus For the sake of presentation it was rather reconstructed

from pieces of dierent sentences each of them actually o ccurring in the corpus This reconstruction conveniently allows every

class of syntactic constructs that required an extension to surge to b e illustrated on this single example

The nature of the matrix constituent onto which it is attached It is not solely Ma jerles p erformance

that help ed the Suns win but rather b oth Barkleys and Ma jerles p erformances The helping

adverbial clause therefore mo dies the conjunction of clauses forming the entire rst half of the sentence

Charles Barkley and Dan Majerle adding In surge adverbial elements could only

separately mo dify individual clauses not globally mo dify paratactic complexes

Therefore b oth the paratactic complex subgrammar and the part of the transitivity system handling

adverbial roles in the clause subgrammar must b e extended The direct ob ject of this helping in a

row dep endent clause is itself a clause pul l out in a row This bare variety of innitive clause which

lacks the frontal to was not covered by surge The mo o d system of the clause subgrammar must

therefore b e extended as well

Now consider the referring expressions in sentence The losing team is referred to as the New Jersey

Nets losing their where the prop er noun New Jersey Nets is p ostmo died by the qualifying present

participle clause losing their surge did not cover nominals of this form since it allowed only

the relative and innitive mo o d for qualifying clauses and more imp ortantly it did not allow mo dication

of prop er nouns Note also the premo dier in the rst reference to the winning team the surging Suns

This winning team is mentioned twice but only half of its complete prop er noun is used each time the Suns

rst and Phoenix subsequently The treatment of prop er nouns must thus also extend to such comp ound

cases

Now consider the ob ject of the losing teams p ostmo difying clause Using the nominalinternal syntactic

roles of surge it could b e analyzed as follows

their third home game in a row

Determiner Ordinal Classier Head Qualier

Semantically this analysis is erroneous b ecause the PP in a row do es not mo dify the head noun game

like the PP against Phoenix do es This can b e shown by the contrast b etween

the p erfectly ne a home game against Phoenix

and the nonsensical a home game in a row

Instead in a row really mo dies the ordinal third as in the synonymous

their third straight home game

a fact obscured in due to the discontinuous nature of the mo dication Neither ordinal mo dication nor

any typ e of nominalrank discontinuous mo dication was covered in surge

Another class of nominals not covered by surge yet p ervasive in quantitative domains is illustrated

by Barkleys scoring statistic a season high points in sentence

The bracketing a season high points where season high and are simply analyzed as a

classier and cardinal of the head points is invalid since the cardinal must always precede the classier in

an NP

Similarly the bracketing a season high points where season high is analyzed as a complex

cardinal similarly to the third straight complex ordinal ab ove is also invalid since a indenite determiner

like a cannot coo ccur with a cardinal in an NP

The only p ossible bracketing is thus a season high points requiring the coverage of a new typ e

of quantitative NP head called a measure Such a sp ecialized quantitative category must also b e allowed to

app ear as classier as shown by the game result NP

a victory

The last syntactic construct of the sentence ab ove not covered by surge are partitives such as

Ma jerles sho oting p erformance in sentence of three point shots

In conclusion generating sentences like sentence requires extending

The set of semantic relations syntactic categories and matrix constituents of adverbial complements

The set of clausal mo o ds

The set of syntactic categories app earing as nominal head ordinal classier and qualier

The set of comp ound syntactic categories

B Extensions to the clause subgrammar

In this section I present the extensions I made to the clause subgrammar of surge The goal of these

extensions is to expand the widecoverage of surge from the single clause rank to the complex sentence

rank It thus fo cuses on the subpart of the transitivity system dening the various adverbial constituents

of the clause Among the various adverbial meanings it further fo cuses on those realizable by either a PP

a nominal or a sub ordinate clause Adverbial meanings which are realizable only by way of an adverb were

thus excluded from this study since one of the recurrent theme of this thesis is crossranking paraphrasing

to which semantic classes realizable by a single syntactic category have no imp ort

There were several sources for candidate adverbial complement classes to cover in surge

The subcorpus of humanwritten summaries whose indepth analysis was presented in section

The comprehensive lists of adverbial classes indep endently presented from dierent angles in Chapters

and of Quirk et al

The more limited lists of adverbial classes presented in Halliday Thompson and Longacre

and Talmy

In general I considered only forms that o ccurred indep endently in two of these sources considering each

chapter of Quirk et al as an indep endent source In what follows I do not present the painstaking

derivation and crossexamination pro cess that this study involved I instead presents the end result from

the p ersp ective of a surge user switching from the to the version The linguistic issues and concepts

related either to the adverbial classes entirely new to the version or to the new treatment of classes

partially covered the version are discussed in some detail

B Syntactic typ es of adverbial constituents

There is an asymmetry b etween the treatment of participants and the treatment of circumstantials in surge

Participants are thematic roles that are mapp ed onto syntactic roles such as subject verb directobject

These syntactic roles in turn app ear in the pattern metaattributes indicating to the linearizer the syntactic

constituent structure of the sentence to generate In contrast circumstantials are thematic roles that are

not mapp ed on any syntactic roles The linearizer knows which should app ear in the sentence string and

where b ecause the pattern features contains paths that p oints directly to the circumstantial thematic roles

themselves So for example the toplevel pattern for the sentence Orlando defeated Toronto yesterday

where yesterday is a Time circumstantial is

pattern

dots syntroles subject dots syntroles verb dots syntroles directobject dots

circum time

This approach assumes that there is no syntactic distinction among the various circumstantials However

Quirk et al distinguish b etween several classes of circumstantials that they call adverbials each

characterized by a distinct syntactic b ehavior First they distinguish among adjuncts disjuncts subjuncts

and conjuncts Then they further distinguish b etween predicate adjuncts and sentence adjuncts

B Adjuncts vs Disjuncts Sub juncts and disjuncts primarily convey mo dality and discourse

cues and can only b e realized by adverbs eg really only then and never by NPs PPs or clauses

They are thus b eyond the scop e of the present extensions and are were not implemented in surge In

contrast surge incorp orates the distinction b etween adjuncts and disjuncts Quirk et al give

21

four tests to distinguish adjuncts from disjuncts Only adjuncts can

21

And in fact from sub juncts and conjuncts as well

Be clefted

App ear in an alternative question

App ear within the scop e of a fo cusing sub junct like only

Be elicited by a question

I illustrates these four test on the following pair of example sentences

Bo missed the game b ecause he was injured

Since he was injured Bo missed the game

It is interesting to note that these two sentences are synonymous and that b oth contain a Reason adverbial

in b old However the dep endent clause he was injured functions as an adjunct when linked to the matrix

clause by the sub ordinating conjunction because but it functions as a disjunct when linked to the matrix

by the sub ordinating conjunction since as shown by the four following tests

Clefting

It was b ecause he was injured that Bo missed the game

It was since he was injured that Bo missed the game

Alternative question

Did Bo miss the game b ecause he was injured or b ecause he was susp ended

Did Bo miss the game since he was injured or since he was susp ended

Focusing subjunct

Did Bo miss the game only b ecause he was injured

Did Bo miss the game only since he was injured

Elicited by question

Why did Bo miss the game Because he was injured

Why did Bo miss the game Since he was injured

The example ab ove underlines the imp ortance of making systematic distinctions among adverbials the

decision by surge to map a Reason circumstantial sp ecied in its input onto an adjunct or disjunct is b oth

lexically and syntactically constrained

B Predicate vs Sentence Adjuncts Predicate adjuncts are distinguished from sentence

adjuncts on b oth semantic and distributional grounds Semantically a predicate adjunct mo dies only the

verb of the clause whereas a sentence adjunct mo dies the whole clause This can b e prob ed by using the

question form What X did to Y was to Z as illustrated b elow Predicate adjuncts are thus more nuclear

to the clause than sentence adjuncts In fact as shown in Fig B the clause can b e analyzed as a set of

successive layers from the verb which is at the core to disjuncts at the outer frontier with the intermediate

syntactic functions in b etween

Distributionally predicate adjuncts cannot app ear in frontal p osition without the intention to pro duce

a striking rhetorical eect appropriate only in very restricted contexts In contrast sentence adjunct can

b e freely fronted without aecting the rhetorical eect of the sentence Moreover when b oth a predicate

adjunct and a sentence adjunct coo ccur in trailing p osition in the same clause the predicate adjunct must

app ear rst The two sentences b elow illustrate the dierence b etween predicate and sentence adjuncts

She kissed Bo on the cheek predicate adjunct

She kissed Bo on the platform sentence adjunct

The contrast b etween the predicate adjunct of and the sentence adjunct of is underlined by the

following test transformation sentences Sentence Adjuncts

Oblique−N

Oblique−1

Verb

. . .

Predicate Adjuncts

Disjuncts

Figure B Nuclearity of clause constituents

First rhetorically

b On the cheek she kissed Bo She kissed Bo on the cheek

whereas

b On the platform she kissed Bo She kissed Bo on the platform

Then

c What she did to Bo was to kiss him on the cheek

d What she did to Bo on the cheek was to kiss him

whereas

c What she did to Bo was to kiss him on the platform

d What she did to Bo on the platform was to kiss him

Furthermore

a On the platform she kissed Bo on the cheek

b On the cheek she kissed Bo on the platform

and

c She kissed Bo on the cheek on the platform

d She kissed Bo on the platform on the cheek

Sentences ab ove show that adverbials of the same semantic class in the case at hand location can

b e realized by either a predicate or a sentence adjunct dep ending on their precise content Sentences ac

show the even more interesting fact that two adverbials with the same semantics can coo ccur in the same

sentence as long as one lls a predicate adjunct function and the other lls a sentence adjunct function

That is a is analyzed as

On the platform she kissed Bo on the cheek

Location Agent Process Aected Location

Sentence Adjunct Subject Verb Object Predicate Adjunct

The precise realization and coo ccurrence constraints of each adverbial semantic class are discussed in the

next section which review in turn each class implemented in surge I now explain the general scheme

for the input representation and syntactic realization of adverbials in surge The distinction b etween

Input specification of She kissed Bo on the cheek on the platform

cat clause

process type material lex kiss

partic agent cat personalpronoun gender feminine

affected cat proper lex Bo

predmodif location cat pp

prep lex on

np cat common lex cheek

circum location cat pp

prep lex on

np cat common lex platform

SURGE enrichments encoding the mapping to syntactic roles

oblique partic agent

partic affected

syntroles verb process

subject oblique

subject syntfunct subject

directobject oblique

directobject syntfunct object

endadverbial predmodif location

endadverbial syntfunct predadjunct

endadverbial circum location

endadverbial syntfunct sentadjunct

pattern

dots syntroles subject dots syntroles verb dots syntroles directobject

dots syntroles endadverbial syntroles endadverbial dots

Figure B Thematic to syntactic role mapping for adverbials in surge

disjuncts and adjuncts is purely syntactic and is thus internal to surge However since the distinction

b etween predicate and sentence adjunct is also semantic and dep ends on the particular meaning of the

op enclass words contained in the adverbial eg cheek vs platform in the example ab ove it must b e

enco ded in the input to surge Thus in addition to the toplevel features process partic and circum

of a surge input a surge input also contains a predmodif abbreviation for predicatemo dier

feature whose elements are mapp ed onto predicate adjuncts The elements under circum are in contrast

mapp ed onto either sentence adjuncts or disjuncts dep ending on a combination of semantic lexical and

syntactic constraints

The surge input for sentence a ab ove is shown at the top of Fig B Note that with surge

since all adverbials were app earing under circum in the input and since an FD cannot contain twice the same

attribute at a given level a sentence with two lo cation roles like a could not have b een generated The

way surge maps the four typ es of thematic roles it accepts in input pro cess participants predicate

mo diers and circumstantials onto syntactic roles is shown at the b ottom of the same gure The mapping

of the pro cess and participants remains unchanged from surge However while surge did not map

circumstantials onto any syntactic role surge does map b oth predicatemo diers and circumstantials

on a new class of adverbial syntactic roles

Each of the obligatory syntactic roles that realize participants has a distinct syntactic b ehavior In

systemic linguistic terms they fulll a dierent syntactic function abbreviated syntfunct Subject Verb

etc In contrast there are only three syntactic functions for adverbial syntactic roles PredicateAdjunct

SentenceAdjunct and Disjunct but several adverbial constituent with the same function can coo ccur in a

given sentence Therefore instead of b eing distinguished by syntactic functions as for the obligatory syntactic

roles adverbial syntactic roles are distinguished in term of their p osition in the linear clause pattern

In the current implementation there are seven dierent p ositions available for an adverbial constituent

two frontal p ositions b efore the verb and any of its arguments and ve trailing p ositions after the verb and all

its arguments In English a variety of medial p ositions lo cated b etween two elements of the verb argument

structure are also available alb eit in much more restricted ways than the frontal and trailing p ositions

Deciding on the appropriateness of placing a particular adverbial in one of these medial p osition involves

a complex combination of semantic syntactic and stylistic factors Since in most sentences all adverbials

app ear in either frontal or trailing p ositions I have left study of the complex constraints on medial p osition

placing for future work

To map adverbial thematic roles onto syntactic roles surge op erates in two stages

Taking into account the semantics syntax and lexical content of the input thematic role cho ose a

syntactic function

Taking into account the chosen syntactic function and the other adverbial constituents already mapp ed

onto syntactic roles cho ose one of the adverbial p ositions

All predicatemo diers are mapp ed onto predicateadjuncts For most circumstantials the choice b etween

a sentenceadjunct and a disjunct is entirely predetermined on semantic grounds For example the Time

thematic role can only b e mapp ed onto a sentence adjunct and the Condition thematic role can only b e

mapp ed onto a disjunct For others it dep ends on syntactic constraints For example the Manner thematic

role must b e mapp ed onto a sentence adjunct when lled by an adverb and onto a disjunct when lled by a

PP This choice can also dep end on lexical constraints For example a sub ordinated clause lling the Reason

thematic role must b e mapp ed onto a sentence adjunct when linked to the matrix clause by the sub ordination

conjunction because but onto a disjunct when linked to the matrix clause by the sub ordination conjunction

since Since surge works bidirectionally either side of such constraints can b e sp ecied in the input and

the opp osing side will b e enforced in the output by unication

Once the syntactic function of each adverbial has b een chosen surge then maps each of them onto

sp ecic slots in the precedence pattern of the clause in the following order predicate adjuncts rst then sen

tence adjuncts then disjuncts For a predicate adjunct only the trailing slots are available and the leftmost

available one is taken For a sentence adjunct or a disjunct ve trailing and two frontal slots are available

The innermost available one is chosen where the innermost order is dened as follows

end front end end end end front

This default placement can b e overro de by explicitly restricting a given circumstantial to either frontal

or trailing p osition in the input Each adverbial thematic role present in the input sp ecication is pro cessed

by the algorithm ab ove in the following order

For predicate mo diers Score Manner Instrument Comparison Matter Distance Location Path Origin

Destination Duration

and for circumstantials Manner Accompaniment Opposition Matter Standard Perspective Behalf Dis

tance Location Duration Frequency Time Means Reason Purpose Addition Comparison Concession

Contrast Exception Substitution Condition ConcessiveCondition Inclusion CoEvent Result

Each of these thematic roles is dened and exemplied in the next subsection

B Extending the coverage of adverbial thematic roles

The extended coverage of adverbial thematic roles and their syntactic realization provided by surge is

given in gures B B and B The rst of these three gures contains the thematic roles realizable

by predicate adjuncts the second contains those realizable by sentence adjuncts and the third contains those

realizable by disjuncts Each of these gure is a table with one line for each syntactic realization available

for each semantic typ e of adverbial covered The semantic typ e is sp ecied rst in terms of a general

semantic syntactic Example

class role feature category

Lo cative Lo cation Adverb Bo kissed her there

PP Bo kissed her on the cheek

nite S Bo kissed her where she wanted

verbless S Bo kept the keys where convenient

Direction Adverb Bo slid down

NP Bo drove this way

PP Bo ran up the hill

Destination Adverb Bo went home

PP Bo sent the letter to LA

nite S She sent Bo back where he b elongs

Path PP Bo traveled via Denver

Distance NP Bo came a long way

Temp oral Duration Adverb Bo stayed there forever

NP Bo stayed there a long time

Pro cess Manner PP Bo kiss her with love

Means Adverb Bo was treated surgically

PP Bo was treated by surgery

Instrument p olar PP Bo pushed him with b oth hands

p olar PP Bo negotiated without an agent

Comparison nite S Bo went there as he did yesterday

innitive S Bo played hard as if to send the fans a message

ing S Bo played great as if p eeking on schedule

ed S Bo played hard as if not b othered by his knee

verbless S Bo played great as if in great form

Resp ect Matter PP Bo talked ab out his contract

Domain Score Score New York b eat Indiana

Figure B Predicate Adjuncts in surge

class of semantic relation then a sp ecic thematic role b elonging to that class and nally any nergrained

distinction enco ded as a semantic feature app earing inside the role An example sentence is given for each

p ossible realization with the sentence constituent corresp onding to the input thematic role in b old

The semantictyp e syntactic realization pairs that were already covered by surge are indicated

by a star in the last column of the table Quantitatively the extensions increased the numb er of such pairs

covered by surge from to Qualitatively the extensions acted in four dierent ways First they cover

entirely new semantic classes of adverbials For example there was no conditional thematic role available

in surge Second they added new memb ers of partially covered classes For example the Addition

Opposition Exception Inclusion and Substitution thematic roles were added to the determinative class of

adverbials which initially consisted of only one role Accompaniment Third they completed the set of

syntactic realizations of thematic roles that were only realizable by one or two syntactic categories in surge

For example whereas lo cations could b e expressed only as PPs they can now also b e expressed as

adverbs nite clauses and verbless clauses Fourth they created a comprehensive framework in the light

of which the few adho c treatments that surge contained were revealed and rethought namely the

lo cative and temp oral adverbials as discussed in the paragraph dedicated to them in what follows

There are eight semantic classes of adverbials in surge lo cative temp oral causative conditional

pro cess determinative resp ect and domainsp ecic I review the thematic roles of each of these classes in

turn in the following paragraph

semantic syntactic Example

class role feature category

Lo cative Lo cation Adverb There Bo kissed her

PP On the platform Bo kissed her

Origin PP From Kansas City Bo called Gwen

Distance PP For a few miles the road is damaged

Temp oral Time Adverb Yesterday Bo triumphed

NP Last year Bo triumphed

PP On Monday Bo triumphed

nite S As he received the ball Bo smiled

ing S After receiving the ball Bo smiled

ed S Once injured Bo grimaced

verbless S Once on the o or Bo grimaced

Duration PP For a long time Bo stayed here

nite S Until Bo recovered they waited for him

ing S Since joining the Raiders Bo shines

ed S Until forcibly removed they will stay

verbless S As long as necessary they will stay here

Frequency Adverb Often Bo scored twice

NP Twice a week Bo runs miles

PP On Sundays Bo run miles

Causative Reason PP Because of injury Bo did not play

nite S Because he was injured Bo did not play

Purp ose PP For enough money Bo would play anywhere

nite S So he would get a raise Bo held out

innitive S To gain freeagency Bo held out

Behalf PP For the Raiders Bo scored twice

Determinative Addition ing S In addition to trading Bo they waived Mike

Accomp p olar PP With her Bo would go anywhere

animent p olar Without her Bo wouldnt go anywhere

Opp osition PP Against the Knicks the Nets are

Pro cess Manner Adverb Tenderly Bo kissed her

Means ing S By acquiring Bo they strengthen their defense

Comparison p olar PP Like Mike Bo soared ab ove the rim

p olar PP Unlike Mike Bo can bat

Figure B Sentence Adjuncts in surge

B Lo cative adverbials There are six lo cative adverbial roles in surge Location Direction

Destination Origin Path and Distance They resp ectively answer the questions where in what direction

to where from where throughacross where and how far

Syntactically they can surface as predicate adjuncts sentence adjuncts or b oth They can all coo ccur

in the same clause as in

In France Bo biked south miles along the Rhone from Lyon to the Camargue

Location Direction Distance Path Origin Destination

The distinction among these surge lo cative roles here is semantic as opp osed to the lexically based

distinction used in surge Characterizing lo cative thematic roles by spatial prep ositions is problematic

for two reasons

Most lo cative meaning have alternative nonprep ositional realizations

There are many spatial prep ositions in English and their resp ective meanings overlap in a complex

context sensitive ways Herskovits

semantic syntactic Example

class role feature category

Temp oral CoEvent habit nite S Whenever you hurt call me

Causative habit ing S With his knees hamp ering him Bos defense is sloppy

Blend ed S Injured against the Giants Bo didnt play

verbless S Unable to play due to injury Bo stayed home

habit verbless S Whenever in doubt call me

Causative Reason nite S Since he was injured Bo did not play

Result nite S They wouldnt give him a raise so Bo held out

innitive S They waived Bo only to see him ourish elsewhere

Conditional Condition p olar nite S If he is fully t Bo will play

ed S If wellconditioned Bo will play

verbless S If in great shap e Bo will play

p olar nite S Unless he is fully t Bo wont play

ed S Unless wellconditioned Bo wont play

verbless S Unless in great shap e Bo wont play

Concessive nite S Even if he is fully t Bo wont play

Condition ed S Even if wellconditioned Bo wont play

verbless S Even if in great shap e Bo wont play

Concession PP In spite of his injury Bo played

nite S Although he was injured Bo played

ed S Although injured Bo played

ing S Although hurting Bo played

verbless S Although out of shap e Bo played

Determinative Contrast PP As opp osed to many others Bo never held out

nite S Whereas Mike keeps sho oting Bo prefers passing

Exception PP Except Laettner all Olympian were pros

ing S Except for blo cking shots Bo can do it all

Inclusion PP Bo scored p oints including threep ointers

ing S Bo can do it all including blo cking shots

verbless S Bo played everywhere including in New York

Substitution PP Instead of Mike they picked Bo

ing S Rather than drafting Mike they picked Bo

innitive S Rather than sho ot Bo made the p erfect pass

bare

Addition PP They picked Bo as well as Mike

Resp ect Matter PP Concerning Bo they were wrong

Standard PP For a center Bo is very quick

Persp ective PP As a scorer Bo remains prolic

Figure B Disjuncts in surge

It is thus far preferable to dene thematic roles by conceptual classes of meanings indep endently of any

sp ecic realization or lexical item using coo ccurrence inside the same clause as a dierentiating criteria

The AtLoc InLoc and OnLoc roles of surge all corresp ond to the PP realization of the single Location

role of surge The FromLoc and ToLoc roles of surge resp ectively corresp ond the PP realization

of the Origin and Destination roles in surge surges interpretation of NPs as PPs in the input

sp ecication of adverbials was p ossible only b ecause it was not covering any NP realization of adverbials

surge still cho oses a default prep osition for PP realizations of adverbials but the category sp ecied in

the input for a PP must b e PP and not an NP

B Temp oral adverbials There are three temp oral adverbial roles in surge Time Dura

tion and Frequency They resp ectively answer the questions when how long and how often They can

all coo ccur in the same clause as in

Bo works out four hours every day this month

Duration Frequency Time

The Time role of surge reies the Time and TemporalBackground roles of surge whose sole

distinction app eared to b e that the former were realized by adverbs and PPs and the latter by clauses

B Causative adverbials There are four causative adverbials in surge Reason Purpose

Result and Behalf who all answer the questions why and individually answer the questions for what reason

for what purpose with what result on whose behalf These four thematic roles are mutually exclusive

Result can only b e realized by a clause and only app ear in trailing p osition Behalf can only b e realized by

a PP whose NP complement is animate Reason and Purpose can b oth b e realized by either a clause or a

PP They can b e easily distinguished by the dierent sub ordination conjunctions or prep ositions used to link

their content to the matrix clause since or because and because of for Reason in order to or so

that and for for Purpose

B CoEvent clauses The CoEvent adverbial forms a semantic class on its own since it is used

to link two events related by an unsp ecied relation that blends causative temp oral and sometimes even

lo cative elements It can only b e only realized by clauses though by a variety of clausal forms It can only

ll the disjunct syntactic function I distinguish b etween two typ es of CoEvents habitual ones as in

Whenever there is smoke there is re

and regular ones as in

Injured against the Giants Bo did not play

22

Habitual CoEvent clauses express a recurrent pattern of coo ccurrence whether eective or desired

b etween the event they convey and the event conveyed in the matrix clause They can b e either nite or

verbless

23

In contrast regular CoEvent clauses express a isolated coo ccurrence b etween the event they convey

and the event conveyed in the matrix clause They can b e either participial or verbless

The fact that CoEvent clauses are an unsp ecied causativetemp oral blend is well illustrated by the

following examples

a Before there is smoke there is re

b There is smoke b ecause there is re

a After he got injured against the Giants Bo did not play

b Because he got injured against the Giants Bo did not play

ab ove can b e equally well interpreted as synonymous to the temp oral a and the causative b

Similarly ab ove can b e equally well interpreted as synonymous to the temp oral a and the causative

22

Called contingency clauses in Quirk et al pp

23

Called absolutive clauses by Thompson and Longacre pp and suppletive clauses by Quirk et al

pp

b Their versatility and conciseness make CoEvent clauses a construct of choice in the corp ora of human

written summaries I analyzed

B Conditional adverbials There are three conditional adverbials in surge Condition

ConcessiveCondition and Concession and they can surface only as disjuncts Condition and Concessive

Condition can only b e realized by clauses Concession can also b e realized by a PP The p olarity feature

distinguishes b etween p ositive conditions whose default sub ordinator is if and negative conditions whose

default sub ordinator is unless Further sub divisions of condition clauses such as the predictive vs hypo

thetical vs counterfactual distinctions discussed in Thompson and Longacre interacts tightly with

the mo dality system of the grammar and were thus left for a future round of extension

B Pro cess adverbials There are four pro cess adverbials in surge Manner Means

Instrument and Comparison They all answer the question how and sp ecify in some way the nature of the

pro cess by which the event describ ed in the matrix clause o ccurred They are all adjuncts Dep ending on

the syntactic category that realizes them the Manner Means and Comparison thematic roles can b e either

predicate or sentence adjuncts Instrument can only b e realized by a predicate adjunct PP All four pro cess

adverbials coo ccur in the same clause as in

Shrewdly Bo went home by train with his freeticket like he did yesterday

Manner Means Instrument Comparison

B Resp ect adverbials There are three resp ect adverbials in surge Matter Standard

and Perspective all realizable only by PPs Matter sp ecies the topic which the prop osition conveyed by

the matrix clauses must b e interpreted as referring to Standard sp ecies a comparative reference frame for

evaluating the prop osition conveyed by the matrix clause Perspective sp ecies the particular role or function

of the entity mentioned in the matrix clause that is relevant for the interpretation of the prop osition conveyed

by that clause These three adverbials are mutually exclusive in a given clause

B Determinative adverbials There are seven determinative adverbials in surge Accom

paniment Addition Inclusion Substitution Exception Opposition and Contrast Although several of them

are describ ed by dierent authors there seem to b e very little agreement with resp ect to which semantic

class of adverbial meanings each of these thematic roles b elongs to I decided to group all of them under

the determinative lab el b ecause although they form a semantically diverse set they all function at the

clause rank in a manner similar to a determiner at the nominal rank they circumscrib e the participants of

the matrix clauses in the context of a larger set of domain entities Determinative adverbials can b e realized

by either PPs or clauses lling either a sentence adjunct or disjunct function

B Domainsp ecic adverbials Finally surge incorp orates the Score adverbial which is

sp ecic to the particular sp orts domain of streak This example illustrates that even if comprehensive and

thorough a general p ortable set of thematic roles may still require minor extensions when applied to a new

application domain

B Mo o ds of sub ordinated adverbial clauses

The clausal realizations of the thematic roles of surge includes three typ es of clauses not covered in

surge

Bound adverbial clauses

Verbless clauses

Bare innitive clauses

The mo o d system thus need to b e extended as shown in Fig B

The mo o d bound designates nite sub ordinate clauses Inside the matrix clause these sub ordinate clauses

can ll either a nominal function ie Subject Object or an adverbial one ie Adjunct Disjunct As

shown in Fig B surge covered only the nominal variety of b ound clauses and the nonnite variety of

adverbial clauses In b ound adverbial clauses the sub ject is often control led by either the sub ject or ob ject

of the matrix clause In this context controlled means omitted but understo o d as coreferent Controlled

sub ject in b ound clauses are similar to scop ed sub ject in relative clauses cf section B In surge

they are handled by a path valued feature control indicating which thematic role in the sub ordinate

clause is to b e syntactically omitted The description of this thematic role reduces to a path p ointing to

the controlling thematic role in the matrix When surge sees such paths while p erforming sub jectverb

agreement it then knows that the sub ject must b e gap ed and by following the path chain which matrix

syntactic role contains the relevant features for the agreement An example input of b ound adverbial clause

with controlled sub ject is given in Fig B All the features relevant to agreement are centralized under

the index features

The reason for leaving the sub ject implicit in dep endent clauses is to achieve conciseness There is an

even more concise form of dep endent clauses called verbless clauses in which the verb itself is omitted

Such clauses can in general b e equally well interpreted as realizing dierent pro cess typ es For example the

verbless clause in italics in the table of Fig B can b e equally well interpreted as an abbreviated form of

the clause

Although playing without Ewing

or of the clause

Although they were without Ewing

In the pro cess typ e is material and without Ewing is an accompaniment adjunct In the pro cess

typ e is locative and without Ewing is an accompaniment complement Since each verbless clause has a

valid relational interpretation I treat them as such is surge

The third new mo o d of surge is bareinnitive the variety of innitive clauses that can b e directly

attached to the matrix clauses without the need for the sub ordinating conjunction to

nite b ound adverbial Ewing scored p oints as the Knicks won the game

nonnite verbless Although without Ewing the Knicks won the game

innitive bare Ewing help ed the Knicks win the game

Figure B Extensions to the mo o d system

Dan Majerle made of three point shots to add

cat clause

tense past

process type material effecttype creative lex make

partic agent cat personname firstname lex Dan lastname Majerle

created cat partitive

part cat cardinal value digit yes

partof cat common

cardinal cat cardinal value digit yes

classifier cat measure

quantity cat cardinal value

unit cat noun lex point

head cat noun lex shot

circum result cat clause

mood toinfinitive

process type material effecttype creative lex add

controlled partic agent

partic agent index partic agent index

created cat measure value unit gap yes

Figure B Example input description with controlled sub ject

B Extensions to the nominal subgrammar

The extensions to the nominal subgrammar from surge to surge were sizeable and are summedup

in the table of Fig B They were carried out in two distinct ways denition of new nominal subcategories

and denition of new mo dication relations among existing categories

The three new categories are measure nouncompound and partitive Measure is a category sp ecially

designed for statistics It has two attributes quantity whose value must b e a cardinal and unit whose value

can b e either a noun or a nouncomp ound Measure can app ear b oth as a head or as a classier Noun

comp ound handles deep cases of nounnoun mo dications It can app ear anywhere in place of a noun It is

has two attributes classier and head Each can recursively b e a nouncomp ound Even though classier

and head are two attributes of the full NP this nouncomp ound category is needed b ecause these two feature

24

cannot app ear without a determiner in full a NP The partitive category handles fractional measures and

prop ortions It has two attributes part whose value can b e a cardinal an ordinal a measure or a full NP

and partof whose value can b e either a measure or a full NP It can app ear anywhere instead of a measure

The input sp ecication of a complex nominal example containing sub constituents of the three categories just

dened in given in Fig B Figure B contains another example input of partitive constituent

Two existing categories that were inherently atomic in surge accept mo diers in surge ordinals

and prop er nouns Ordinals can b e mo died either by adjectives cf rst row in Fig B or discontinually

by PPs cf second row in Fig B Prop er nouns can b e mo died by the same range of mo diers than

common nouns In addition they can also have their own internal comp ound structure which in a way

parallels the new nouncomp ound category intro duced for common nouns These two new prop erties of

prop er nouns are illustrated by the example b elow

The surging New York Knicks who have won three straight

Determiner Describer Head Qualier

Article Participle Compound Proper Relative Clause

Home Franchise

B Extensions to the rest of the grammar

In surge paratactic complexes could not b e mo died in any way surge allows conjunctions or

app ositions of clauses to b e mo died by adverbials The adverbial is attached using a predmodif or circum

feature exactly as in the case of a simple clause A subgrammar for dates and one for addresses were also

incorp orated to surge

B Summary

surge extends the wide coverage of surge at the simple clause rank b oth down to the nominalrank

and up to the complex sentence rank The extensions have b een based b oth on corpus data and descriptive

linguistic works They are comprehensive and insure the robustness of surge for any quantitative

domain even those including the most complex and concise constructions of English Beyond streak

surge has b een used for developing generation applications in three other domains and I exp ect it

to so on b ecome the ocial version of surge for distribution of the fufsurge generation environment

package

24

Even though plural count NPs may have an empty determiner cf Fig B

grammatical syntactic example

function category

Ordinal Numeral phrase their third straight victory

discontinuous Numeral phrase their third victory in a row

Classier Noun comp ound a franchise record victory

Measure a p oint p erformance

Head prop er Noun the hapless Denver Nuggets who lost again

Noun comp ound his league season high

Measure a career high p oints

Partitive a season b est of shots

Qualier presentparticiple S a victory invigorating the Knicks

pastparticiple S a victory hardfought until the end

Figure B Extensions to the nominal grammar

a franchise record percent of their three point attempts

cat common

definite no

classifier cat nouncompound

classifier lex franchise

head lex record

head cat partitive

part cat measure

quantity value

unit lex percent

partof cat common

number plural

determiner cat personalpronoun

classifier cat nouncompound

classifier cat measure

quantity value digit no

unit lex point

noun cat nouncompound

classifier lex field

head lex goal

head lex attempt

Figure B Input description of a complex nominal in surge

B Limitations of SURGE

The wide coverage of surge at the four main linguistic ranks determiner nominal simple clause and

complex sentence makes it the most comprehensive p ortable syntactic frontend for generation application in

English available to day It remains however incomplete and requires further extension in a variety of ways

In the following paragraph I overview the most needed extensions in each ma jor system of the grammar

B Transitivity

As explained in Section B there are three main classes of clauses those expressing an event those

expressing a relation and those called comp osite expressing b oth with the event as cause and the relation

as consequent There are also three main classes of events material mental and verbal Although all three

typ es can take part in a comp osite clause only comp osite clauses with a material event comp onent are

covered in the current implementation The rst needed extension of the transitivity system is to cover

comp osite clauses with mental and verbal event comp onents as well

The hierarchy of general pro cess typ es dening the deep argument structure of a clause and the semantic

class of its main verb in the current implementation is a synthesis from Halliday Fawcett and

Lyons This hierarchy is compact and able to cover many clause structures Yet the argument structure

andor semantics of many English verbs do not t neatly in any element of this hierarchy This is why the

lexical pro cesses have b een added to surge as an alternative form of input sp ecication for clause structures

However lexical pro cesses are a shallower and less semantic form of input than the general pro cess typ es

For example the subcategorization constraints and the mapping from the thematic roles to the oblique roles

have to b e sp ecied in the input instead of b eing automatically computed by the grammar Minimizing the

use of this lexicalist approach to infrequent and authentically idiosyncratic clause structures would b e highly

desirable It would require b oth rening and extending the current hierarchy of general pro cess typ es An

excellent starting p oint for carrying out this task is the comprehensive taxonomy of English verbs prop osed

by Levin

B Adverbials

The coverage of the adverbial system in the current implementation is limited in two ways First among

the four typ es of syntactic functions that adverbial constituents can ll in a clause adjuncts disjuncts

conjuncts and sub juncts only the rst two are covered Second the only adverbial semantic roles covered

are those which can b e realized by either a clausal prep ositional or nominal form But for some adverbial

semantic roles the sole p ossible realization is an adverb Those semantic roles are not currently covered

These two limitations are correlated most conjuncts eg alternatively correspondingly or sub juncts eg

political ly cordial ly are adverbs and they convey semantics roles with no alternative syntactic realizations

The grammar provides a default ordering of clause constituents realizing adverbial semantic roles This

default sp ecies for a given set of input roles which will app ear in frontal p osition ie b efore the verb and

any of its arguments and which will app ear in trailing p ositions ie after the verb and all of its arguments

Another limitation of the current adverbial system is that though it allows input sp ecications overriding the

default choice of frontal vs trailing p osition for a given role it do es not allow input sp ecications overriding

the default relative ordering among several frontal or trailing adverbials Also the current grammar considers

only the frontal or trailing p osition and not the many medial p ositions available in clauses with multiple

argument verbs andor multiword verb groups This last ordering limitation is related to the semantic and

syntactic limitations mentioned ab ove medial p ositions are very rarely o ccupied by the clausal prep ositional

or nominal adjuncts and disjuncts They are in contrast the p osition of choice for adverbs functioning as

conjuncts and sub juncts

B Mo o d

Some forms of sub ordinate clauses are not yet covered by the current mo o d system This is the case for

example of the forms b elow where the sub ordinate clause is highlighted in b old

We must nd out how to handle mo o d in this framework

We must discover whether we can handle mo o d in this framework

We must know what they did

Currently the mo o d of a clause is enco ded as a single feature whose values form a complex hierar

chy The multidimensionality of the mo o d variations would b e b etter captured by switching to a multi

feature representation based on the set of mostly orthogonal and binary features finite imperative

question questiontype subordinate subordinatetype embedded subjunctive With such a

set the mo o d of a sub ordinate clause such as

The customer p ersuaded the programmer that there is a bug

which is currently enco ded as mood boundnominaldeclarative

would b ecome enco ded as

mood finite yes

subordinate yes

subordinatetype boundnominal

subjunctive no

Another necessary improvement to the mo o d system is to extend the currently quite limited set of factors

that inuence b oth the choices of binder and the restrictive status in relative clauses

B Voice

In terms of structural alternations aecting the fo cus and ordering of information in the clause the current

grammar covers only the passive and dative transformations The voice system thus needs to b e extended

to include the following alternations as well

Clefting eg Bo runs vs It is Bo who runs

25

Argument fronting eg Bo broke his p ersonal record yesterday vs His p ersonal record Bo broke

yesterday

Itextrap osition eg Whether Bo scores or not wont matter vs It wont matter whether Bo scores

or not

B Nominals

The most urgent improvement of the nominal system is to extend it to cover reexive pronouns Another

improvement would consist of rening the set of nominal functions currently available namely determiner

describer classifier head and qualifier In particular Fries prop oses a more complete set

of functions It also includes detailed constraints on coo ccurrence of several elements lling the same

function with dierent semantics in a given NP Finally the work of Vendler and Levi on

nominalizations and nonpredicative adjectives could serve as a go o d though quite limited in scop e

starting p oint towards the development of a set of thematic roles for a semantic input sp ecication of

nominals paralleling the set of thematic roles already used for the semantic input sp ecication of clauses

25

Distinct from adverbial fronting which the current grammar already handles

B Coverage of SURGE

surge comes with a large set of test inputs that to b e run after each change to the grammar This input

text set systematically prob es each branch of the grammar and many combinations of features from dierent

branches The initial input set was incrementally created over seven years during the development of the

generation systems comet McKeown et al McKeown et al cook Smadja and McKeown

and advisorI I Elhadad b The test inputs of surge are presented in Elhadad b In

this section I present the extension of the input set testing the extensions from from surge to surge

This additional test input set provides a detailed account of the extended coverage of surge

B Test inputs for the extensions at the nominal rank

Simple and complex nominals with quantities

deftest t

points

cat measure

quantity value

unit lex point

deftest t

Five rebounds

cat measure

quantity value

unit lex rebound

deftest t

blocked shots

cat measure

quantity value digit yes

unit lex blocked shot

deftest t

Six point shots

cat common

cardinal value

definite no

classifier cat measure

quantity value digit yes

unit lex point

head lex shot

deftest t

Season high

cat nouncompound

classifier lex season

head lex high

deftest t

The pleasant house property tax office furniture

cat common

describer lex pleasant

classifier cat nouncompound

classifier cat nouncompound

classifier cat nouncompound

classifier lex house

head lex property

head lex tax

head lex office

head lex furniture

deftest t

A season high points

cat common

definite no

classifier cat nouncompound

classifier lex season

head lex high

head cat measure

number plural

quantity value

unit lex point

deftest t

Stocktons season high points

cat common

possessor cat basicproper

lex Stockton

classifier cat nouncompound

classifier lex season

head lex high

head cat measure

quantity value

unit lex point

deftest t

Fourth quarter

cat measure

quantity cat ordinal

value

unit lex quarter

deftest t

fourth quarter points

cat measure

quantity value

unit cat nouncompound

classifier cat measure

quantity cat ordinal

value

unit lex quarter

head lex point

deftest t

A playoff record six fourth quarter three pointers

cat common

definite no

classifier cat nouncompound

classifier lex playoff

head lex record

head cat measure

quantity value

unit cat nouncompound

classifier cat measure

quantity cat ordinal value

unit lex quarter

head lex three pointer

deftest t

The Kingss NBA record th straight road loss

cat common

possessor cat basicproper

lex the Kings

classifier cat nouncompound

classifier lex NBA

head lex record

head cat measure

quantity cat ordinal value

unit cat nouncompound

classifier cat list

distinct cat adj

lex straight

cat noun

lex road

head lex loss

deftest t

The Hawks winners of six straight games and five in a row at the Omni

cat np

complex apposition

distinct

cat basicproper

number plural

lex Hawk

cat common

definite no

number plural

head lex winner

qualifier

cat pp

prep lex of

np cat np

complex conjunction

distinct

cat common

definite no

cardinal value

classifier cat adj lex straight

head lex game

cat common

definite no

cardinal value

head gap yes

qualifier cat list

distinct

cat pp

prep lex in

np cat common

definite no

head lex row

cat pp

prep lex at

np cat common

lex Omni

deftest t

of shots

cat partitive

part value

partof cat common

definite no

cardinal value

head lex shot

deftest t

of shots

cat partitive

part value

partof cat measure

quantity value

unit lex shot

storeplurals percent percent

deftest t

percent of their field goal attempts

cat partitive

part cat measure

quantity value

unit lex percent

partof cat common

possessor cat personalpronoun

person third

number plural

classifier lex field goal

number plural

head lex attempt

deftest t

percent of their field goal attempts

cat partitive

part cat common

definite no

cardinal value

head lex percent

partof cat common

possessor cat personalpronoun

person third

number plural

classifier lex field goal

number plural

head lex attempt

deftest t

His NBA season high

cat common

possessor cat personalpronoun

person third

gender masculine

number singular

classifier lex NBA

head cat nouncompound

classifier lex season

head lex high

deftest t

His NBA season high performance of points

cat common

possessor cat personalpronoun

person third

gender masculine

number singular

classifier cat nouncompound

classifier lex NBA

head cat nouncompound

classifier lex season

head lex high

head lex performance

qualifier cat pp

prep lex of

np cat common

definite no

cardinal value

head lex point

deftest t

His NBA season high performance of points

cat partitive

part cat common

possessor cat personalpronoun

person third

gender masculine

number singular

classifier cat nouncompound

classifier lex NBA

head cat nouncompound

classifier lex season

head lex high

head lex performance

partof cat measure

quantity value

unit lex point

deftest t

of an NBA team record blocked shots

cat partitive

part value digit yes

partof cat common

definite no

classifier cat nouncompound

classifier lex NBA aan an

head cat nouncompound

classifier lex team

head lex record

head cat measure

quantity value

unit lex blocked shot

deftest t

A team record of free throws

cat common

definite no

classifier cat nouncompound

classifier lex team

head lex record

head cat partitive

part value

partof cat measure

quantity value

unit lex free throw

deftest t

percent lowfat

cat ap

classifier cat measure

quantity value digit yes

unit lex percent

head lex lowfat

deftest t

Two heaping soup spoons of grade A percent lowfat pasteurized milk

cat partitive

part cat measure

quantity value

unit cat nouncompound

classifier cat verb

ending presentparticiple

lex heap

head cat nouncompound

classifier lex soup

head lex spoon

partof cat common

countable no

definite no

describer cat list

distinct cat basicproper

lex grade A

cat ap

classifier cat measure

quantity value

digit yes

unit lex percent

head lex lowfat

cat verb

ending pastparticiple

lex pasteurize

head lex milk

deftest t

All of its free throws

cat partitive

total

part value

partof cat common

possessor cat personalpronoun

person third

number singular

gender neuter

head lex free throw

deftest ta

point range

cat common

determiner none

classifier cat measure

quantity value

digit yes

unit lex point

head lex range

deftest t

Five of five from point range

cat partitive

part value

partof cat common

definite no

cardinal value

head gap yes

qualifier cat pp

prep lex from

np cat common

determiner none

classifier cat measure

quantity value

digit yes

unit lex point

head lex range

deftest t

A perfect for from the line

cat common

definite no

describer cat adj lex perfect

head cat partitive

prep lex for

part value

partof cat common

definite no

cardinal value

head gap yes

qualifier cat pp

prep lex from

np cat common

head lex line

deftest t

Stockton scored points

cat clause

tense past

process type material

effecttype creative

lex score

partic agent cat basicproper

lex Stockton

created cat measure

quantity value

unit lex point

deftest t

Stockton scored a season high points

cat clause

tense past

process type material

effecttype creative

lex score

partic agent cat basicproper

lex Stockton

created cat common

definite no

classifier cat nouncompound

classifier lex season

head lex high

head cat measure

quantity value

unit lex point

deftest t

Ewing scored of his points in the first half

cat clause

tense past

process type material

effecttype creative

lex score

partic agent cat basicproper

lex Ewing

created cat partitive

part value

partof cat common

possessor cat personalpronoun

person third

gender masculine

number singular

head cat measure

quantity value

unit lex point

circum inloc cat common

ordinal value

head lex half

deftest t

John Stockton scored points

cat clause

tense past

process type material

effecttype creative

lex score

partic agent cat compoundproper

head cat personname

firstname lex John

lastname lex Stockton

created cat measure

quantity value

unit lex point

deftest t

John Stockton scored a season high points

cat clause

tense past

process type material

effecttype creative

lex score

partic agent cat compoundproper

head cat personname

firstname lex John

lastname lex Stockton

created cat common

definite no

classifier cat nouncompound

classifier lex season

head lex high

head cat measure

quantity value

unit lex point

deftest t

John Stocktons points

cat common

possessor cat compoundproper

head cat personname

firstname lex John

lastname lex Stockton

head cat measure

quantity value

unit lex point

deftest t

John Stocktons season high points

cat common

possessor cat compoundproper

head cat personname

firstname lex John

lastname lex Stockton

classifier cat nouncompound

classifier lex season

head lex high

head cat measure

quantity value

unit lex point

deftest t

The Sacramento Kings NBA record th straight road loss

cat common

possessor cat compoundproper

number plural

head cat teamname

home lex Sacramento

franchise lex King

classifier cat nouncompound

classifier lex NBA

head lex record

head cat measure

quantity cat ordinal value

unit cat nouncompound

classifier cat list

distinct cat adj

lex straight

cat noun

lex road

head lex loss

Simple and complex nominals with prop er names

deftest t

Johnson

cat personname

lastname lex Johnson

deftest t

Earvin

cat personname

firstname lex Earvin

deftest t

Magic

cat personname

nickname lex Magic

deftest t

Earvin Johnson

cat personname

firstname lex Earvin

lastname lex Johnson

deftest t

Magic Johnson

cat personname

nickname lex Magic

lastname lex Johnson

deftest t

Earvin Magic Johnson

cat personname

firstname lex Earvin

nickname lex Magic

lastname lex Johnson

deftest t

Rufus T Firefly

cat personname

firstname lex Rufus

middlename lex T

lastname lex Firefly

deftest t

Ramses II

cat personname

firstname lex Ramses

dynasty cat ordinal value

deftest t

Earvin Johnson Sr

cat personname

dynasty father yes

firstname lex Earvin

lastname lex Johnson

deftest t

Earvin Johnson Jr

cat personname

dynasty father no

firstname lex Earvin

lastname lex Johnson

deftest t

James Baker III

cat personname

firstname lex James

lastname lex Baker

dynasty cat ordinal value

deftest t

Hugo Z Hackenbush XXIII

cat personname

firstname lex Hugo

middlename lex Z

dynasty cat ordinal value

lastname lex Hackenbush

deftest t

Dr Elhadad

cat personname

title lex Dr

lastname lex Elhadad

deftest t

Mr Bulent Fishkin

cat personname

title lex Mr

firstname lex Bulent

lastname lex Fishkin

deftest t

Dr Rufus T Firefly

cat personname

title lex Dr

firstname lex Rufus

middlename lex T

lastname lex Firefly

deftest t

Prof Hugo Z Hackenbush XXIII

cat personname

title lex Prof

firstname lex Hugo

middlename lex Z

dynasty cat ordinal value

lastname lex Hackenbush

deftest ta

Doctor Marshall President

cat list

distinct lex Doctor cat noun

lex Marshall cat noun

lex President cat noun

deftest t

Doctor Marshall President Idi Amin Dada

cat personname

title cat list

distinct lex Doctor cat noun

lex Marshall cat noun

lex President cat noun

firstname lex Idi

middlename lex Amin

lastname lex Dada

deftest t

Dr Rufus T Firefly and Dr Hugo Z Hackenbush

cat np

complex conjunction

distinct cat compoundproper

head cat personname

title lex Dr

firstname lex Rufus

middlename lex T

lastname lex Firefly

cat compoundproper

head cat personname

title lex Dr

firstname lex Hugo

middlename lex Z

lastname lex Hackenbush

deftest t

Mr and Ms Fishkin

cat np

complex conjunction

distinct cat compoundproper

head cat personname

title lex Mr

lastname gap yes

cat compoundproper

head cat personname

title lex Ms

lastname lex Fishkin

deftest t

The great Serigne Suleyman Abdul Aziz Seck

cat compoundproper

describer lex great

head cat personname

title lex Serigne

firstname lex Suleyman

middlename lex Abdul Aziz

lastname lex Seck

deftest t

Denver

cat teamname

home lex Denver

deftest tb

Denver

cat compoundproper

head cat teamname

home lex Denver

deftest tc

Denver

cat basicproper

lex Denver

deftest t

The Nuggets

cat compoundproper

number plural

head cat teamname

franchise lex Nugget

deftest tb

The Nuggets

cat basicproper

number plural

lex Nugget

deftest t

The Denver Nuggets

cat compoundproper

number plural

head cat teamname

home lex Denver

franchise lex Nugget

deftest t

The hapless Denver Nuggets

cat compoundproper

number plural

describer lex hapless

head cat teamname

home lex Denver

franchise lex Nugget

deftest t

The Denver Nuggets who extended their losing streak to five games

cat compoundproper

number plural

qualifier cat clause

process type composite

relationtype locative

lex extend

mood relative

tense past

scope partic agent

partic affected cat common

possessor cat personalpronoun

person third

number plural

classifier cat verb

ending presentparticiple

lex lose

head lex streak

located affected

location cat pp

prep lex to

np cat measure

quantity value

unit lex game

head cat teamname

home lex Denver

franchise lex Nugget

deftest t

The hapless Denver Nuggets who extended their losing streak to six games

cat compoundproper

number plural

describer lex hapless

qualifier cat clause

process type composite

relationtype locative

lex extend

mood relative

tense past

scope partic agent

partic affected cat common

possessor cat personalpronoun

person third

number plural

classifier cat verb

ending presentparticiple

lex lose

head lex streak

located affected

location cat pp

prep lex to

np cat measure

quantity value

unit lex game

head cat teamname

home lex Denver

franchise lex Nugget

deftest t

The Atlanta Hawks winners of six straight games and five in a row at the Omni

cat np

complex apposition

distinct

cat compoundproper

number plural

head cat teamname

home lex Atlanta

franchise lex Hawk

cat common

definite no

number plural

head lex winner

qualifier

cat pp

prep lex of

np cat np

complex conjunction

distinct

cat common

definite no

cardinal value

classifier cat adj lex straight

head lex game

cat common

definite no

cardinal value

head gap yes

qualifier cat list

distinct

cat pp

prep lex in

np cat common

definite no

head lex row

cat pp

prep lex at

np cat common

lex Omni

deftest t

Dr Rufus T Firefly gave Prof Hugo Z Hackenbush his best regards

cat clause

tense past

process type composite

relationtype possessive

lex give

partic agent cat compoundproper

head cat personname

title lex Dr

firstname lex Rufus

middlename lex T

lastname lex Firefly

affected cat compoundproper

head cat personname

title lex Prof

firstname lex Hugo

middlename lex Z

lastname lex Hackenbush

possessor affected

possessed cat common

number plural

possessor cat personalpronoun

person third

number singular

gender masculine

describer lex best

head lex regard

deftest t

The highly respected Dr Rufus T Firefly gave Prof Hugo Z Hackenbush who

received the Turing award yesterday his best regards

cat clause

tense past

process type composite

relationtype possessive

lex give

partic

agent cat compoundproper

describer cat ap

modifier lex highly

head lex respected

head cat personname

title lex Dr

firstname lex Rufus

middlename lex T

lastname lex Firefly

affected cat compoundproper

head cat personname

title lex Prof

firstname lex Hugo

middlename lex Z

lastname lex Hackenbush

qualifier cat clause

tense past

mood relative

restrictive no

voice passive

process type composite

agentive no

relationtype possessive

lex receive

scope partic affected

partic

possessor affected

possessed cat common

classifier cat basicproper

lex Turing

head lex award

circum time cat adv

lex yesterday

possessor affected

possessed cat common

number plural

possessor

cat personalpronoun

semantics index agent semantics index

describer lex best

head lex regard

deftest tbis

The seemingly unstoppable San Antonio Spurs extended their winning streak to

games with a victory over the hapless Denver Nuggets losers of in a row

cat clause

tense past

process type composite

relationtype locative

lex extend

partic

agent cat compoundproper

number plural

describer cat ap

modifier lex seemingly

head lex unstoppable

head cat teamname

home lex San Antonio

franchise lex Spur

affected cat common

possessor cat personalpronoun

number plural

person third

classifier cat verb

ending presentparticiple

lex win

head lex streak

located affected

location cat pp

prep lex to

np cat measure

quantity value

unit lex game

predmodif

instrument

cat pp

np cat common

definite no

classifier cat list

distinct cat measure

quantity value

unit gap yes

cat measure

quantity value

unit gap yes

head lex victory

qualifier

cat pp

prep lex over

np cat np

complex apposition

distinct

cat compoundproper

number plural

describer lex hapless

head cat teamname

home lex Denver

franchise lex Nugget

cat common

definite no

number plural

head lex loser

qualifier

cat pp

prep lex of

np cat common

definite no

cardinal value

head gap yes

qualifier

cat pp

prep lex in

np cat common

definite no

head lex row

deftest tter

The hapless Denver Nuggets losers of in a row

cat np

complex apposition

distinct

cat compoundproper

number plural

describer lex hapless

head cat teamname

home lex Denver

franchise lex Nugget

cat common

definite no

number plural

head lex loser

qualifier

cat pp

prep lex of

np cat common

definite no

cardinal value

head gap yes

qualifier

cat pp

prep lex in

np cat common

definite no

head lex row

deftest tquad

Losers of in a row

cat common

definite no

number plural

head lex loser

qualifier

cat pp

prep lex of

np cat common

definite no

cardinal value

head gap yes

qualifier

cat pp

prep lex in

np cat common

definite no

head lex row

deftest t

San Antonio defeated Denver

cat clause

tense past

process lex defeat

partic agent cat basicproper

lex San Antonio

affected cat basicproper

lex Denver

deftest tb

San Antonio defeated Denver

cat clause

tense past

process lex defeat

partic agent cat compoundproper

head cat teamname

home lex San Antonio

affected cat compoundproper

head cat teamname

home lex Denver

deftest t

The Spurs defeated the Nuggets

cat clause

tense past

process lex defeat

partic agent cat compoundproper

number plural

head cat teamname

franchise lex Spur

affected cat compoundproper

number plural

head cat teamname

franchise lex Nugget

deftest t

My advisor and friend Steve

cat np

complex apposition

restrictive no

distinct

cat common

head cat noun

complex conjunction

distinct lex advisor

lex friend

possessor cat personalpronoun

person first

cat basicproper

lex Steve

deftest t

Three times the butter

cat common

lex butter

countable no

multiplier value

definite yes

deftest t

The man whose hat is on the ground

cat common

lex man

animate yes

qualifier cat clause

mood relative

process type locative

scope partic located possessor

partic located cat common lex hat

location cat pp

prep lex on

np cat common lex ground

deftest t

Bowman whose passing game is exceptional

cat compoundproper

head cat personname

lastname lex Bowman

animate yes

qualifier cat clause

mood relative

restrictive no

process type ascriptive

scope partic carrier possessor

partic carrier cat common

classifier cat verb

ending presentparticiple

lex pass

head lex game

attribute cat ap lex exceptional

deftest t

A win over Denver sending the Nuggets to their seventh straight loss

cat common

definite no

classifier cat score

win value

lose value

head lex win

qualifier

cat list

distinct cat pp

prep lex over

np cat compoundproper

head cat teamname

home lex Denver

cat clause

mood presentparticiple

process type composite

relationtype locative

lex send

controlled partic agent

partic agent index index

located cat compoundproper

number plural

head cat teamname

franchise lex Nugget

affected located

location cat pp

prep lex to

np cat common

possessor cat personalpronoun

index located index

ordinal value

describer lex straight

head lex loss

deftest t

John gives Mary his book

cat clause

proc type composite relationtype possessive

partic agent cat basicproper lex John gender masculine

affected cat basicproper lex Mary gender feminine

possessor affected

possessed cat common

possessor cat personalpronoun

index agent index

head lex book

Dates

deftest t

Friday

cat date dayname lex Friday

deftest t

Friday night

cat date dayname lex Friday daypart lex night

deftest t

June

cat date month lex June year value

deftest t

Friday the th at night

cat date

dayname lex Friday

daynum value

daypart lex night

deftest t

Friday in the evening

cat date

dayname lex Friday

daynum value

month value

year value

daypart lex evening

deftest t

Friday the th of June in the morning

The morning of Friday June th

cat date

dayname lex Friday

daynum value

month lex June

year value

daypart lex morning

deftest t

It happened the morning of Friday June th

It happened Friday the th of June in the morning

cat clause

tense past

process type temporal lex happen

partic located cat personalpronoun

time cat date

dayname lex Friday

daynum value

month lex June

year value

daypart lex morning

deftest ta

Saturday night Karl Malone scored points with his hands

cat clause

tense past

process type material effecttype creative lex score

partic agent cat compoundproper

gender masculine

head cat personname

firstname lex Karl

lastname lex Malone

created cat measure

quantity value

unit lex point

predmodif instrument cat pp

np cat common

number plural

possessor cat personalpronoun

index partic agent index

head lex hand

circum time cat date

dayname lex Saturday

daypart lex night

position header

deftest tb

Karl Malone scored points with his hands Saturday night

cat clause

tense past

process type material effecttype creative lex score

partic agent cat compoundproper

gender masculine

head cat personname

firstname lex Karl

lastname lex Malone

created cat measure

quantity value

unit lex point

predmodif instrument cat pp

np cat common

number plural

possessor cat personalpronoun

index partic agent index

head lex hand

circum time cat date

position end

dayname lex Saturday

daypart lex night

Addresses

deftest t

Visconde de Piraja Street Ipanema

cat address

num value

stname lex Visconde de Piraja

sttype lex Street

aptnum lex

hood lex Ipanema

deftest t

West th Street D Manhattan

cat address

side lex West

num value

stname value

hood lex Manhattan

aptnum lex D

sttype lex Street

deftest t

Michael Elhadad Boulevard NW Bersheva

cat address

num value

stname cat personname

firstname lex Michael

lastname lex Elhadad

sttype lex Boulevard

aptnum lex

quadrant lex NW

city lex Bersheva

deftest t

Pelourinho Bahia Brazil

cat address

hood lex Pelourinho

state lex Bahia

country lex Brazil

deftest t

Rio de Janeiro RJ Brazil

cat address

city lex Rio de Janeiro

state lex RJ

country lex Brazil

deftest t

Visconde de Piraja Street Ipanema Rio de Janeiro RJ Brazil

cat address

num value

stname lex Visconde de Piraja

sttype lex Street

aptnum lex

hood lex Ipanema

city lex Rio de Janeiro

state lex RJ

country lex Brazil

deftest t

Michael Elhadad Boulevard NW Bersheva The Negev Israel

cat address

num value

stname cat personname

firstname lex Michael

lastname lex Elhadad

sttype lex Boulevard

aptnum lex

quadrant lex NW

city lex Bersheva

state lex The Negev

zip value

country lex Israel

deftest t

Faidherbe Prolongee Avenue Medina PO Box Dakar Senegal

cat address

num value

stname lex Faidherbe Prolongee

sttype lex Avenue

hood lex Medina

poboxnum value

city lex Dakar

country lex Senegal

deftest t

San Antonio Texas

cat address city lex San Antonio state lex Texas

deftest t

I live Visconde de Piraja Street Ipanema Rio de Janeiro RJ Brazil

cat clause

process type locative lex live

partic located cat personalpronoun person first

location cat address

num value

stname lex Visconde de Piraja

sttype lex Street

aptnum lex

hood lex Ipanema

city lex Rio de Janeiro

state lex RJ

country lex Brazil

deftest t

In Brazil near Bahia

cat pp

complex apposition

restrictive no

distinct prep lex in

np cat basicproper lex Brazil

prep lex near

np cat basicproper lex Bahia

deftest t

In Bahia and in Rio

cat pp

complex conjunction

distinct prep lex in

np cat basicproper lex Bahia

prep lex in

np cat basicproper lex Rio

B Test inputs for the extensions at the complex sentence rank

Tests location as predmodif PP

direction as adverb

origin destination and distance as PP

path

cooccurrence of all spatial roles

deftest c

In France Bo biked south miles along the Rhone from Lyon to the Camargue

cat clause

tense past

process type material effective no lex bike

partic agent cat basicproper lex Bo

predmodif direction cat adv lex south

distance cat measure

quantity value

unit lex mile

path cat pp

prep lex along

np cat basicproper lex the Rhone

origin cat pp

np cat basicproper lex Lyon

destination cat pp

np cat basicproper lex the Camargue

circum location cat pp

prep lex in

np cat basicproper

lex France

Tests cooccurrence of locations predmodif circum

deftest c

On the platform Bo kissed her on the cheek

cat clause

tense past

process type material lex kiss

partic agent cat basicproper lex Bo

affected cat personalpronoun gender feminine

predmodif location cat pp

prep lex on

np cat common lex cheek

circum location cat pp

prep lex on

np cat common lex platform

Tests location as adverb and finite clause

deftest c

There Bo kissed her where she wanted

cat clause

tense past

process type material lex kiss

partic agent cat basicproper lex Bo

affected cat personalpronoun gender feminine

predmodif

location cat clause

mood boundadverbial

tense past

process type mental lex want transitive no

partic processor cat personalpronoun

index partic affected index

circum location cat adv lex there

Tests duration as measure

frequency as common

time as common

cooccurrence of temporal roles

inclusion as PP

deftest c

This month Bo works out four hours each day including Sundays

cat clause

process type material effective no lex work out

partic agent cat basicproper lex Bo

predmodif duration cat measure

quantity value

unit lex hour

circum frequency cat common

total

number singular

lex day

position end

time cat common distance near lex month position front

inclusion cat pp

np cat common

number plural

definite no

lex Sunday

storeverbs work out works out worked out working out worked out

Tests direction as common

destination as adverb

time as adverb

frequency as adverb

deftest c

Often Bo runs this way home now

cat clause

process type material effective no lex run

partic agent cat basicproper lex Bo

predmodif destination cat adv lex home

direction cat common distance near lex way

circum frequency cat adv lex often

time cat adv lex now

storeverbs run runs ran running ran

deftest cbis

Bo runs home this way often now

cat clause

process type material effective no lex run

partic agent cat basicproper lex Bo

endadverbial cat adv lex home

endadverbial cat common distance near lex way

endadverbial cat adv lex often

endadverbial cat adv lex now

Tests location as verbless clause

time as finite clause

deftest c

As soon as you find it keep it where accessible to all authorized users

cat clause

mood imperative

process type material lex keep

partic agent index circum time partic processor index

affected cat personalpronoun

predmodif

location

cat clause

mood verbless

controlled partic carrier

process type ascriptive

partic carrier index partic affected index

attribute cat ap

head lex accessible

qualifier cat pp

prep lex to

np cat common

number plural

definite no

total

describer lex authorized

head lex user

circum time cat clause

mood boundadverbial

binder lex as soon as

process type mental lex find

partic processor cat personalpronoun person second

phenomenon cat personalpronoun

index partic affected index

Tests direction as PP

distance as common

time as presentparticiple clause

deftest c

After lifting weights Tony biked up the mountain a few miles

cat clause

tense past

process type material effective no lex bike

partic agent cat basicproper lex Tony

predmodif direction cat pp

prep lex up

np cat common lex mountain

distance cat common

lex mile

definite no

orientation

exact no

number plural

degree

circum time cat clause

mood presentparticiple

controlled partic agent

binder lex after

process type material effective no lex lift

partic agent index partic agent index

range cat common

definite no

number plural

lex weight

Tests time as pastparticiple clause

duration as adverb

deftest c

Once refused a new contract Bo held out indefinitely

cat clause

tense past

process type composite relationtype locative lex hold

partic agent cat basicproper lex Bo

located agent

location cat adv lex out

predmodif duration cat adv lex indefinitely

circum time cat clause

mood pastparticiple

controlled partic possessor

binder lex once

process type composite

relationtype possessive

lex refuse

partic possessor index partic processor index

possessed cat common

definite no

describer lex new

head lex contract

affected possessor

storeverbs hold holds held holding held

Tests duration as PP

frequency as PP

deftest c

For two years Bo worked out on Sundays

cat clause

tense past

process type material effective no lex work out

partic agent cat basicproper lex Bo

circum frequency cat pp

position end

prep lex on

np cat common

definite no

number plural

lex Sunday

duration cat pp

position front

np cat measure

quantity value

unit lex year

Tests duration as finite clause

reason as PP

deftest c

Because of his injury Bo did not play until the playoffs started

cat clause

tense past

polarity negative

process type material effective no lex play

partic agent cat basicproper lex Bo gender masculine

circum reason cat pp

np cat common

possessor cat personalpronoun

index partic agent index

lex injury

duration cat clause

tense past

mood boundadverbial

binder lex until

process type material

agentive no

effecttype creative

lex start

partic created cat common

number plural

lex playoff

Tests duration as presentparticiple clause

coevent as presentparticiple clause

presentparticiple clause with subject

deftest c

With his knees hampering him Blackman has not played since becoming a Knick

cat clause

tense presentperfect

polarity negative

process type material effective no lex play

partic agent cat basicproper lex Blackman gender masculine

circum

duration cat clause

mood presentparticiple

controlled partic agent

binder lex since

process type composite relationtype ascriptive lex become

partic agent index partic agent index

carrier agent

attribute cat common

definite no

head lex Knick

coevent cat clause

mood presentparticiple

binder lex with

process type material lex hamper

partic agent cat common

possessor cat personalpronoun

index partic agent index

number plural

lex knee

affected cat personalpronoun

index partic agent index

Tests duration as verbless clause

habitual coevent as finite clause with subject

deftest c

Whenever pain resurfaces rest as long as necessary

cat clause

mood imperative

process type material effective no lex rest

circum duration cat clause

mood verbless

binder lex as long as

process type ascriptive

controlled partic carrier

partic attribute cat ap lex necessary

coevent cat clause

mood boundadverbial

habitual yes

process type material effective no lex resurface

partic agent cat common

countable no

lex pain

Tests coevent as subjectless presentparticiple clause

cat date

cat address

header adverbial

inclusion as presentparticiple clause

deftest c

Utah Karl Malone scored points Saturday

including making of field goals leading the Utah Jazz to a win

over the Los Angeles Clippers

cat clause

tense past

process type material effecttype creative lex score

partic agent cat compoundproper

head cat personname

firstname lex Karl

lastname lex Malone

created cat measure

quantity value

unit lex point

circum

time cat date dayname lex Saturday

location cat address

position header

punctuation after

city lex Salt Lake City

state lex Utah

inclusion cat clause

mood presentparticiple

position end

controlled partic agent

process type material effecttype creative lex make

partic

agent index partic agent index

created cat partitive

part value digit yes

partof cat measure

quantity value

unit lex field goal

coevent

cat clause

mood presentparticiple

position end

controlled partic agent

process type composite

relationtype locative

lex lead

partic

agent index partic agent index

located cat compoundproper

number plural

head cat teamname

home lex Utah

franchise lex Jazz

affected located

location

cat pp

prep lex to

np cat common

definite no

classifier cat score

win value

lose value

head lex win

qualifier

cat pp

prep lex over

np cat compoundproper

number plural

head cat teamname

home lex Los Angeles

franchise lex Clipper

storeplurals Jazz Jazz

Tests coevent as subjectless pastparticiple clause

opposition role

deftest c

Injured against the Giants Bo missed games

cat clause

tense past

process type material effective no lex miss

partic agent cat basicproper lex Bo

range cat measure quantity value unit lex game

circum

coevent

cat clause

mood pastparticiple

controlled partic affected

binder none

process type material lex injure agentive no

partic affected index partic agent index

circum opposition cat pp

np cat compoundproper

number plural

head cat teamname

franchise lex Giant

Tests coevent as verbless clause

toinfinitive clause as adjective qualifier

reason as adjunct finite clause

deftest c

Unable to play because he was injured Bo stayed home

cat clause

tense past

process type composite relationtype locative lex stay

partic agent cat basicproper lex Bo gender masculine

located agent

location cat adv lex home

circum

coevent

cat clause

position front

mood verbless

binder none

controlled partic carrier

process type ascriptive

partic

carrier index partic agent index

attribute

cat ap

head lex unable

qualifier cat clause

mood toinfinitive

controlled partic agent

process type material effective no lex play

partic agent index carrier index

circum reason cat clause

mood boundadverbial

tense past

process type ascriptive

partic carrier cat personalpronoun

index partic located index

attribute cat verb

ending pastparticiple

lex injure

Tests reason as disjunct finite clause

embedded reason

deftest c

Since he was unable to play because of injury Bo stayed home

cat clause

tense past

process type composite relationtype locative lex stay

partic agent cat basicproper lex Bo gender masculine

located agent

location cat adv lex home

circum

reason

cat clause

position front

mood boundadverbial

tense past

binder lex since

process type ascriptive

partic

carrier cat personalpronoun

index partic agent index

attribute

cat ap

head lex unable

qualifier cat clause

mood toinfinitive

controlled partic agent

process type material effective no lex play

partic agent index carrier index

circum reason cat pp

np cat common

denotation illness

lex injury

Tests result as finite clause

deftest c

Bo was unable to play because of injury so he stayed home

cat clause

tense past

process type ascriptive

partic

carrier cat basicproper lex Bo gender masculine

attribute cat ap

head lex unable

qualifier cat clause

mood toinfinitive

process type material effective no lex play

partic agent index carrier index

circum reason cat pp

np cat common

denotation illness

lex injury

circum

result cat clause

mood boundadverbial

binder lex so

tense past

process type composite relationtype locative lex stay

partic agent cat personalpronoun

index partic carrier index

located agent

location cat adv lex home

Tests result as toinfinitive clause

score as circumstance

deftest c

The Utah Jazz defeated the Los Angeles Clippers to extend their winning

streak to games

cat clause

tense past

process type material lex defeat

partic agent cat compoundproper

number plural

head cat teamname

home lex Utah

franchise lex Jazz

affected cat compoundproper

number plural

head cat teamname

home lex Los Angeles

franchise lex Clipper

predmodif score cat score win value lose value

circum result cat clause

mood toinfinitive

controlled partic agent

process type composite

relationtype locative

lex extend

partic agent index partic agent index

affected located

located cat common

possessor cat personalpronoun

index agent index

describer cat verb

ending presentparticiple

lex win

head lex streak

location cat pp

prep lex to

np cat measure

quantity value

unit lex game

Tests purpose as PP

deftest c

The Utah Jazz defeated the Los Angeles Clippers for its th straight

victory

cat clause

tense past

process type material lex defeat

partic agent cat compoundproper

number plural

head cat teamname

home lex Utah

franchise lex Jazz

affected cat compoundproper

number plural

head cat teamname

home lex Los Angeles

franchise lex Clipper

predmodif score cat score win value lose value

circum purpose cat pp

position end

np cat common

possessor cat personalpronoun

index partic agent index

ordinal value

classifier cat adj lex straight

head lex victory

Tests purpose as finite clause

exception role

deftest c

Except Bo all players do commercials so they can make extra money

cat clause

process type material effective no lex do

partic agent cat common

number plural

definite no

total

lex player

range cat common

number plural

definite no

lex commercial

circum exception cat pp np cat basicproper lex Bo

purpose cat clause

mood boundadverbial

binder lex so

epistemicmodality possible

process type material effecttype creative lex make

partic agent cat personalpronoun

index partic agent index

created cat common

countable no

definite no

classifier cat adj lex extra

head lex money

Tests behalf

condition as finite clause

deftest c

If he is percent fit Bo will play for the Raiders on Sunday

cat clause

tense future

process type material effective no lex play

partic agent cat basicproper lex Bo gender masculine

circum behalf cat pp

np cat compoundproper

number plural

head cat teamname

franchise lex Raider

time cat pp

position end

prep lex on

np cat basicproper lex Sunday

condition cat clause

position front

mood boundadverbial

process type ascriptive

partic carrier cat personalpronoun

index partic agent index

attribute cat ap

classifier cat measure

quantity value

unit lex percent

head lex fit

Tests condition as verbless clause

substitution as PP

concession as finite clause

deftest c

Even though he was not a starter at the beginning of the season if fully fit

Bo will start tomorrow instead of Smith

cat clause

tense future

process type material agentive no lex start

partic affected cat basicproper lex Bo gender masculine

circum

concession cat clause

tense past

binder lex even though

polarity negative

process type ascriptive

partic carrier cat personalpronoun

index partic affected index

attribute cat common

definite no

lex starter

circum time cat pp

prep lex at

np cat common

head lex beginning

qualifier cat pp

prep lex of

np cat common

lex season

condition cat clause

position front

mood verbless

process type ascriptive

controlled partic carrier

partic carrier index partic affected index

attribute cat ap

modifier cat adv lex fully

head lex fit

substitution cat pp np cat basicproper lex Smith

time cat adv lex tomorrow

Tests condition as pastparticiple clause

concession as PP

deftest c

If defeated tonight the Raiders will not make the playoffs despite their better

division record

cat clause

tense future

polarity negative

process type material effective no lex make

partic agent cat compoundproper

number plural

head cat teamname

franchise lex Raider

range cat common number plural lex playoff

circum condition cat clause

mood pastparticiple

controlled partic affected

process type material lex defeat

partic affected index partic agent index

circum time cat adv lex tonight

concession cat pp

prep lex despite

np cat common

possessor cat personalpronoun

index partic agent index

describer lex better

classifier lex division

head lex record

Tests time as verbless clause

matter role

manner as adverb

distance as measure

destination as finite clause

substitution as presentparticiple clause

deftest c

Instead of panicking when in doubt concerning his position Bo cautiously

retreated five miles towards where he came from

cat clause

tense past

process type material effective no lex retreat

partic agent cat basicproper lex Bo gender masculine

predmodif distance cat measure

quantity value

unit lex mile

destination cat clause

mood boundadverbial

tense past

binder lex towards where

process type material

effective no

lex come from

partic agent cat personalpronoun

index partic agent index

circum

time cat clause

mood verbless

controlled partic carrier

process type ascriptive

partic carrier index partic agent index

attribute cat pp

prep lex in

np cat common

denotation zeroarticlething

lex doubt

circum matter cat pp

prep lex concerning

np cat common

possessor cat personalpronoun

index partic carrier index

head lex position

manner cat adv lex carefully

substitution cat clause

mood presentparticiple

controlled partic processor

process type mental transitive no lex panic

partic processor index partic agent index

storeverbs come from comes from came from coming from came from

storeverbs panic panics panicked panicking panicked

Tests duration as pastparticiple clause

habitual coevent as verbless clause

manner as PP

means as PP

deftest c

Whenever in doubt resist with obstination until coerced by force

cat clause

mood imperative

process type material effective no lex resist

partic agent cat personalpronoun person second

predmodif manner cat pp

np cat common lex obstination countable no

circum

duration cat clause

position end

mood pastparticiple

binder lex until

process type material lex coerce

controlled partic affected

partic affected index partic agent index

predmodif means cat pp

np cat common lex force

coevent cat clause

position front

mood verbless

habitual yes

process type ascriptive

controlled partic carrier

partic carrier index partic agent index

attribute cat pp

prep lex in

np cat common

denotation zeroarticlething

lex doubt

Tests negative comparison as PP

concessivecondition as finite clause

deftest c

Unlike Barkley Jordan can dominate even if he is not percent fit

cat clause

epistemicmodality possible

process type material effective no lex dominate

partic agent cat basicproper lex Jordan gender masculine

circum comparison cat pp

position front

comppolarity

np cat basicproper

lex Barkley

concessivecondition

cat clause

mood boundadverbial

polarity negative

process type ascriptive

partic carrier cat personalpronoun

index partic agent index

attribute cat ap

classifier cat measure

quantity value

unit lex percent

head lex fit

Tests concessivecondition as presentparticiple clause

positive instrument

contrast as PP

deftest c

As opposed to pure penetrators Jordan can kill you with his three point shot

even if suffering from tendinitis

cat clause

epistemicmodality possible

process type material lex kill

partic agent cat basicproper lex Jordan gender masculine

affected cat personalpronoun person second

predmodif instrument cat pp

np cat common

possessor cat personalpronoun

index partic agent index

classifier cat measure

quantity value

unit lex point

head lex shot

circum contrast cat pp

prep lex as opposed to

np cat common

number plural

definite no

describer lex pure

head lex penetrator

concessivecondition

cat clause

mood presentparticiple

controlled partic processor

process type mental lex suffer from

partic processor index partic agent index

phenomenon cat common

denotation illness

lex tendinitis

storeverbs suffer from suffers from suffered from

suffering from suffered from

Tests concessivecondition as verbless clause

contrast as finite clause

presentparticiple clause as adjective qualifier

deftest c

Even if capable of playing forward Magic was a guard whereas Bird was a true

forward

cat clause

tense past

process type ascriptive

partic carrier cat basicproper lex Magic

attribute cat common definite no lex guard

circum contrast cat clause

tense past

mood boundadverbial

process type ascriptive

partic carrier cat basicproper lex Bird

attribute cat common

definite no

classifier lex true

head lex forward

concessivecondition

cat clause

mood verbless

process type ascriptive

controlled partic carrier

partic

carrier index partic carrier index

attribute

cat ap

head lex capable

qualifier

cat clause

mood presentparticiple

binder lex of

controlled partic agent

process type material effective no lex play

partic agent index partic carrier index

range cat common

denotation zeroarticlething

lex forward

Tests concessivecondition as pastparticiple clause

positive comparison as PP

pastparticiple clause with agent

deftest c

Like Jordan Drexler can kill you from outside even if slowed down by injury

cat clause

epistemicmodality possible

process type material lex kill

partic agent cat basicproper lex Drexler

affected cat personalpronoun person second

circum comparison cat pp np cat basicproper lex Jordan

concessivecondition

cat clause

mood pastparticiple

controlled partic affected

process type material

lex slow down

agentless no

partic agent cat common denotation illness lex injury

affected index partic agent index

location cat pp

prep lex from

np cat common

denotation zeroarticlething

lex outside

storeverbs slow down slows down slowed down slowing down slowed down

Tests concession as pastparticiple clause

positive accompaniment

deftest c

Although not wanted by the Knicks Kimble was traded with Smith for Jackson

cat clause

tense past

process type material lex trade voice passive

partic agent cat compoundproper

number plural

head cat teamname franchise lex Knick

affected cat basicproper lex Kimble

circum accompaniment cat pp

np cat basicproper lex Smith

concession

cat clause

position front

mood pastparticiple

polarity negative

process type mental lex want agentless no

controlled partic phenomenon

partic processor partic agent

phenomenon index partic affected index

purpose cat pp

position end

np cat basicproper lex Jackson

Tests concession as presentparticiple clause

negative accompaniment

comparison as finite clause

deftest c

Although playing without Bo the Raiders won as they did with him

cat clause

tense past

process type material effective no lex win

partic agent cat compoundproper

number plural

head cat teamname

franchise lex Raider

predmodif

comparison

cat clause

tense past

mood boundadverbial

binder lex as

process type material effective no lex do

partic agent cat personalpronoun

index partic agent index

circum

accompaniment

cat pp

np cat personalpronoun

index circum concession circum accompaniment np index

circum concession cat clause

mood presentparticiple

controlled partic agent

process type material effective no lex play

partic agent index partic agent index

circum accompaniment cat pp

accomppolarity

np cat basicproper

gender masculine

lex Bo

storeverbs win wins won winning won

Tests concession as verbless clause

addition as PP

deftest c

Although in need of a center they drafted a guard in addition to Flint

cat clause

tense past

process type composite relationtype possessive lex draft

partic agent cat personalpronoun number plural

possessor agent

possessed cat common definite no lex guard

circum addition cat pp

prep lex in addition to

np cat basicproper lex Flint

concession cat clause

mood verbless

process type ascriptive

controlled partic carrier

partic carrier index partic agent index

attribute cat pp

prep lex in need of

np cat common

definite no

lex center

Tests negative instrument

means as presentparticiple clause and PP

standard and perspective roles

addition as presentparticiple clause

deftest c

They filled all their holes without a draft pick by acquiring Bowman whose passing

game is exceptional for a seven footer through freeagency in addition to trading

for Ashbliss who remains incomparable as a defensive stopper

cat clause

tense past

process type material lex fill

partic agent cat personalpronoun number plural

affected cat common

number plural

possessor cat personalpronoun

index agent index

head lex hole

predmodif instrument cat pp

instrpolarity

np cat common

definite no

classifier lex draft

head lex pick

circum

means

cat clause

position end

mood presentparticiple

process type composite relationtype possessive lex acquire

controlled partic agent

partic

agent index partic agent index

possessor agent

possessed cat compoundproper

head cat personname

lastname lex Bowman

qualifier

cat clause

mood relative

restrictive no

process type ascriptive

scope partic carrier possessor

partic carrier cat common

classifier cat verb

ending presentparticiple

lex pass

head lex game

attribute cat ap lex exceptional

circum standard cat pp

np cat common

definite no

lex seven footer

predmodif means cat pp

prep lex through

np cat common lex free agency

circum

addition

cat clause

position end

mood presentparticiple

controlled partic agent

process type composite relationtype possessive lex trade for

partic

agent index partic agent index

possessor agent

possessed

cat compoundproper

head cat personname

lastname lex Ashbliss

qualifier

cat clause

mood relative

process type ascriptive lex remain

scope partic carrier

partic attribute cat ap lex incomparable

circum

perspective

cat pp

np cat common

definite no

classifier lex defensive

head lex stopper

storeverbs trade for trades for traded for trading for traded for

Tests comparison as verbless clause

means as adverb

deftest c

They treated him homeopathically as if afraid of conventional treatments

cat clause

tense past

process type material lex treat

partic agent cat personalpronoun number plural

affected cat personalpronoun gender masculine

predmodif

means cat adv lex homeopathically

comparison

cat clause

mood verbless

process type ascriptive

controlled partic carrier

partic carrier index partic agent index

attribute cat ap

head lex afraid

qualifier cat pp

prep lex of

np cat common

number plural

definite no

describer lex conventional

head lex treatment

Tests comparison as pastparticiple clause

deftest c

They treated him homeopathically as if frightened by conventional treatments

cat clause

tense past

process type material lex treat

partic agent cat personalpronoun number plural

affected cat personalpronoun gender masculine

predmodif

means cat adv lex homeopathically

comparison

cat clause

mood pastparticiple

process type mental lex frighten agentless no

controlled partic phenomenon

partic phenomenon index partic agent index

processor cat common

number plural

definite no

describer lex conventional

head lex treatment

storeverbs frighten frightens frightened frightening frightened

Tests comparison as presentparticiple clause

deftest c

They treated him homeopatically as if distrusting conventional treatments

cat clause

tense past

process type material lex treat

partic agent cat personalpronoun number plural

affected cat personalpronoun gender masculine

predmodif

means cat adv lex homeopatically

comparison

cat clause

mood presentparticiple

process type mental lex distrust

controlled partic processor

partic processor index partic agent index

phenomenon cat common

number plural

definite no

describer lex conventional

head lex treatment

Tests comparison as toinfinitive clause

pastparticiple clause with subject

deftest c

The analysis completed they treated him homeopatically as if to avoid conventional

treatments

cat clause

tense past

process type material lex treat

partic agent cat personalpronoun number plural

affected cat personalpronoun gender masculine

predmodif

means cat adv lex homeopatically

comparison

cat clause

mood toinfinitive

process type material effective no lex avoid

controlled partic agent

partic agent index partic agent index

range cat common

number plural

definite no

describer lex conventional

head lex treatment

circum coevent cat clause

mood pastparticiple

process type material lex complete

partic affected cat common lex analysis

Alternative analysis tests verbless clause with subject

deftest cbis

The analysis completed they treated him homeopatically as if to avoid conventional

treatments

cat clause

tense past

process type material lex treat

partic agent cat personalpronoun number plural

affected cat personalpronoun gender masculine

predmodif

means cat adv lex homeopatically

comparison

cat clause

mood toinfinitive

process type material effective no lex avoid

partic agent index partic agent index

range cat common

number plural

definite no

describer lex conventional

head lex treatment

circum coevent cat clause

mood verbless

process type ascriptive

partic carrier cat common lex analysis

attribute cat verb

ending pastparticiple

lex complete

Tests presentparticiple clause as noun qualifier

pastparticiple clause as noun qualifier

toinfinitive clause as noun qualifier

deftest c

winning two games will be the team to take the title recognized by

the WSL

cat clause

tense future

process type ascriptive relationmode equative

partic

identified cat common

ordinal value

head lex team

qualifier cat clause

mood presentparticiple

process type material effective no lex win

controlled partic agent

partic agent index identified index

range cat measure

quantity value

unit lex game

identifier

cat common

head lex team

qualifier

cat clause

mood toinfinitive

process type material effective no lex take

controlled partic agent

partic

agent index identifier index

range cat common

head lex title

qualifier

cat clause

mood pastparticiple

process type mental lex recognize agentless no

controlled partic phenomenon

partic processor cat basicproper lex the WSL

phenomenon index range index

Tests fortoinfinitive clause with subject as adverbial

deftest c

I did it for you to graduate

cat clause

tense past

process type material effecttype creative lex do

partic agent cat personalpronoun person first

created cat personalpronoun

circum

purpose cat clause

mood fortoinfinitive

binder none

process type mental transitive no lex graduate

partic processor cat personalpronoun person second

Tests backward compatibility of condition w condrelater

deftest c

If the compartment is dirty then clean it

cat clause

mood imperative

proc type material lex clean

partic affected cat personalpronoun

relaters cond lex then

circum condition cat clause

position front

then yes

proc type ascriptive

partic carrier cat common lex compartment

attribute cat ap lex dirty

Tests backward compatibility of fromloc

deftest c

It comes from the KY

cat clause

proc type material lex come

partic agent cat personalpronoun

circum fromloc cat common lex KY

deftest c

It comes from the KY

cat clause

proc type material lex come

partic agent cat personalpronoun

predmodif origin cat pp

np cat common lex KY

App endix C

Example runs of STREAK

In this app endix I present example runs of the generation system streak Two example runs were already

presented in Section I start by presenting again these two runs more in depth I provide the FDs

enco ding the draft presentation andor the semantic input for those runs and I discuss how they inuence

the application of revision rules I then present three additional runs illustrating other interesting asp ects

of the system namely its paraphrasing p ower b oth at drafttime and revisiontime and the application of

the same revision rules in dierent contexts

C Original example runs with input and draft representation

C Example run

Chaining dierent revision rules to generate a complex sentence

Run shown in Fig C illustrates three asp ects of streak

How it generates a complex lead sentence by rst pro ducing a simple sentence containing only the

obligatory xed facts and then incrementally incorp orating the complementary oating facts through

a series of revisions

How it controls the revision pro cess and decides when to halt it

The variety of revision rules it implements since a dierent one is used at each generation increment

This run was already commented in Section In this section I present this run at further depth by

providing

The input DSS to the drafting pass

The input additional DSS to each revision increment

The internal threelayer representation of the nal draft after the revision pro cess completed

C Building an initial draft

In Fig C the rst line contains calls to the toplevel functions for the draft and revision stages As

input the function draft takes three arguments

The DSS FD containing the input xed facts to convey in the draft In the example of Fig C this

input DSS is built by calling the function dssF The co de of this function is given in Fig C

revise draft dssF sss formflag floatstackF verbose T

Draft

Dallas TX Charles Barkley registered points Friday night as the Phoenix Suns

routed the Dallas Mavericks

Draft lexnum depth

Dallas TX Charles Barkley registered points Friday night as the Phoenix Suns

handed the Dallas Mavericks their th defeat in a row at home

Draft lexnum depth

Dallas TX Charles Barkley registered points Friday night as the Phoenix Suns

handed the Dallas Mavericks their franchise worst th defeat in a row at home

Draft lexnum depth

Dallas TX Charles Barkley registered points and Danny Ainge added Friday

night as the Phoenix Suns handed the Dallas Mavericks their franchise worst th defeat

in a row at home

Draft lexnum depth

Dallas TX Charles Barkley registered points and Danny Ainge came off the bench

to add Friday night as the Phoenix Suns handed the Dallas Mavericks their franchise

worst th defeat in a row at home

Draft lexnum depth

Dallas TX Charles Barkley matched his season record with points and Danny Ainge

came off the bench to add Friday night as the Phoenix Suns handed the Dallas

Mavericks their franchise worst th defeat in a row at home

Draft lexnum depth

SSS DSS

Figure C Applying a dierent rule at each revision increment

A partial SSS constraining the form of the output draft by sp ecifying some desired features in the

initial draft plan The phrase planner can only cho ose options that are compatible with this partial

sp ecication of its output This argument is optional and intro duced by the keyword sss These

surface form constraints are generated by calls to functions such as formflag shown in Fig C

A partial DGS constraining the form of the output draft by sp ecifying some desired features in the

initial draft skeletal lexicosyntactic tree The lexicalizer can only cho ose options that are compatible

with this partial sp ecication of its output This argument is optional and intro duced by the keyword

1

dgs

draft returns a threelayer FD enco ding the initial draft and as a side eect prints the natural language

sentence resulting from the unication of this FD with surge

The function revise takes three arguments

A threelayer FD enco ding the initial draft

A list of lists of oating facts to attempt to incorp orate to the draft Each sublist contains a set of

related oating facts These sublists are ordered by decreasing imp ortance of the facts they contain

Floating fact lists of lists are generated by calls to functions such as floatstackF shown at the top

of Fig C The facts contained in this list of list are shown in gures C to C

The verbose ag which when set to T prints after each revision increment the lexical length lex

num and syntactic depth depth of the revised draft

As explained in Section an initial draft contains four xed facts the main statistic of a player from

the winning team the result of the game including its nal score its lo cation and its date Following

the corpus observations streak always conveys the lo cation as a header The sentence itself thus contains

three main constituents one for each of the three remaining xed facts The form ag sp ecies the typ e of

rhetorical relations that the phrase planner can use to group these three constituents formflag sp ecies

that

The draft must b e hyp otactically structured indicated by the presence of a rels feature at the top

level

Two dep endent constituents must b e group ed in an paratactic structure itself linked to the main

constituent by a temp oral relation indicated by the presence of a feature time elt under

the rels feature

The second element in this paratactic structure must b e itself hypotactical ly structured indicated by

the emb edded struct hypotax feature

When unied with the p ossible toplevel draft structures observed in the corpus and enco ded in the

phrase planner of streak this sp ecication results in sentences like Draft at the top of Fig C In such

sentences the main clause conveys the main statistic with a list of two dep endent constituents as a temp oral

adjunct The rst element of this list is the nominal conveying the date of the game and the second is the

clause conveying the game result

formflag sp ecies that

The draft must b e hyp otactically structured indicated by the presence of a rels feature at the top

level

Two dep endent constituents must b e group ed in an paratactic structure itself linked to the main

constituent by a temp oral relation indicated by the presence of a feature time elt under

the rels feature

The second element in this paratactically structure must b e structured as an event indicated by the

emb edded struct event feature

1

The two optional arguments sss and dgs provide a basic facility for systematically testing the paraphrasing p ower of

streak The development of more p owerful facilities has b een left for future work and is discussed in Section

Surface form constraint forcing the drafter to

attach the game result to the main game statistic as a time adverbial

realize the game result by a full verb clause

eg as WINNER defeated LOSER SCORE

defun formflag setf formflag rels cooccur none

time elts cdr car struct hypotax

Surface form constraint forcing the drafter to

attach the game result to the main game statistic as a time adverbial

realize the game result by a support verb clause

eg while WINNER claimed a SCORE win over LOSER

defun formflag setf formflag rels cooccur none

time elts cdr car struct event

Surface form constraint forcing the drafter to

attach the game result as a coevent subordinate clause

eg leading WINNER to a SCORE win over LOSER

defun formflag setf formflag rels time elts none

Figure C Form ags used in the example runs

It thus shares the two rst constraints with formflag and diers in the third In contrast formflag

sp ecies the converse of the rst two constraints of formflag and formflag ie that the draft must

b e either paratactically structured or if hyp otactically structured then the dep endent constituent must

not b e linked to the head by a temp oral relation as indicated by the none metavalue under the path

rels time elts

C Revising the initial draft

The rst subelement in floatstackF is an historical background fact of typ e streak extension It notes

that the rep orted game marks the th time that Dallas is defeated on its home turf To add this fact to

Draft streak uses the nonmonotonic Nominalization revision rule whose co de was given and explained

in detail in Section This rule is applied to the initial draft game result clause the Phoenix Suns

routed the Dal las Mavericks It replaces the full verb clause pattern WINNER rout LOSER conveying the

game result in Draft by the semantically equivalent supp ort verb clause pattern WINNER hand LOSER

a defeat in Draft Since the game result is now realized by an NP the expression of its consequence the

up dated length of Dallas losing streak can b e concisely conveyed by adjoining the discontinuous ordinal

th in a row to the NP head defeat The restriction of this streak to home games is conveyed

by adjoining another mo dier the PP qualier at home The typ e of nominalization rule used for this

revision is thus Nominalization with Ordinal and Qualifier Adjoin

The second subelement in floatstackF is another historical background fact of typ e record breaking

It brings additional information ab out the preceding streak extension fact by noting that as a result of this

latest extension this streak is now of record length To add this fact to Draft streak uses the monotonic

revision rule Adjoin of Classifier to Nominal This rule is applied on the very nominal that was created

during the preceding nominalization revision their th defeat in a row at home mo difying it with the

classier franchise worst that expresses its record breaking nature This illustrates how the choice of a

revision rule for a given increment constrains in some cases the range of choices for subsequent increments

This typ e of Adjoin revision rule allows for a very concise expression of the added oating fact It do es not

change the syntactic depth of the draft and lengthens it by only two words After this addition the draft is

only words long still comfortably b elow the word limit observed in the corpus

The third subelement in floatstackF is a nonhistorical fact of typ e additional statistic To add this

defun dssF

Dallas TX Charles Barkley scored points Friday night as the Phoenix

Suns defeated the Dallas Mavericks

setf dssF

deepsemcat entity

concept game token phoatdal

attrs setting ents addr deepsemcat entity

concept address token dallastx

attrs city Dallas state TX

date deepsemcat entity

concept date token frinite

attrs dayname Friday daypart night

stats ents statca deepsemcat entity

concept player token barkley

attrs firstname Charles

lastname Barkley

stat deepsemcat entity

concept gamestat token barkleyptatdal

attrs value unit pt

rels stat deepsemcat relation

role gamestatrel token barkleyscoringatdal

args carrier ents statca

stat ents stat

results ents host deepsemcat entity

concept team token mavs

attrs home Dallas franchise Maverick

visitor deepsemcat entity

concept team token suns

attrs home Phoenix franchise Sun

score deepsemcat entity

concept score token phoatdalscore

attrs win lose

rels winner deepsemcat relation

role winner token phoatdalwinner

args game toplevel

winner ents visitor

loser deepsemcat relation

role loser token phoatdalloser

args game toplevel

loser ents host

result deepsemcat relation

role beat token phoatdalresult

args winner ents visitor

loser ents host

Figure C Input containing the four xed facts for Draft in Run

defun floatstackF setf floatstackF adssF adssF

adssF adssF

adssF

adssF

adssF

adssF

defun adssF

Dallas defeat at the hand of Phoenix was its th straight at home

setf adssF

deepsemcat entity

concept game token phoatdal

attrs results ents streak deepsemcat entity

concept streak token dalstreakvspho

attrs card

streakgenelt deepsemcat entity concept game

rels streakgenelt deepsemcat relation

role genelt token dalstreakvsphogenelt

args set ents streak

genelt ents streakgenelt

streakgeneltloser

deepsemcat relation

role loser token dalstreakvspholoser

args game ents streakgenelt

loser deepsemcat entity

concept team token mavs

attrs home Dallas franchise Maverick

streakgenelthost

deepsemcat relation

role host token dalstreakvsphohost

args game ents streakgenelt

host deepsemcat entity

concept team token mavs

attrs home Dallas franchise Maverick

streakext deepsemcat relation

role streakextension token dalstreakvsphoext

args extension toplevel

streak ents streak

Figure C List of oating fact lists in Run with rst subelement

defun adssF

This th straight home defeat by Dallas is a franchise record

setf adssF

deepsemcat entity

concept game token phoatdal

attrs results

ents histostreak deepsemcat entity

concept histostreak token dalstreakvsphorefset

hostlifetime deepsemcat entity concept duration token dallifetime

histostreakgenelt deepsemcat entity concept streak

histostreakgeneltgenelt deepsemcat entity concept game

histostreakextrcard

deepsemcat entity

concept integer token dalstreakvsphorefsetextrcard

rels hostlifetime deepsemcat relation

role lifetime token dallifetimerel

args entity deepsemcat entity

concept team token mavs

lifetime ents hostlifetime

histostreakduration

deepsemcat relation

role duration token dalstreakvsphorefsetdurationrel

args entity ents histostreak

duration ents hostlifetime

histostreakgenelt deepsemcat relation

role genelt token dalvsphorefsetgenelt

args set ents histostreak

genelt ents histostreakgenelt

histostreakgeneltgenelt

deepsemcat relation

role genelt

args set ents histostreakgenelt

genelt ents histostreakgeneltgenelt

histostreakgeneltgeneltloser

deepsemcat relation

role loser token dalvsphorefsetgeneltgeneltloser

args game ents histostreakgeneltgenelt

team deepsemcat entity concept team token mavs

histostreakgeneltgenelthost

deepsemcat relation

role host token dalvsphorefsetgeneltgenelthost

args game ents histostreakgeneltgenelt

team deepsemcat entity concept team token mavs

histostreakextrcard deepsemcat relation

role maxcard

token dalstreakvsphorefsetextrcardrel

args card ents histostreakextrcard

set ents histostreak

histostreakupdate

deepsemcat relation

role token dalbeatfranchiserecordstreakvspho

args streaklen

histostreaklenextr ents histostreakextrcard

Figure C Second oating fact in Run

defun adssF

Danny Ainge scored points

setf adssF

deepsemcat entity

concept game token phoatdal

attrs stats ents statca deepsemcat entity

concept player token ainge

attrs firstname Danny lastname Ainge

stat deepsemcat entity

concept gamestat

token aingeptatdal

attrs value unit pt

rels stat deepsemcat relation

role gamestatrel token aingescoringatdal

args carrier ents statca

stat ents stat

defun adssF

Danny Ainge is a reserve player

setf adssF

deepsemcat entity

concept game token phoatdal

attrs stats ents statcastatus deepsemcat entity

concept reserve token aingereserve

statca deepsemcat entity

concept player token ainge

attrs firstname Danny lastname Ainge

rels statcastatus deepsemcat relation

role playerstatus token aingestatus

args player ents statca

status ents statcastatus

Figure C Third and fourth oating facts in Run

defun adssF

The points performance by Charles Barkley tied his season high

setf adssF

deepsemcat entity

concept game token phoatdal

attrs stats ents histostat deepsemcat entity

concept histostat token barkleyptatdalrefset

histostatduration deepsemcat entity

concept season

token barkleyptatdalrefsetduration

histostatgenelt deepsemcat entity

concept gamestat

attrs unit pt

histostatgeneltca deepsemcat entity

concept player token barkley

attrs firstname Charles

lastname Barkley

histostatextr deepsemcat entity

concept integer

token barkleyptatdalrefsetextr

rels histostatduration deepsemcat relation

role duration

token barkleyptatdalrefsetdurationrel

args set ents histostat

duration histostatduration

histostatgenelt deepsemcat relation

role genelt

token barkleyptatdalrefsetgenelt

args set ents histostat

genelt ents histostatgenelt

histostatgeneltca

deepsemcat relation

role gamestatrel token barkleyptatdalrefsetgeneltca

args carrier ents histostatgeneltca

stat ents histostatgenelt

histostatextr deepsemcat relation

role maxval

token barkleyptatdalrefsetextrrel

args extrval ents histostatextr

set ents histostat

histostatupdate

deepsemcat relation

role token barkleytieptseasonhighatdal

args statval

histostatextr ents histostatextr

Figure C Fifth oating fact in Run

defun adssF

Ainge had assists

setf adssF

deepsemcat entity

concept game token phoatdal

attrs stats ents statca deepsemcat entity

concept player token ainge

attrs firstname Danny lastname Ainge

stat deepsemcat entity

concept gamestat token aingeastatdal

attrs value unit ast

rels stat deepsemcat relation

role gamestatrel token aingepassingatdal

args carrier ents statca

stat ents stat

defun adssF

Barkley grabbed rebounds

setf adssF

deepsemcat entity

concept game token phoatdal

attrs stats ents statca deepsemcat entity

concept player token barkley

attrs firstname Charles lastname Barkley

stat deepsemcat entity

concept gamestat token barkleyrebatdal

attrs value unit reb

rels stat deepsemcat relation

role gamestatrel token barkleyreboundingatdal

args carrier ents statca

stat ents stat

defun adssF

Majerle scored points

setf adssF

deepsemcat entity

concept game token phoatdal

attrs stats ents statca deepsemcat entity

concept player token majerle

attrs firstname Dan lastname Majerle

stat deepsemcat entity

concept gamestat token majerleptatdal

attrs value unit pt

rels stat deepsemcat relation

role gamestatrel token majerlescoringatdal

args carrier ents statca

stat ents stat

Figure C Sixth seventh and eigth oating facts in Run

deepsemcat entity

concept game token phoatdal

attrs setting ents addr deepsemcat entity

concept address token dallastx

attrs city dallas state tx

date deepsemcat entity

concept date token frinite

attrs dayname friday daypart night

stats ents statca deepsemcat entity

concept player token barkley

attrs firstname charles lastname barkley

stat deepsemcat entity concept gamestat token barkleyptatdal

attrs value unit pt

statca deepsemcat entity

concept player token ainge

attrs firstname danny lastname ainge

stat deepsemcat entity

concept gamestat token aingeptatdal

attrs value unit pt

statcastatus deepsemcat entity

concept reserve token aingereserve

histostat deepsemcat entity

concept histostat token barkleyptatdalrefset

histostatduration deepsemcat entity

concept season

token barkleyptatdalrefsetduration

histostatgenelt deepsemcat entity

concept gamestat attrs unit pt

histostatgeneltca deepsemcat entity

concept player token barkley

attrs firstname charles

lastname barkley

histostatextr deepsemcat entity

concept integer

token barkleyptatdalrefsetextr

Figure C DSS layer of nal draft in Run Continued next page

rels stat deepsemcat relation

role gamestatrel

token barkleyscoringatdal

args carrier attrs stats ents statca

stat attrs stats ents stat

stat deepsemcat relation

role gamestatrel token aingescoringatdal

args carrier attrs stats ents statca

stat attrs stats ents stat

statcastatus deepsemcat relation

role playerstatus token aingestatus

args player attrs stats ents statca

status attrs stats ents statcastatus

histostatduration

deepsemcat relation

role duration token barkleyptatdalrefsetdurationrel

args set attrs stats ents histostat

duration attrs stats histostatduration

histostatgenelt

deepsemcat relation

role genelt token barkleyptatdalrefsetgenelt

args set attrs stats ents histostat

genelt attrs stats ents histostatgenelt

histostatgeneltca

deepsemcat relation

role gamestatrel token barkleyptatdalrefsetgeneltca

args carrier attrs stats ents histostatgeneltca

stat attrs stats ents histostatgenelt

histostatextr deepsemcat relation

role maxval token barkleyptatdalrefsetextrrel

args extrval attrs stats ents histostatextr

set attrs stats ents histostat

histostatupdate

deepsemcat relation

role token barkleytieptseasonhighatdal

args statval

histostatextr attrs stats ents histostatextr

Figure C DSS layer of nal draft in Run Continued from previous page and continued next page

results ents host deepsemcat entity concept team token mavs

attrs home dallas franchise maverick

visitor deepsemcat entity concept team token suns

attrs home phoenix franchise sun

score deepsemcat entity concept score token phoatdalscore

attrs win lose

streak deepsemcat entity concept streak

token dalstreakvspho attrs card

streakgenelt deepsemcat entity concept game

histostreak deepsemcat entity concept histostreak

token dalstreakvsphorefset

hostlifetime deepsemcat entity concept duration

token dallifetime

histostreakgenelt deepsemcat entity concept streak

histostreakgeneltgenelt deepsemcat entity concept game

histostreakextrcard

deepsemcat entity concept integer

token dalstreakvsphorefsetextrcard

rels winner deepsemcat relation

role winner token phoatdalwinner

args game toplevel

winner attrs results ents visitor

loser deepsemcat relation

role loser token phoatdalloser

args game toplevel

loser attrs results ents host

result deepsemcat relation

role beat token phoatdalresult

args winner attrs results ents visitor

loser attrs results ents host

streakgenelt

deepsemcat relation role genelt token dalstreakvsphogenelt

args set attrs results ents streak

genelt attrs results ents streakgenelt

streakgeneltloser

deepsemcat relation role loser token dalstreakvspholoser

args game attrs results ents streakgenelt

loser deepsemcat entity concept team token mavs

attrs home dallas franchise maverick

streakgenelthost

deepsemcat relation

role host token dalstreakvsphohost

args game attrs results ents streakgenelt

host deepsemcat entity concept team token mavs

attrs home dallas franchise maverick

streakext deepsemcat relation

role streakextension token dalstreakvsphoext

args extension toplevel

streak attrs results ents streak

Figure C DSS layer of nal draft in Run Continued from previous page and continued next page

hostlifetime deepsemcat relation

role lifetime token dallifetimerel

args entity deepsemcat entity

concept team token mavs

lifetime attrs results ents hostlifetime

histostreakduration

deepsemcat relation

role duration token dalstreakvsphorefsetdurationrel

args entity attrs results ents histostreak

duration attrs results ents hostlifetime

histostreakgenelt

deepsemcat relation

role genelt token dalvsphorefsetgenelt

args set attrs results ents histostreak

genelt attrs results ents histostreakgenelt

histostreakgeneltgenelt

deepsemcat relation

role genelt

args set attrs results ents histostreakgenelt

genelt attrs results ents histostreakgeneltgenelt

histostreakgeneltgeneltloser

deepsemcat relation

role loser token dalvsphorefsetgeneltgeneltloser

args game attrs results ents histostreakgeneltgenelt

team deepsemcat entity concept team token mavs

histostreakgeneltgenelthost

deepsemcat relation

role host token dalvsphorefsetgeneltgenelthost

args game attrs results ents histostreakgeneltgenelt

team deepsemcat entity concept team token mavs

histostreakextrcard

deepsemcat relation role maxcard

token dalstreakvsphorefsetextrcardrel

args card attrs results ents histostreakextrcard

set attrs results ents histostreak

histostreakupdate

deepsemcat relation

role token dalbeatfranchiserecordstreakvspho

args streaklen

histostreaklenextr

attrs results ents histostreakextrcard

Figure C DSS layer of nal draft in Run Continued from previous page

surfsemcat rhetor struct hypotax

root surfsemcat rhetor struct paratax rel teammate

elts car surfsemcat rhetor struct hypotax

root surfsemcat encyclo onto event concept

args agent surfsemcat encyclo onto indiv concept player

names firstname charles lastname barkley

affected surfsemcat rhetor struct hypotax

root surfsemcat encyclo

concept maxval onto indiv

rels duration surfsemcat encyclo

onto indiv

concept season

rels instrument surfsemcat encyclo onto quantity concept gamestat

restrictors value unit pt

cdr car surfsemcat rhetor struct hypotax

root surfsemcat rhetor struct event root transfloc

args agent surfsemcat encyclo onto indiv concept player

names firstname danny

lastname ainge

located root args agent

location surfsemcat encyclo

onto place concept reserve

rels result surfsemcat encyclo onto event concept gamestatrel

args agent surfsemcat encyclo onto indiv

concept root args agent concept

names root args agent names

created surfsemcat encyclo

onto quantity concept gamestat

restrictors value

unit pt

cdr none

rels location surfsemcat encyclo onto place concept address

restrictors city dallas state tx

Figure C SSS layer of nal draft in Run Continued next page

time surfsemcat rhetor struct paratax rel temporalinclusion

elts car surfsemcat encyclo concept date onto time

restrictors dayname friday daypart night

cdr car surfsemcat rhetor struct hypotax

root

surfsemcat rhetor

struct event root transfposs

args agent surfsemcat encyclo

onto indiv concept team ref full

names home phoenix

franchise sun

affected surfsemcat encyclo

onto indiv concept team ref full

names home dallas

franchise maverick

possessed

surfsemcat rhetor struct hypotax

root

surfsemcat rhetor

struct hypotax

root surfsemcat encyclo

onto indiv concept loser

rels possessor fullref

ordinal

surfsemcat rhetor struct hypotax

root surfsemcat encyclo

onto quantity

concept integer

restrictors value

rels cardof surfsemcat encyclo

onto place

concept streak

location surfsemcat encyclo

onto place concept host

compar

surfsemcat rhetor

struct hypotax

root surfsemcat encyclo

onto indiv concept maxcard

rels duration

surfsemcat encyclo

onto indiv

concept lifetime

restrictors

scope

surfsemcat encyclo

onto indiv

concept team

ref empty

rels score surfsemcat encyclo

onto quantity concept score

restrictors win lose

cdr none

Figure C SSS layer of nal draft in Run Continued from previous page

cat clause complex conjunction

distinct car cat clause tense past

proc type material lex tie

partic agent cat compoundproper

head cat personname

firstname lex charles

lastname lex barkley

affected cat common

classifier cat noun

syntfunct classifier lex season

possessor cat personalpronoun gender masculine

head lex record

predmodif instrument cat pp

np cat measure

quantity value

unit lex point

cdr car cat clause tense past

proc type composite relationtype locative

effective no lex come

partic agent cat compoundproper

head cat personname

firstname lex danny

lastname lex ainge

located distinct cdr car partic agent

location cat pp prep lex off

np cat common lex bench

circum

result cat clause tense past mood toinfinitive

proc type material effecttype creative lex add

partic agent distinct cdr car circum result controlled

created cat measure

quantity value

unit gap yes lex point

punctuation before none

Figure C DGS layer of nal draft in Run Continued next page

circum location cat address city lex dallas state lex tx position header

time cat list

distinct

car cat date dayname lex friday daypart lex night

cdr car cat clause tense past

mood boundadverbial position end binder lex as

proc type composite relationtype possessive lex give

partic agent cat compoundproper

head cat teamname

home lex phoenix

franchise lex sun

number plural

affected cat compoundproper

head cat teamname

home lex dallas

franchise lex maverick

number plural

possessor circum time distinct cdr car partic affected

possessed cat common

possessor cat personalpronoun

number plural

head cat measure

quantity cat compoundordinal

numeral cat ordinal

value

complement

cat pp

prep lex in

np cat common

definite no

lex row

unit lex defeat

qualifier

cat pp

prep lex at

np cat common

determiner none lex home

classifier

cat nouncompound

syntfunct classifier

classifier cat noun

syntfunct classifier

lex franchise

head lex worst

predmodif score cat score

win value lose value

cdr none

Figure C DGS layer of nal draft in Run Continued from previous page

fact to Draft streak uses the monotonic revision rule Coordinative Conjoin of Clause This rule was

applied to the main statistic clause Charles Barkley registered points b ecause this additional statistic

also concerns a player of the winning team Furthermore since they are b oth scoring p erformances streak

exploits this fact and cho oses to elide the head of the ob ject in the added conjoined clause resulting in the

phrase Danny Ainge added instead of Danny Ainge added points This illustrates how streak

opp ortunistically takes advantage of the particular draft context into which a oating fact is woven to cho ose

a more concise expression for that fact streak also uses this context to make the most appropriate lexical

choice as illustrated by the choice of the verb to add for this second statistic Such a verb can b e chosen

only in this particular context It would b e inappropriate for example to realize the main statistic for which

streak chose the more general verb to register in this particular run The co de for the choice of such

verbs was given in Section How the reviser passes such contextual information to the lexicalizer was

also explained in that section

The fourth subelement in floatstackF is nonhistorical fact of typ e player status Just as the second

oating fact underlined the signicance of the rst by conveying its record breaking nature this fourth fact

underlines the signicance of the third It notes that the player whose scoring statistic was just added to

the draft Danny Ainge added is a reserve player making the fact that he scored that many p oints

all the more remarkable To add this fact to Draft streak uses the monotonic revision rule Absorb

of Clause into Clause as Result Moreover it uses the sp ecialization of this revision rule that involves

the side transformation Agent Control This sp ecialization is chosen when streak notices that b oth the

absorb ed and absorbing clauses share the same agent It allows the agent of the absorb ed one which was

part of the original draft to b e deleted resulting in Danny Ainge came o the bench to add instead

of Danny Ainge came o the bench for Danny Ainge to add thus opp ortunistically gaining space and

uency This example illustrates the capability of streak to use idioms such as the expression to come o

the bench which conveys that a player is a reserve

The fth subelement in floatstackF is a historical fact of typ e record equalling It concerns the

main statistic To add this fact to Draft streak uses the revision rule Adjunctization of Created

into Instrument It moves the ob ject of the main statistic clause that was lling the Created role in that

2

clause to an Instrument role in order to accommo date the added record as ob ject The equalling asp ect

of this record is expressed as the new main verb to match replacing the original verb to register The

action explicitly conveyed by this original verb is now implicitly conveyed by to match since matching

a record can only come as a consequence of a p erformance This example thus illustrates the ability of

streak to opp ortunistically take advantage of the addition of a new fact to gain space by making implicit

part of the realization of another fact already in the draft It also demonstrates how streak takes into

account stylistic conventions observed in the corpus Compare the addition of this fth oating fact with

the addition of the second one They b oth concern a record the dierence b etween them b eing that the

second fact expresses that a record was broken and the fth one that it was merely equal led This dierence

which could seem minor at rst triggers the use of entirely dierent revision rules the monotonic Adjoin

for the second fact and the nonmonotonic Adjunctization for the fth one This dierence in strategies

implements the stylistic convention observed among sp orts or sto ck market writers that mentioning of a

record up date event without explicitly sp ecifying whether it is of the breaking or equalling typ e implies that

it is of the breaking typ e This convention allows streak to use the simple and concise revision rule Adjoin

for record breaking events note how nothing in Draft sp ecies whether the th defeat of Dallas actually

breaks or merely equals their longest losing streak Using such an implicit form for record equalling events

as well would b e misguiding however The need to keep rep orts concise must b e balanced with the need to

keep these two typ e of events unambiguously distinguishable It is in order to b e explicit ab out the equalling

typ e of the record up date event added in the fth increment that streak uses the less concise and more

complex Adjunctization revision

After the addition of this fth oating fact the draft is only two words away from the maximum length

of observed in the corpus Thus unless the next subelement in floatstackF can b e added with

only two more words it will not t in this lead sentence summary This next subelement is an additional

statistic the passing p erformance of Danny Ainge The most concise way it can b e accommo dated in the

2

cf sections B and B of App endix B for the denition of the thematic roles used in streak

draft is by revising the nominal realizing the scoring statistic of this player that was added during the

third revision increment and which is already reduced to the cardinal numb er streak applies the

revision rule Coordinative Conjoin of Nominal to this nominal yielding to add and assists

This revision thus adds three new words in b old while not deleting any and thus pushes the revised

draft over the length limit streak thus halts the revision pro cess without printing the draft resulting

from this nal revision As nal value the revise function returns the threelayer FD representing the

previous draft which was under the complexity limits Since this FD is very large its b o dy is not shown

in Fig C but its presence is signaled by the abbreviation SSS DSS The full detailed

b o dy for this nal draft representation is given in three gures each extending over several pages The

DSS layer of the draft in shown in Fig C its SSS layer in Fig C and its DGS layer in Fig C

For the sake of legibility the values in an outer layer that are inherited from an inner layer are shown as

copies in this series of gures However in the actual draft representation generated and then revised by

the system these values are really p ointers from the outer layer to the inner layer For example the value

of under the path circum time distinct cdr car predmodif score win value is the atom in

the DGS of Fig C In fact the actual representation generated by streak value is really the path

attr results ents score attrs win which p oints to the winning team score in the DSS of Fig C

C Example run

Choice of revision rule constrained by the surface form of the draft

Run shown in Fig C illustrates how in streak the choice of a revision rule to add a given oating

fact onto the draft is sensitive not only to the content of the draft but to its surface form as well It starts

with two calls to the function draft to build two alternative draft forms from the same input but a with

dierent realization constraint Run was already presented in Section C In this section I show the

output again but this time I also give

The threelayer internal representation of the two synonymous drafts which trigger the dierent revision

rules

The precondition of the Nominalization revision rule showing why it gets triggered for one of the

two drafts and not the other

The common input to these two revision stages is a DSS FD enco ding the four xed facts to convey in

b oth draft It is built by calling the function dssC and shown in Fig C For the rst call to draft the

realization constraint is formflag whose value was given in Fig C It constrains the output draftC

to express the game result as a ful l verb clause that follows the pattern WINNER fullverb LOSER and

is sub ordinated to the main statistic clause as a time adjunct In this particular run the full verb chosen

randomly by the lexicalizer is to beat

For the second call the realization constraint is formflag whose value also given in Fig C

constrains the output draftC to express the game result as a support verb clause following the pattern

3

WINNER supp ortverb LOSER nominal In this particular run the supp ortverbnominalhead collo ca

tion chosen randomly by the lexicalizer is to nail down a win

Once these two alternative synonymous draft sentences have b een built the function revise is then

called on each of them with the same additional oating fact adssC as second parameter revise is

the function to call for a single revision increment It implements the revision rule interpreter describ ed

in Section In contrast the function revise called for Run is used for chaining revisions and

implements the revision monitor describ ed in Section revise works by traversing the list of lists of

oating facts and rep eatedly calling revise on each fact adssC enco des a losing streak extension for the

Boston Celtics To incorp orate this oating fact to draftC streak uses the revision rule Nominalization

In contrast to incorp orate this same oating fact on the synonymous but linguistically distinct draftC

streak uses the revision rule Adjunctization In each case the choice of one revision rule over the other

3

Like formflag formflag also constrains the game result clause to b e sub ordinated to the main statistic clause as

a time adjunct

setf draftC draft dssC sss formflag Draft form

Hartford CT Karl Malone hit for points Friday night as the Utah Jazz beat the

Boston Celtics

SSS DSS

setf draftC draft dssC sss formflag Draft form

Hartford CT Karl Malone had points Friday night as the Utah Jazz nailed down a

win over the Boston Celtics

SSS DSS

revise draftC adssC Revising Draft using Nominalization

Hartford CT Karl Malone hit for points Friday night as the Utah Jazz brought the

Boston Celtics their sixth consecutive setback at home

SSS DSS

revise draftC adssC Revising Draft using Adjunctization

Hartford CT Karl Malone had points Friday night as the Utah Jazz extended the

Celtics homecourt losing streak to six with a win over Boston

SSS DSS

Figure C streak using dierent revision rules dep ending on the surface form of the draft

defconj nominalization

lhs bls materialbasicrescluster

adss attrs results losstreakext

tool nominalization

rhs

defconj materialbasicrescluster

cat under clause

partic agent sss sss root args agent

affected sss sss root args affected

predmodif score sss sss rels score

sss surfsemcat under rhetor

struct under hypotax

root surfsemcat under encyclo

args agent surfsemcat under encyclo

dss dss args winner

affected surfsemcat under encyclo

dss dss args loser

dss deepsemcat under relation

role under beat

args winner deepsemcat under entity

concept under team

loser deepsemcat under entity

concept under team

rels score surfsemcat under encyclo

dss deepsemcat under entity

concept under score

defconj losstreakext

streakext

rels streakgeneltloser deepsemcat under relation

role under loser

defconj streakext

ents streak streak

streakgenelt deepsemcat under entity

concept under game

rels streakgenelt deepsemcat under relation

role under genelt

args set ents streak

genelt ents streakgenelt

streakext deepsemcat under relation

role under streakextension

args extension under toplevel

streak ents streak

defconj streak deepsemcat under entity

concept under streak

attrs card GIVEN

Figure C Preconditions to using nominalization for losing streak extensions

deepsemcat entity

concept game token utaatbos

attrs setting ents addr deepsemcat entity

concept address token hartfordct

attrs city hartford state ct

date deepsemcat entity

concept date token frinite

attrs dayname friday daypart night

stats ents statca deepsemcat entity

concept player token kmalone

attrs firstname karl lastname malone

stat deepsemcat entity

concept gamestat token kmaloneptatbos

attrs value unit pt

rels stat deepsemcat relation

role gamestatrel token kmalonescoringatbos

args carrier sss dss attrs stats ents statca

stat sss dss attrs stats ents stat

results ents host DEEPSEMCAT entity

CONCEPT team token celts

attrs home boston franchise celtic

visitor DEEPSEMCAT entity

CONCEPT team token jazz

attrs home utah franchise jazz

score DEEPSEMCAT entity

CONCEPT score token utaatbosscore

attrs win lose

rels winner deepsemcat relation

role winner token utaatboswinner

args game toplevel

winner sss dss attrs results ents visitor

loser deepsemcat relation

role loser token utaatbosloser

args game toplevel

loser sss dss attrs results ents host

result DEEPSEMCAT relation

ROLE beat token utaatbosresult

ARGS WINNER SSS DSS ATTRS RESULTS ENTS VISITOR

LOSER SSS DSS ATTRS RESULTS ENTS HOST

Figure C DSS layer of b oth drafts in Run

surfsemcat rhetor

struct hypotax

root surfsemcat encyclo

onto event concept gamestatrel

args agent surfsemcat encyclo

onto indiv concept player

names firstname karl lastname malone

created surfsemcat encyclo

concept gamestat onto quantity

restrictors value unit pt

rels time surfsemcat rhetor

struct paratax rel temporalinclusion

elts car surfsemcat encyclo

concept date onto time

restrictors dayname friday daypart night

cdr car SURFSEMCAT rhetor

STRUCT hypotax

ROOT SURFSEMCAT encyclo

onto event concept beat

ARGS AGENT SURFSEMCAT encyclo

onto indiv concept team ref full

names home utah franchise jazz

AFFECTED SURFSEMCAT encyclo

onto indiv concept team

ref full

names home boston

franchise celtic

RELS SCORE SURFSEMCAT encyclo

ONTO quantity concept score

restrictors win lose

cdr none

location surfsemcat encyclo

concept address onto place

restrictors city hartford state ct

Figure C SSS layer of the rst initial draft in Run

cat clause

tense past

proc type material effecttype creative lex hit for

partic agent cat compoundproper

head cat personname

firstname lex karl lastname lex malone

created cat measure

quantity value unit lex point

circum location position header cat address city lex hartford

state lex ct

time cat list

fills timerel

distinct car cat date

dayname lex friday daypart lex night

cdr car CAT clause

tense past mood boundadverbial

position end binder lex as

proc type material lex beat

PARTIC AGENT cat compoundproper

head cat teamname

home lex utah

franchise lex jazz

number plural

AFFECTED cat compoundproper

head cat teamname

home lex boston

franchise lex celtic

number plural

PREDMODIF SCORE cat score

win value

lose value

cdr none

Figure C DGS layer of the rst initial draft in Run

surfsemcat rhetor

struct hypotax

root surfsemcat encyclo

onto event concept gamestatrel

args agent surfsemcat encyclo

onto indiv concept player

names firstname karl lastname malone

created surfsemcat encyclo

onto quantity concept gamestat

restrictors value unit pt

rels location surfsemcat encyclo

concept address onto place

restrictors city hartford state ct

time surfsemcat rhetor

struct paratax rel temporalinclusion

elts car surfsemcat encyclo

concept date onto time

restrictors dayname friday daypart night

cdr car SURFSEMCAT rhetor

STRUCT event ROOT activity

args agent surfsemcat encyclo

onto indiv concept team ref full

names home utah franchise jazz

range surfsemcat rhetor

struct hypotax alreadymapped winner

root surfsemcat encyclo

concept winner onto indiv

rels score surfsemcat encyclo

concept score onto quantity

restrictors win

lose

opposition

surfsemcat encyclo

onto indiv

concept team

ref full

names home boston

franchise celtic

cdr none

Figure C SSS layer of the second initial draft in Run

cat clause

tense past

proc type material effecttype creative lex have

partic agent cat compoundproper

head cat personname

firstname lex karl lastname lex malone

created cat measure

quantity value unit lex point

circum location cat address

position header city lex hartford state lex ct

time cat list

distinct car cat date

dayname lex friday daypart lex night

cdr car CAT clause

tense past mood boundadverbial

position end binder lex as

proc type material effective no lex nail down

PARTIC AGENT cat compoundproper

head cat teamname

home lex utah

franchise lex jazz

number plural

range cat common

definite no

classifier cat score

win value

lose value

qualifier

cat pp

prep lex over

np cat compoundproper

head cat teamname

home lex boston

franchise lex celtic

number plural

head lex win

cdr none

Figure C DGS layer of the second initial draft in Run

is motivated by the surface form of the resp ective drafts involved Nominalization realizes a new fact by

mo diers attached to a nominal resulting from the transformation of a fullverb clause into a supp ortverb

clause Adjunctization conversely replaces a supp ortverb clause by a fullverb clause incorp orating the

new fact by a fullverb with a new ob ject while displacing the original ob ject to an adjunct p osition Since

draftC follows a fullverb pattern only Nominalization and not Adjunctization is applicable to it For

draftC following a supp ort verb pattern it is just the opp osite There is no game result NP in draftC

to b e adjunctized and no game result full verb in draftC to b e nominalized It is precisely b ecause the

applicability of revision rules such as the two ab ove is dep endent on surface form that the presence of the

DGS layer in the draft representation of streak is required

The surface form constraints in the precondition of Nominalization are given in Fig C The common

DSS for b oth draftC and draftC is given in Fig C The SSS of each draft form are resp ectively given

in gures C and C and the corresp onding DGS in gures C and C In each of these FDs the

attributes that are tested by the precondition of the nominalization revision rules are upp ercased An

example attribute of draftC whose value matches a precondition to the application of Nominalization

is the attribute struct under the path rels time elts cdr car in Fig C containing the SSS layer

of draftC Its value hypotax matches that of the corresp onding attribute of the Nominalization pre

condition shown on line in Fig C Now consider the same attribute in Fig C containing the

SSS layer of draftC Its value is event thus causing the unication of draftC with the precondition

of Nominalization to fail This example also illustrates the structure traversal of the drafts DGS and SSS

layers that the revision rule interpreter p erforms while trying to apply a given revision rule Note how the

match for draftC and mismatch for draftC just discussed o ccur when the SSS part of precondition for

the Nominalization revision rule is unied with the subFD under the path rels time elts cdr car in

the SSS layer of the draft This unication is attempted only after failure to unify the precondition of the

4

revision rule with any emb edding subFD

C Additional example runs

In the following sections I present three additional example runs describing other interesting asp ects of

streak The rst of these additional runs Run consists of draft building stages that illustrate the

paraphrasing p ower enco ded in the phrase planner and the lexicalizer It shows a variety of draft forms

generated from the same set of input xed facts The next additional run Run consists of parallel revision

increments that illustrates the paraphrasing p ower enco ded in the reviser It shows how the same oating fact

can b e incorp orated into the same initial draft form by using dierent revision rules resulting in a variety of

revised draft forms The last additional run Run provides another example incremental complex sentence

generation Whereas Run illustrated the variety of revision rules by applying a dierent rule at each

increment Run illustrates the applicability of the same rules in dierent textual contexts by rep eatedly

applying the same rules at a dierent levels inside the draft structure

C Example run Generating draft paraphrases

As explained in Section there are multiple sources of paraphrasing p ower in streak There are

two such sources at play at draft time alternative phrase planning rules and alternative lexicalization rules

Run shown in Fig C illustrates the eects of these two sources It contains draft building stages

from the same semantic input This input generated by a call to the function dssE is similar to the draft

stage input dssF of Run that was shown in Fig C It contains exactly the same concepts only dierent

instances of each

Each draft in Run is built by calling the function draft Paraphrases to are pro duced by constraining

the syntagmatic structure of the draft using the keyword parameter sss with one of the three form ags

given in Fig C formflag used in paraphrases and forces streak to cho ose an hyp otactic

toplevel structure where the main statistic is the matrix clause and the game result is a dep endent clause

4

Or any revision rule dep ending whether a sp ecic revision rule is indicated in the input to the reviser

progn draft dssE sss formflag values Paraphrase

Orlando FL Shaquille ONeal rattled off points Friday night as

the Orlando Magic beat the Toronto Raptors

progn draft dssE sss formflag values Paraphrase

Orlando FL Shaquille ONeal tossed in points Friday night as the

Orlando Magic posted a triumph against the Toronto Raptors

progn draft dssE sss formflag values Paraphrase

Orlando FL Shaquille ONeal chipped in points Friday night

rallying the Orlando Magic to a win over the Toronto Raptors

progn draft dssE sss formflag values Paraphrase

Orlando FL Shaquille ONeal hit for points Friday night while the

Orlando Magic routed the Toronto Raptors

progn draft dssE sss formflag values Paraphrase

Orlando FL Shaquille ONeal tallied points Friday night while the

Orlando Magic clinched a victory over the Toronto Raptors

progn draft dssE sss formflag values Paraphrase

Orlando FL Shaquille ONeal finished with points Friday night

pacing the Orlando Magic to a win over the Toronto Raptors

progn draft dssE sss formflag values Paraphrase

Orlando FL Shaquille ONeal stroke for points Friday night as the

Orlando Magic triumphed over the Toronto Raptors

progn draft dssE sss formflag values Paraphrase

Orlando FL Shaquille ONeal fired in points Friday night as the

Orlando Magic recorded a victory over the Toronto Raptors

progn draft dssE sss formflag values Paraphrase

Orlando FL Shaquille ONeal stroke for points Friday night

fueling the Orlando Magic to a triumph over the Toronto Raptors

progn draft dssE values Paraphrase

Orlando FL Shaquille ONeal fired in points Friday night pushing

the Orlando Magic to a triumph over the Toronto Raptors

progn draft dssE values Paraphrase

Orlando FL Shaquille ONeal pumped in points Friday night

lifting the Orlando Magic to a triumph over the Toronto Raptors

progn draft dssE values Paraphrase

Orlando FL Shaquille ONeal totaled points Friday night while the

Orlando Magic coasted past the Toronto Raptors

progn draft dssE values Paraphrase

Orlando FL Shaquille ONeal logged points Friday night while the

Orlando Magic beat the Toronto Raptors

Figure C streak generating draft paraphrases

It also forces to link these two clauses by a temp oral conjunction and the game result to b e expressed using a

ful l verb clause patterns The dierences b etween these three paraphrases is thus limited to the paradigmatic

choices of op enclass words tting in this structure For example the verb expressing the main statistic to

rattle o in to hit for in and to strike for in the conjunction linking the matrix and dep endent

clauses as in and vs while in and the verb expressing the game result to beat in to rout

in and to triumph over in

formflag used in paraphrase and sp ecies the same toplevel structure than formflag but

forces the game result to b e expressed using a support verb clause pattern The paradigmatic variations

within this pattern concern the choice of supp ort verb to post in to clinch in and to record in

the choice of head noun in the ob ject NP of this supp ort verb triumph in vs victory in and and

the choice of prep osition intro ducing the losing team against in vs over in and

formflag used in paraphrases and forces streak to link the main statistic clause to the

expression of the game result by a coevent clause The paradigmatic variety in this case is the same than

for formflag with additional choice of the head verb for the linking coevent clause to ral ly in to

pace in and to fuel in

The paradigmatic variations in all paraphrases of Fig C result from random choices of lexicalization

rules In paraphrases to the syntagmatic variations that were constrained by a formag in paraphrase

to also result from random choices as well of phrase planning rules since the function draft is called

without an sss keyword parameter In the rst two of these randomly structured drafts a coevent clause

is chosen to link the game result to the main statistic whereas in the last two a temp oral conjunction is

chosen

C Example run Using revision to generate paraphrases

Having illustrated the two sources of paraphrasing p ower onto which streak can rely to vary its output

during the drafting stage I now turn to the four such sources onto which streak can rely during a revision

increment alternative revision rules alternative draft constituents on which to p erform the revision and

again alternative phrase planning rules and lexicalization rules for realizing the oating fact in the context

of the revision

Run shown in Fig C illustrates the eects of these four sources It contains a call to the function

revisepara which allows for parallel revision increments all incorp orating the same additional fact to

the same draft but each with a dierent revision rule It thus take as input three parameters a draft an

additional content unit and a list of revision rule names revisepara rst prints its draft parameter It

5

then traverses its revision rule name list and rep eatedly calls the function revise cf sections C for a

description of this function to pro duce a dierent revision of the draft incorp orating the additional content

6

unit In Run the draft is a basic sentence containing only the four xed concepts An example of semantic

input for this concept combination was given in Fig C The main clause realizes the main statistic and a

coevent dep endent clause links this main statistic to the nominal realizing the game result The additional

content unit is of typ e losing streak extension An example semantic input for this concept was given in

Fig C

The rst two elements in the revision rule name list rulesE are Appositive Conjoin of Nominal The

dierence b etween Drafta and Draftb illustrates how streak can generate paraphrases by applying the

same revision rule at dierent levels inside the draft structure Drafta results from the application of this

app osition revision rule to the b ottom level prop er nominal the Toronto Raptors of Draft This nominal

refers to the losing team whose streak was extended by the result of the rep orted game It is emb edded

as a qualifying PP complement in another nominal conveying this game result a triumph over the

Toronto Raptors Draftb results from the application of the same app osition revision rule but this time onto

this emb edding game result nominal as a whole In these two examples the phrase planner and lexicalizer

building the linguistic form of the added constituent resp ectively pro duce losers of seven straight in Drafta

5

6

revisepara works by sideeects and always returns nil

revisepara draft dssE sss formflag adssE rulesE

Draft

Orlando FL Shaquille ONeal hit for points Friday night powering the Orlando

Magic to a triumph over the Toronto Raptors

Draft a

Orlando FL Shaquille ONeal hit for points Friday night powering the Orlando

Magic to a triumph over the Toronto Raptors losers of seven straight

Draft b

Orlando FL Shaquille ONeal hit for points Friday night powering the Orlando

Magic to a triumph over the Raptors Torontos seventh setback in a row

Draft c

Orlando FL Shaquille ONeal hit for points Friday night powering the Orlando

Magic to a triumph over the Toronto Raptors who lost their seventh straight

Draft d

Orlando FL Shaquille ONeal hit for points Friday night powering the Orlando

Magic to a triumph over Toronto which sent the Raptors to their seventh loss

in a row

Draft e

Orlando FL Shaquille ONeal hit for points Friday night powering the Orlando

Magic to a triumph over Toronto handing the Raptors their seventh defeat in a

row

Draft f

Orlando FL Shaquille ONeal hit for points Friday night powering the Orlando

Magic to a triumph over Toronto and riding the Raptors losing streak to

seven

NIL

Figure C streak generating paraphrases by using a variety of revision rules

and Torontos seventh setback in a row These two forms corresp ond to the application of dierent phrase

planning rules For example whereas in the rst the streak semantic element is mapp ed onto a premo difying

adjective straight in the second it is mapp ed onto a synonymous p ostmo difying PP in a row

The subsequent two elements in the revision rule name list are Adjoin of Relative Clause to Nominal

This rule is applied to the same two constituents onto which Appositive Conjoin of Nominal where pre

viously applied resp ectively resulting in Draftc and Draftd Again the dierence in draft constituents onto

which the rule is applied results in dierent linguistic forms for the new constituent attached by the revision

For example in the relative clause who lost their seventh straight adjoined to the emb edded prop er nominal

the Toronto Raptors the losing nature of the streak is expressed by a verb underlined whereas in the

adjoined to the emb edding common nominal relative clause which sent the Raptors to their seventh loss

a triumph over the Toronto Raptors it is expressed by a noun underlined

The next element in the revision rule name list rulesE is Adjoin of Nonfinite Clause to Nominal

It diers from the previous rule only in terms of the typ e of clause adjoined by the revision Drafte results

from the application of this rule to the same common nominal constituent onto which b oth Appositive

Conjoin of Nominal and Adjoin of Relative Clause to Nominal were applied to resp ectively generate

Draftb and Draftc The contrast b etween Draftb Draftd and Drafte highlights the paraphrasing p ower

provided by having a variety of revision rules applicable for the same purp ose in the same context The

in a contrast b etween loss in the supplemental relative clause which sent the Raptors to their seventh loss

row of Draftd and defeat in the supplemental nonnite clause handing the Raptors their seventh defeat

in a row in Drafte illustrates how random choices of lexicalization rules also add to streak paraphrasing

p ower during revision

The next element in the revision rule name list rulesE is Coordinative Conjoin of Clauses This

rule is applied to a higher level draft constituent than in any of the previous revision increments of this

example run the whole coevent clause linking the game result to the main statistic This constituent is

four level higher in the draft syntactic three than the constituent onto which the revision rule was applied

than to generate Drafta and Draftc

C Example run

Applying the same revision rules in dierent contexts

The last example run shown in Fig C provides another case of incremental complex sentence gen

eration As opp osed to the rst case Run presented in Section C where a dierent revision rule

was applied at each increment two revision rules are applied twice each during the run in dierent textual

contexts provided by the evolving draft During the revision from Draft to Draft the rule Coordinative

Conjoin of Clause is applied to the toplevel clause of the draft expressing the main statistic This revision

adds a new player statistic After the revision the dep endent clauses that were sub ordinated to this main

statistic clause b ecome sub ordinate to the whole new conjunction Karl Malone provided points and

John Stockton added During the revision from Draft to Draft the same revision rule is applied

again to add another statistic by the same player This time the application is lo cal to the second element

of the conjunction built by the rst application of that rule It yields the emb edded conjunction John

Stockton added and had assists Similarly the revision rule Adjoin of Classifier to Nominal

is applied twice to add an historical fact of typ e record It is rst applied during the revision from Draft

to Draft to the NP summarizing Sto cktons scoring p erformance and then during Draft to Draft

to the NP assist summarizing Sto cktons passing p erformance The last revision in this run adds a

historical fact of typ e winning streak extension It uses the revision rule Appositive Conjoin of Nominal

revise draft dssA sss formflag floatstackA verbose T

Draft

Los Angeles Karl Malone provided points Saturday fueling the Utah Jazz to a

win against the Los Angeles Clippers

Draft lexnum depth

Los Angeles Karl Malone provided points and John Stockton added Saturday

fueling the Utah Jazz to a win against the Los Angeles Clippers

Draft lexnum depth

Los Angeles Karl Malone provided points and John Stockton added a season record

Saturday fueling the Utah Jazz to a win against the Los Angeles Clippers

Draft lexnum depth

Los Angeles Karl Malone provided points and John Stockton added a season record

and had assists Saturday fueling the Utah Jazz to a win against the

Los Angeles Clippers

Draft lexnum depth

Los Angeles Karl Malone provided points and John Stockton added a season record

and had a league high assists Saturday fueling the Utah Jazz to a win

against the Los Angeles Clippers

Draft lexnum depth

Los Angeles Karl Malone provided points and John Stockton added a season record

and had a league high assists Saturday fueling the Utah Jazz to their fourth

straight win a win against the Los Angeles Clippers

SSS DSS

Figure C Complex sentence generation where the same revision rule is used at dierent draft levels

App endix D

Partially automating corpus analysis

the CREP software to ol

A p ervasive subtask of corpus analysis for generation knowledge acquisition is to search the corpus for

o ccurrences of sp ecic lexical items andor syntactic forms These items and forms can b e searched for their

own sake or as marks of a semantic message class Such a search can b e done by hand or by writing a scanning

program sp ecic to the search After exp erimenting with b oth metho ds I felt the need for a software to ol

automatically pro ducing and running a scanning program from a exible and highlevel sp ecication of the

items to search for

To address this need I initiated and sup ervised the development of crep a system that retrieves in

a corpus all the sentences containing a lexicosyntactic pattern sp ecied as a regular expression of words

andor partofsp eech tags crep was implemented by Duford It is written on top of flex and currently

uses Churchs statistical tagger as the default partofsp eech tagger

In this app endix I give an overview of crep and its use in assisting corp ora analysis I rst briey survey

the syntax of crep I then discuss in detail three examples of corpus data analysis using crep The rst

example illustrates usage of crep during the knowledge acquisition phase of the development of a generation

system It shows how to use crep for the acquisition of lexical entries for a generator The two other

examples usage of crep during the evaluation phase of the development of a generation system The second

example shows how to use crep to search a test corpus for o ccurrences of a given realization pattern ie

a given linguistic expression of a given domain concept combination The pro cess of enco ding a realization

pattern as a crep expression is detailed on this second example The third example shows how to use crep

to assess the p ortability of the knowledge structures used by a generator in the case at hand the revision

1

as a pair of crep expression is rules of streak The pro cess of enco ding the signature of a revision rule

detailed in this third example

D A brief overview of CREP syntax

In a crep expression the following op erators can b e used over words over tags and recursively over crep

expressions

exp exp sp ecifying simple coo ccurrence of exp and exp

exp exp constraining exp to app ear b efore exp

exp N exp constraining exp to app ear at a minimum distance of N words b efore exp

exp N exp constraining exp to app ear at a maximum distance of N words b efore exp

1

See Section for the denition of the signature of a revision rule

exp N exp constraining exp to app ear at the exact distance of N words b efore exp

exp j exp sp ecifying the o ccurrence of either exp or exp

exp sp ecifying the optional presence of exp

A crep expression can also contain

The BEG keyword sp ecifying the b eginning of sentence p osition

The END keyword sp ecifying the end of sentence p osition

The escap e op erator indicating that the expression surrounded by this op erator should b e

interpreted in flex syntax instead of crep syntax this escaping op erator makes all the flex op erators

usable in crep expressions

In order to mo dularly develop crep expressions and reuse them in many searches they can b e put in a

denition le that crep accepts as input in addition to the main expression and the corpus to search The

other facilities of crep include

An option to output the sp ecic substring that matched the input expression in addition to the whole

sentence containing it

An option to lter sentences matching dierent expressions in dierent les all at once This option

also allows simulation of at most one exactly one and zero ie negation semantics with resp ect

to the numb er of input expression matches inside a sentence the default semantics b eing at least

one

An option to overwrite the built in sentence delimiter using declarative rules

See Duford for a detailed presentation of these op erators and options with many examples

D Using CREP for lexical knowledge acquisition

Its exibility and userfriendliness make crep a very useful to ol at any p oint during the development of a

corpusbased language generation application In the present work crep proved useful b eyond the systematic

typ e of searches involved in the evaluation phase and exemplied in the next section I also used it for the

opportunistic typ e of searches needed during the system implementation phase

For example the following situation arose during the implementation of streaks surface mapping rules

One of these rules sp ecies what verbs can b e used to express the reb ounding statistic of a player At the

moment of writing this rule I rememb ered three such verbs

to grab sp ecic to reb ounding statistics

to have general for all typ es of statistics

to add general for all typ es of statistics but invalid as the op ening statistic of the rep ort

In order to widen the lexical coverage of streak I searched for alternative verbs lexicalizing a reb ounding

statistics in the corpus The crep command to p erform this search was

cat base crep w e PLAYER REB d defexprdef

which instructs crep to retrieve all the corpus sentences where an expression referring to a player is followed

at a distance of at least four words by the expression of a word used to refer to reb ounds The detailed

meaning of the options and parameters inside this command is the following

base is the path of the directory containing the corpus les

w is the crep option to output the matching expression together sentence containing it

BOSTON UPI Kevin Gamble scored p oints Monday night and Jo e Kleine added a careerhigh reb ounds

sending the Boston Celtics to a victory over the Sacramento Kings for their eighth straight triumph

Jo eNPNP KleineNPNP addedVBD aAT careerhighJJ CD reb ounds

MIAMI UPI scored p oints and grabb ed a franchiserecord reb ounds Wednesday

night to lead the Miami Heat to their seasonhigh fourth straight win over the Washington Bullets

RonyNPNP SeikalyNPNP grabb edVBD aAT franchiserecordJJ CD reb ounds

AUBURN HILLS Mich UPI Jo e Dumars scored p oints and Dennis Ro dman hauled in reb ounds Friday

night to lead the Detroit Pistons to their fth straight victory a win over the Cleveland Cavaliers

DennisNPNP Ro dmanNPNP hauledVBN inIN CD reb ounds

ATLANTA UPI scored p oints and Kevin Willis pulled down reb ounds to send the

Atlanta Hawks to a victory over the Los Angeles Clipp ers Saturday night

KevinNPNP WillisNPNP pulledVBD downIN CD reb ounds

DENVER UPI Ro okies Mark Macon scored p oints and Dikemb e Mutomb o ripp ed down reb ounds Thurs

day night leading the Denver Nuggets to an victory over the Minnesota Timb erwolves

Dikemb eNPNP Mutomb oNPNP ripp edVBD downIN CD reb ounds

Figure D CREP search result for reb ounding statistic verbs

e is the crep option intro ducing the main expression parameter

d is the crep option intro ducing the denition le parameter

defexprdef is the path of the denition le which must contain denitions for b oth PLAYER and

REB

The entries for PLAYER and REB in the denition les were

PLAYER AZazAZNNSNP NPNP NPNP

REB reboundsboards

The entry for player is surrounded by the escap e op erator to indicate that it is written in flex

syntax It roughly means a reference to a player is dened as either a word starting with a capital letter and

2

tagged as a prop er noun or as two consecutive such words The denition for REB lists the two synonym

units used for reb ounding p erformances

Part of the output of this search is given in g D The rst lines of each match contains the whole

corpus sentence retrieved and the last line the particular phrase that caused the match This output shows

that three additional verbs to express reb ounding statistics were uncovered to pul l down to haul in

and to rip down These verbs could then b e included in the surface mapping rule of streak for the

lexicalization of reb ounding statistics

D Enco ding realization patterns as CREP expressions

The rst step towards computing the prop ortion of clusters in the rst test corpus corresp onding to usage

of realization patterns abstracted from the acquisition corpus was to enco de each realization pattern as a

2

Or a noun phrase to match mistagged prop er nouns

crep expression

Recall from Section that a realization pattern captures the mapping from a semantic unit combi

nation onto a particular syntactic structure It sp ecies the syntactic category used to express each semantic

unit and the structural dep endencies b etween these categories For example the realization pattern R in

g D sp ecies that a game result is expressed by a clause and a streak by a PP It also sp ecies that

the game result clause is the head constituent and the streak PP is attached to it as an adjunct This is in

contrast with pattern R where it is the streak that is expressed by the head clause and the game result by

an adjunct PP

Realization patterns abstract away from domain references sp ecic lexical items and lowlevel syntactic

variations Contrast for example the two corpus phrases given for each pattern in g D Enco ding a

realization pattern as a crep expression therefore involves two steps

Write a syntagmatic pattern with most slots lled with a crep subexpression

List the paradigmatic synonyms for each subexpression in the denition le

Step is straightforward It involves following the realization pattern from left to right and adjusting

the distance op erator parameters to account for lowlevel syntactic variations For example the main crep

expression E for the realization pattern R is given in gure D with two matching corpus sentences aligned

b elow it To follow this example note that crep syntax relies on the following conventions

Upp ercase distinguish subexpressions from literals

The sign separates a word from its partofsp eech tag

A blank space separates a words partofsp eech tag from the next word

Step requires more work The denitions for the subexpressions app earing E are given in g D

Writing such subexpression denitions involves identifying words andor phrases satisfying a combination

EXTEND stands for verb syntactic of semantic and syntactic constraints For example the subexpression V

constraint expressing a temp oral extension semantic constraint It covers ve verbs to extend to

prolong to stretch to ride and to increase which are synonyms for this particular sense

To avoid missing clusters corresp onding to a realization pattern requires exhaustive coverage of the

vo cabulary in the denition le Such exhaustiveness can only b e attained incrementally starting from

a manually dened b o otstrapping vo cabulary subset and then rep eating corpus searches with crep sub

expressions containing wildcards For example the only nouns initially known for a win were win and

victory A corpus search with the expression

E an SCORE OVER TEAM

a subexpression of E where no constraint is sp ecied for the word app earing b etween SCORE and OVER

returned sentences like the two b elow where the phrases matching the expression are in italics Indianap olis

UPI and scored p oints apiece Sunday allowing the Indiana Pacers to

extend their franchise record home winning streak to with a triumph over the Philadelphia

ers

or

Los Angeles UPI Michael Jordan scored p oints and grabb ed reb ounds and Scottie Pipp en added

p oints Tuesday night to help the Chicago Bulls extend their winning streak to games with a

rout of the Los Angeles Clippers

WIN with the words triumph rout drubbing Such sentences allowed completing the denition of N

blowing out decision romp upset humiliation trouncing and defeat These nouns are not

all complete synonyms since some of them are appropriate for dierently restricted score ranges However

3

they all ll the same slot in E

2

3

Note that defeat which a priori expresses a loss expresses a win in the particular context of this realization pattern where

a SCORE defeat of TEAM can substitute for a SCORE win over TEAM

Realization pattern R for the content unit combination

gameresultwinnerloserscorestreakwinnerasp ectresulttyp elength

winner gameresult loser score length streakasp ect typ e

agent pro cess aected score result

arg head arg adjunct adjunct

prop er verb prop er numb er PP

prep det ordinal adj noun

Chicago b eat Pho enix for its rd straight win

New York defeated Seattle for its th consecutive victory

Realization pattern R for the same combination

winner asp ect typ e streak length score gameresult loser

agent pro cess aectedlocated lo cation means

arg head arg adjunct adjunct

prop er verb NP PP PP

det participle noun prep det numb er noun PP

Utah extended its win streak to games with a triumph over Denver

Boston stretching its winning spree to outings with a rout of Utah

Figure D Realization pattern examples repro duced from Section

TEAM V EXTEND DET W STREAK to CARD GAMES

Utah extended its win streak to games

Boston stretched its winning spree to outings

with an SCORE N WIN OVER TEAM

with a triumph over Denver

with a rout of Utah

Figure D CREP expression E for realization pattern R

Reference to a team name of a city or name of a franchise or both

TEAM CITYthe CITY FRANCHISE

CITY BostonNew YorkNew JerseyWashingtonPhiladelphia

CharlotteOrlandoMiamiAtlantaIndianaClevelandDetroitChicago

MilwaukeeMinnesotaDenverSan AntonioDallasHoustonPhoenix

UtahLos AngelesGolden StatePortlandSeattleSacramento

FRANCHISE ersCelticsKnicksNetsBulletsSixersHornetsMagicHeatHawks

PacersCavaliersPistonsBullsBucksTimberwolvesNuggetsSpurs

MavericksRocketsSunsJazzLakersClippersWarriors

Trail BlazersSupersonicsSuper SonicsSonicsKings

Verbs of temporal extension

VEXTEND EXTENDPROLONGSTRETCHRIDEINCREASE

EXTEND to extendextendingextended

PROLONG to prolongprolongingprolonged

STRETCH to stretchstretchingstretched

RIDE to rideridingrode

INCREASE to increaseincreasingincreased

Determiners articles or possessive pronouns

DET ARTICLEPOSS

ARTICLE anthe

POSS myyourherhisitsourtheir own

Synonyms for win streak

WINSTREAK winning streakspreeflurryseries

Cardinal numbers

CARD onetwothreefourfivesixseveneightnineteneleven

Synonyms for game

GAMES gamedecisionoutingcontests

Final score of a game

SCORE COMMA CARD CARDCD

CD COMMA

Nouns expressing a victory

NWIN victorywinblowoutblow outdefeatroutdrubbingtriumph

decisionrompupset

Preposition introducing losing team in NPs expressing game results

OVER overversusagainstof

Figure D crep subexpression denitions for realization pattern R

To insure the completeness of the subexpression denitions and the correctness of the main expressions

they were tested rst on the acquisition corpus It is only after these expressions yielded exactly the same

4

results as those of the manual analysis on the acquisition corpus that they were run on the rst test corpus

For such a systematic run the crep package includes a sp ecial shell taking as input a le where each

expression E corresp onding to a given realization pattern is paired with a le name F For each E F

pair this shell redirects the test corpus sentences matching E into F It also redirects the sentences matching

none of the expressions into a nomatch le Manual analysis of the nomatch le is then required to get the

values of the evaluation parameters

D Porting the signature of a revision to ol a detailed example

I now illustrate the pro cess of p orting the crep expression pairs forming the approximate signature of a

revision to ol for the case of Adjoin of CoEvent Clause to Clause The original acquisition domain

signature of this to ol is shown in Fig D This gure contains only the subexpressions diering from those

for the signature of Adjoin of Relative Clause to Top NP which were already given in Fig D of the

previous section The corresp onding test domain signature for the same to ol is shown in Fig D Note

that for monotonic revisions like this one the target pattern needs not contain the entire revised phrase but

only the subphrase added to the source pattern by the revision

An acquisition corpus cluster for each of the two target patterns of the original signature is given b elow

For TARGETB A Utah Jazz hold on to defeat the Los Angeles Lakers extending their

1

home winning streak to games

For TARGETB A the Los Angeles Clippers defeat the Knicks snapping a game

2

losing streak at the hands of New York dating back to February

There are three main dierences b etween these two target patterns

Streak length expression as a clausemo difying PP in TARGETB vs as an NP premo dier in

TARGETB

Streak up date typ e extension in TARGETB vs interruption in TARGETB

Streak result typ e winning in TARGETB vs losing in TARGETB

5

Since these dierences are indep endent these two patterns can b e merged in the more general

TARGETC VGEXTENDVGEND DET STREAKLENGTH WINSTREAKLOSESTREAK

This rst generalization is motivated by the desire to factor out acquisition corpus artifacts during the

pro cess of p orting the signature to the test corpus and thus enhance the chances of nding a matching

test sentence Note that TARGETB also covers streaks of unsp ecied length since b oth alternative

expressions for streak length are left optional which is also desirable although the length was always

sp ecied in the acquisition domain it may not b e the case in the test domain On the test domain corpus

TARGETC matched the following sentences

soared p oints to T The bluechip Hang Seng Index which sank p oints Friday

1

snapping its three session losing streak

T In Australia sto ck prices rallied in active trading on the Sydney Sto ck Exchange snapping a

2

two day decline

4

Some result discrepancies b etween the manual analysis and the crep expression runs on the acquisition corpus uncovered

errors in the former

5

This more general pattern is glossed in the comments of Fig D

Original Signature sourceBtargetB sourceBtargetB

Acquisition domain source pattern team reference followed by a verb conveying

a positive outcome for that team another team reference and a score

SOURCEB TEAM VWIN TEAM SCORE

st acquisition domain target pattern expresses winning streak extension

TARGETB VGEXTEND DET WINSTREAK IN STREAKLENGTH

st acquisition domain target pattern expresses losing streak interruption

TARGETB VGEND DET STREAKLENGTH LOSESTREAK

Extension verbs

VGEXTEND extendingstretchingprolongingriding

Interruption verbs

VGEND snappingendingbreakingstoppinghaltinginterrupting

Verbs conveying a positive outcome for the team reference preceding them

VWIN heldhold on VIWINVPWIN

VIWIN toTO defeatbeatdownedgepoundroutecruise past

VPWIN defeatedbeatdownededgedpoundedroutedcruised past

Nominals conveying streaks

WINSTREAK winning streakspreeflurryseries

LOSESTREAK slidelosing streaklosing skiddroughtslump

Constructs expressing the length of a streak

STREAKLENGTH CARDPREMOD GAME

Cardinal number in premodifying position

CARDPREMOD twothreefourfivesixseveneightnineteneleven

Synonyms for game

GAME gamedecisionoutingcontestsessionday

Figure D Original signature of Adjoin of CoEvent Clause to Clause in sp orts domain

Ported Signature sourceCtargetC

Ported source pattern a financial indicator reference followed by a verb used in

the stock market domain to express a positive result and a score quantifying that

result

SOURCEC INDICATOR SMVWIN SMSCORE

Ported st target pattern an ing verb conveying an extension or interruption

followed by a determiner possibly a streak length and a noun conveying a streak

TARGETC VGEXTENDVGEND DET STREAKLENGTH WINSTREAKLOSESTREAK

Stock market score

SMSCORE SMCARD SMUNIT SMCARD

SMUNIT percentpointsshares

SMCARD millionbillionCARD COMMA CARDCARD

Verbs expressing a positive outcome for the financial indicator preceding them

SMWWIN heldhold on SMVIWINSMVPWIN

SMVIWIN risinggainingjumpingincreasingclimbingsoaring

SMVPWIN rosegainedjumpedincreasedclimbedsoared

Financial indicator reference

INDICATOR stock pricesthe marketvolumestocksINDEX

INDEX Tthe ICPREMOD IPPREMOD INDEXHEAD IPOSTMOD

ICPREMOD keybluechipbroadernarrowerbroaderbased

IPPREMOD Hang SengNikkeiTokyoKoreanSingaporeDow Jones

IPHEAD IindexAaverage

IPPOSTMOD of CARD IndustrialsSelected Issues

Figure D Signature of Adjoin of CoEvent Clause to Clause p orted to the nancial domain

T The key All Ordinaries Index which eased p oints Wednesday rose p oints to

3

snapping its seven day losing streak

Searching for use of a to ol in the test domain starts by considering each target pattern in turn p orting

it to the test domain and analyzing the test sentences it matches Presence in the test corpus of a sentence

matching the p orted target pattern constitutes in itself reasonable evidence for the usage of the revision to ol

For example sentences T ab ove constitute reasonable evidence for the usage of Adjoin of Clause as

13

Coevent in Clause in the sto ck market domain

This evidence is conrmed by the copresence in the test corpus of a surface decrement for any sentence

matched by the p orted target pattern Finding such surface decrement also requires p orting the source

patterns paired with the matching target pattern in the revision signature This p orting pro cess can b e

guided by comparing the test corpus phrases matching the p orted target signature to phrases that parallel

them in the acquisition domain In the case at hand this means p orting the source pattern SOURCEB

of Fig by comparing sentences T and T ab ove which share the same source syntactic structure to

1 3

corresp onding acquisition domain phrases such as

A helping the Los Angeles Clippers defeat the Knicks snapping a game losing streak

2

at the hands of New York dating back to February

This source pattern matches expressions of the gameresult concept of the acquisition domain However

there are conceptual discrepancies b etween gameresult and its test domain counterpart variation First

variation is missing an aected role and the second TEAM subexpression of SOURCEB needs to b e

suppressed Second the agent role of variation is lled by a nancial indicator instead of a basketball

team So the rst TEAM subexpression of SOURCEB which matches basketball team names needs to

b e replaced by an INDICATOR subexpression matching nancial indicators names Dening this new

subexpression requires acquiring indicator names which tend to b e fairly complex eg the Dow Jones

average of industrials See the subexpressions b elow INDICATOR in Fig D for their enco ding After

these conceptual discrepancies have b een accounted for the partially p orted source pattern has b ecome

SOURCEB INDICATOR VWIN SCORE

As is this expression cannot yet match any sto ck market corpus sentence First due to the suppression

of the aected role in the test domain dierent verbs express the p ositive result in each domain transitive

instead of intransitive ones eg the Los Angeles Clippers defeated verbs eg The Gold Index soared

the New York Knicks The subexpression VWIN covering the transitive verbs of the acquisition domain

must thus b e replaced by a subexpression SMVWIN covering the intransitive ones of the test domain

Second the quantities and units in basketball games scores eg dier from those in nancial

indicator scores eg points to The subexpression SCORE in SOURCEB covering

basketball scores thus needs to b e replaced by a subexpression SMSCORE covering sto ck market scores

6

Finally in test corpus a historical fact may b e attached as relative clause to the NP referring to the

indicator whose score is rep orted For cases like these where the only knowledge of the phenomenon comes

from the few sentences that matched the p orted target pattern adjusting the source pattern can only b e

carried out incrementally using trialanderror until a satisfying match is found Since in b oth sentences

7

T and T ab ove the additional relative clause underlined was seven word long the rst p orted source

1 3

expression tried was one allowing up to words b etween the sub ject name and the verb

SOURCEC INDICATOR SMVWIN SMSCORE

This expression matched the source test domain sentence

The key Straits Times Industrials Index which plummeted points Thursday soared points

to

6

Not b elonging to the record or streak classes of historical facts studied in this thesis

7

Counting punctuation marks

Since it matches some test corpus sentences the pair SOURCEC TARGETC constitutes the

nal approximate p orted signature of the revision to ol Adjoin of CoEvent Clause to Clause The goal

of this sto ck market domain analysis was not to discover new realization patterns Instead it was only to

evaluate the p ortability of the revison to ols already discovered in the basketball domain Therefore for a

given revision to ol once one source target pair of realization patterns has b een successfully p orted

there is no need to consider the other such pairs asso ciated with this to ol One can instead pro ceed to the

next revision to ol

Biblio graphy

Ab ella A Ab ella From pictures to words generating lo cative descriptions of ob jects in an image In

Proceedings of the Image Understanding Workshop Monterrey CA Novemb er

Anderson D Anderson Contemporary sports reporting NelsonHall Chicago IL

Andre et al E Andre W Finkler T Graf A Rist A Schauder and W Wahlster WIP the

automatic synthesis of multimedia presentation In M Maybury editor Intel ligent Multimedia Interfaces

AAAI Press Cambridge MA

App elt D App elt Planning Natural Language Utterances Studies in Natural Language Pro cessing

Cambridge University Press

Bateman et al JA Bateman RT Kasp er JD Mo ore and RA Whitney A general organization

of knowledge for natural language pro cessing the PENMAN Upp erMo del Technical rep ort ISI Marina

del Rey CA

Borko H Borko Abstracting Concepts and Methods Academic Press New York NY

Bourb eau et al L Bourb eau D Carcagno E Goldb erg R Kittredge and A Polguere Bilingual

generation of weather forecasts in an op erations environment In Proceedings of the th International

Conference on Computational Linguistics Helsinki University Finland COLING

Bresnan and Kaplan J Bresnan and R Kaplan Lexical Functional Grammars a formal system for

grammatical representation In J Bresnan editor The mental representation of grammatical relations

MIT Press Cambridge MA

Carcagno and Iordanska ja D Carcagno and L Iordanska ja Content determination and text struc

turing two interrelated pro cesses In H Horacek and M Zo ck editors New Concepts in Natural Language

Generation Planning Realization and Systems Frances Pinter London and New York

Cline and Nutter BE Cline and JT Nutter Conceptual revision for natural language generation

Student session of the th Annual Meeting of the ACL

Contant C Contant Generation automatique de texte application au souslanguage b oursier

MA thesis Universite de Montreal

Dale R Dale Generating Referring Expressions ACLMIT Press Series in Natural Language Pro

cessing Cambridge Ma

Danlos L Danlos The linguistic basis of text generation Studies in Natural Language Pro cessing

Cambridge University Press

Davey A Davey Discourse production Edinburgh University Press

De Smedt and Kemp en K De Smedt and G Kemp en Incremental sentence pro duction self

correction and co ordination In Gerard Kemp en editor Natural Language Generation New Results

in Articial Intel l ligence Psychology and Linguistics Martinus Ninjho Publishers

De Smedt K De Smedt IPF an incremental parallel formulator In R Dale CS Mellish and

M Zo ck editors Current Research in Natural Language Generation Academic Press

Duford D Duford CREP a regular expressionmatching textual corpus to ol Technical Rep ort

CUCS Columbia University

Elhadad and Robin M Elhadad and J Robin Controlling Content Realization with Functional Uni

cation Grammars In R Dale H Hovy D Ro esner and O Sto ck editors Aspects of Automated Natural

Language Generation pages Springler Verlag

Elhadad M Elhadad Typ es in Functional Unication Grammars In Proceedings of the th Annual

Meeting of the Association for Computational Linguistics Detroit MI ACL

Elhadad a M Elhadad FUF The universal unier user manual version Technical Rep ort

CUCS Columbia University

Elhadad b M Elhadad Using argumentation to control lexical choice a unicationbased implemen

tation PhD thesis Computer Science Department Columbia University

Fawcett RP Fawcett The semantics of clause and verb for relational pro cesses in English In MAK

Halliday and RP Fawcett editors New developments in systemic linguistics Frances Pinter London and

New York

Fensch T Fensch The sports writing handbook Lawrence Erlbaum Asso ciates Hillsdale NJ

Fries P Fries Tagmeme sequences in the English noun phrase Numb er in Summer institute of

linguistics publications in linguistics and related elds Benjamin F Elson for The Church Press Inc

Glendale CA

Gabriel R Gabriel Delib erate writing In DD McDonald and Bolc L editors Natural Language

Generation Systems SpringerVerlag New York NY

Gop en and Swan GD Gop en and JA Swan The science of scientic writing American Scientist

Gross M Gross LexiconGrammar and the syntactic analysis of French In Proceedings of the th

International Conference on Computational Linguistics pages COLING

Halliday MAK Halliday An introduction to functional grammar Edward Arnold London

Harbusch K Harbusch Towards an Integrated Generation Approach with TreeAdjoining Grammars

Computational Intel ligence

Hatzivassiloglou and McKeown Vasileios Hatzivassiloglou and Kathleen McKeown Towards the Au

tomatic Identication of Adjectival Scales Clustering Adjectives According to Meaning In Proceedings

of the st Annual Meeting of the ACL pages Columbus Ohio June Asso ciation for

Computational Linguistics

Herskovits A Herskovits Language and spatial cognition an interdisciplinary study of the preposi

tions of English Studies in Natural Language Pro cessing Cambridge University Press

Hirschman and Cuomo L Hirschman and D Cuomo Rep ort from the ARPAWorkshop on Evalua

tion of Human Computer Interfaces Technical Rep ort MP B Mitre Bedford Ma

Hovy E Hovy Generating natural language under pragmatic constraints L Erlbaum Asso ciates

Hillsdale NJ

Hovy E Hovy Unresolved issues in paragraph planning In R Dale CS Mellish and M Zo ck

editors Current Research in Natural Language Generation Academic Press

Hovy E Hovy Approaches to the planning of coherent text In C Paris W Swartout and Mann

WC editors Natutal Language Generation in Articial Intel ligence and Computational Linguistics

Kluwer Academic Publishers Boston

Inui et al K Inui T Tokunaga and H Tanaka Text revision a mo del and its implementation In

R Dale E Hovy D Ro esner and O Sto ck editors Aspects of Automated Natural Language Generation

pages SpringlerVerlag

Iordanska ja et al L Iordanska ja M Kim R Kittredge B Lavoie and A Polguere Generation of

extended bilingual statistical rep orts In Proceedings of the th International Conference on Computa

tional Linguistics COLING

Jacobs P Jacobs PHRED a generator for natural language interfaces Computational Linguistics

Joshi et al AK Joshi LS Levy and M Takahashi Tree Adjunct Grammars Journal of Computer

and System Sciences

Joshi AK Joshi The relevance of TreeAdjoining Grammar to generation In Gerard Kemp en

editor Natural Language Generation New Results in Articial Intel l ligence Psychology and Linguistics

Martinus Ninjho Publishers

Kalita J Kalita Automatically generating natural language rep orts International Journal of Man

Machine Studies

Kay M Kay Functional Grammar In Proceedings of the th Annual Meeting of the Berkeley Lin

guistic Society

Kukich et al K Kukich K McKeown J Shaw J Robin N Morgan and J Phillips Userneeds

analysis and design metho dology for an automated do cument generator In A Zamp olli N Calzolari

and M Palmer editors Current Issues in Computational Linguistics In Honour of Don Walker Kluwer

Academic Press Boston

Kukich K Kukich Know ledgebased report generation a know ledge engineering approach to natural

language report generation PhD thesis University of Pittsburgh

Kukich K Kukich The feasibility of automatic natural language generation rep ort In Proceedings

of the th Hawaii international conference on system sciences

Lenat and Guha DB Lenat and RV Guha Building large know ledge base systems representation

and inference in the Cyc project AddisonWesley

Levi J Levi The syntax and semantics of complex nominals Academic Press

Levin B Levin English verb classes and alternations a preliminary investigation University of

Chicago Press

Lyons J Lyons Semantics Cambridge University Press

Mann and Matthiessen WC Mann and CM Matthiessen NIGEL a systemic grammar for text

generation Technical Rep ort ISIRR ISI Marina del Rey CA

Mann and Mo ore WC Mann and JA Mo ore Computer Generation of Multiparagraph English

Text Computational Linguistics

Mann WC Mann An overview of the PENMAN text generation system Technical Rep ort ISIRR

ISI Marina del Rey CA

Maybury MT Maybury Using discourse fo cus temp oral fo cus and spatial fo cus to generate mul

tisentential text In Proceedings of the th International Workshop on Natural Language Generation

Pittsburgh PA

McCoy et al K McCoy K VijayShanker and G Yang A functional approach to generation with

TAG In Proceedings of the th Annual Meeting of the Association for Computational Linguistics ACL

McDonald and Pustejovsky D McDonald and JD Pustejovsky TAGs as a formalism for generation

In Proceedings of the rd Annual Meeting of the Association for Computational Linguistics ACL

McKeown and Swartout KR McKeown and WR Swartout Language generation and explanation

The Annual Review of Computer Science

McKeown et al K R McKeown M Elhadad Y Fukumoto JG Lim C Lombardi J Robin and

FA Smadja Text generation in COMET In R Dale CS Mellish and M Zo ck editors Current

Research in Natural Language Generation Academic Press

McKeown et al K R McKeown J Robin and M Tanenblatt Tailoring lexical choice to the users

vo cabulary in multimedia explanation generation In Proceedings of the st Annual Meeting of the

Association for Computational Linguistics ACL

McKeown et al K McKeown J Robin and K Kukich Generating concise natural language sum

maries Information Processing and Management To app ear in Sp ecial Issue on Summarization

McKeown K R McKeown Using Discourse Strategies and Focus Constraints to Generate Natural

Language Text Studies in Natural Language Pro cessing Cambridge University Press

Melcuk and Pertsov IA Melcuk and NV Pertsov Surfacesyntax of English a formal model in

the MeaningText Theory Benjamins AmsterdamPhiladelphia

Melcuk and Polguere IA Melcuk and A Polguere A formal lexicon in the MeaningText Theory

Computational Linguistics

Mencher M Mencher News reporting and writing Wm C Brown Publishers Dubuque Iowa

Meteer et al MW Meteer DD McDonald SD Anderson D Forster LS Gay AK Huettner

and P Sibun Mumble Design and implementation Technical Rep ort COINS University of

Massachussets at Amherst Ahmerst Ma

Meteer MW Meteer The generation gap the problem of expressibility in text planning PhD thesis

University of Massachussets at Ahmerst Also available as BBN technical rep ort No

Meteer M Meteer The implications of revisions for natural language generation In C Paris

W Swartout and WC Mann editors Natutal Language Generation in Articial Intel ligence and Com

putational Linguistics Kluwer Academic Publishers Boston

MUC Pro ceedings of the Fourth Message Understanding Conference In Proceedings of the Fourth

Message Understanding Conference DARPA Software and Intelligent Systems Technology Oce

Nogier JF Nogier Un systeme de production de language fonde sur le modele des graphes conceptuels

PhD thesis Universite de Paris VI I

Paris C L Paris The use of explicit user models in text generation tailoring to a users level of

expertise PhD thesis Columbia University Also available as technical rep ort CUCS

Pavard B Pavard La conception de systemes de traitement de texte Intel lectica

Polguere A Polguere Structuration et mise en jeu procedurale dun modele linguistique declaratif

dans un cadre de generation de texte PhD thesis Universite de Montreal Queb ec Canada

Pollard and Sag C Pollard and IA Sag Informationbased Syntax and Semantics Volume vol

ume of CSLI Lecture Notes University of Chicago Press Chicago Il

Quirk et al R Quirk S Greenbaum G Leech and J Svartvik A comprehensive grammar of the

English language Longman

Rau LF Rau Information retrieval in never ending stories In Proceedings of the th National

Conference on Articial Intel ligence pages Washington

Reiter EB Reiter A New Mo del for Lexical Choice for Op enClass Words Computational Intel li

gence Decemb er

Reiter EB Reiter Has a consensus natural language generation architecture app eared and is it

psycholinguistically plausible In Proceedings of the th International Workshop on Natural Language

Generation pages June

Robin J Robin Lexical Choice in Natural Language Generation Technical Rep ort CUCS

Columbia University

Ro esner D Ro esner SEMTEX a text generator for German In Gerard Kemp en editor Natu

ral Language Generation New Results in Articial Intel l ligence Psychology and Linguistics Martinus

Ninjho Publishers

Roth et al S Roth J Mattis and X Mesnard Graphics and natural language as comp onent of

automatic explanation In JW Sullivan and SW Tyler editors Intel ligent userinterfaces ACM press

frontier AddisonWesley

Rubino R Rubino Negotiation feedback and perspective within natural language generation PhD

thesis Computer Science Department University of Pennsylvania

Scott and Souza DR Scott and CS Souza Getting the message across in RSTbased text genera

tion In R Dale CS Mellish and M Zo ck editors Current Research in Natural Language Generation

Academic Press

Shab es et al Y Shab es A Ab eille and AK Joshi Parsing strategies with lexicalized grammars

application to tree adjoining grammars In Proceedings of the th International Conference on Compu

tational Linguistics Karl Marx University of Economics Budap est COLING

Shieb er and Shab es SM Shieb er and Y Shab es Generation and synchronous treeadhoining gram

mars Computational Intel ligence Decemb er

Simonin N Simonin Essai de mo delisation de lexp ertise en redaction de textes In Proceedings of

COGNITIVA COGNITIVA

Skaliar D Skaliar A FUF interpreter based on graph representations Technical rep ort Ben Gurion

University of the Negev Beer Sheva Israel

Smadja and McKeown FA Smadja and KR McKeown Using collo cations for language generation

Computational Intel ligence Decemb er

Smadja F Smadja Retrieving Col locational Know ledge from Textual Corpora An Application Lan

guage Generation PhD thesis Computer Science Department Columbia University

Strzalkowski T Ed Strzalkowski Reversible grammars in natural language processing Kluwer

Academic Publishers Boston

Talmy L Talmy Semantic causative typ es In M Shibatani editor The grammar of causative

constructions volume of Syntax and semantics Academic Press London

Talmy L Talmy How language structures space In HL Pick and LP Acredolo editors Spatial

orientation theory research and application Plenum Press New YorkLondon

Talmy L Talmy Lexicalization patterns semantic structure in lexical form In T Shop en edi

tor Grammatical categories and the lexicon volume of Language typology and syntactic description

Cambridge University Press

Thompson and Longacre SA Thompson and RE Longacre Adverbial clauses In T Shop en editor

Complex constructions volume of Language typology and syntactic description Cambridge University

Press

Vaughan and McDonald M Vaughan and DD McDonald A Mo del of Revision in Natural Language

Generation In Proceedings of the rd Annual Meeting of the Association for Computational Linguistics

Columbia University New York ACL

Vendler Zeno Vendler Adjectives and Nominalizations Mouton The Hague Paris

VijayShanker K VijayShanker Using descriptions of trees in a tree adjoining grammar Computa

tional Linguistics

Wahlster et al W Wahlster E Andre W Finkler HJ Protlich and T Rist Planbased integra

tion of natural language and graphics generation Articial Intel ligence

Winograd T Winograd Language as a cognitive process AddisonWesley

Wong and Simmons WKC Wong and RF Simmons A blackb oard mo del for text pro duction with

revision In Proceedings of the AAAI workshop on textplanning and realization StPaul MN AAAI

Yang et al G Yang K McCoy and K VijayShanker From functional sp ecication to syntactic

structure systemic grammar and tree adjoining grammars Computational Intel ligence Decemb er

Yazdani M Yazdani Reviewing as a comp onent of the text generation pro cess In Gerard Kemp en

editor Natural Language Generation New Results in Articial Intel l ligence Psychology and Linguistics

Martinus Ninjho Publishers

Zo ck et al M Zo ck M Kay D Carcagno JF Nogier M Nossin K DeSmedt and F Namer

Automatic text generation a to ol for the buisiness world In Proceedings of the Annual Expert Systems

Conference Avignon France

Zo ck M Zo ck Natural languages are exible to ols thats what makes them hard to explain to

learn and to use In M Zo ck and G Sabah editors Advances in Natural Language Generation an

Interdisciplinary Perspective Pinter and Ablex