Download File
Total Page:16
File Type:pdf, Size:1020Kb
RevisionBased Generation of Natural Language Summaries Providing Historical Background CorpusBased Analysis Design Implementation and Evaluation Jacques Robin Technical Rep ort CUCS Submitted in partial fulllment of the requirements for the degree of Do ctor of Philosophy in the Graduate Scho ol of Arts and Sciences Columbia University Decemb er c Jacques Robin All Rights Reserved Abstract RevisionBased Generation of Natural Language Summaries Providing Historical Background CorpusBased Analysis Design Implementation and Evaluation Jacques Robin Automatically summarizing vast amounts of online quantitative data with a short natural language para graph has a wide range of realworld applications However this sp ecic task raises a numb er of dicult issues that are quite distinct from the generic task of language generation conciseness complex sentences oating concepts historical background paraphrasing p ower and implicit content In this thesis I address these sp ecic issues by prop osing a new generation mo del in which a rst pass builds a draft containing only the essential new facts to rep ort and a second pass incrementally revises this draft to opportunistical ly add as many background facts as can t within the space limit This mo del requires a new typ e of linguistic knowledge revision op erations which sp ecifyies the various ways a draft can b e transformed in order to concisely accommo date a new piece of information I present an indepth corpus analysis of humanwritten sp orts summaries that resulted in an extensive set of such revision op erations I also present the implementation based on functional unication grammars of the system streak which relies on these op erations to incrementally generate complex sentences summarizing basketball games This thesis also contains two quantitative evaluations The rst shows that the new revisionbased generation mo del is far more robust than the onepass mo del of previous generators The second evaluation demonstrates that the revision op erations acquired during the corpus analysis and implemented in streak are for the most part p ortable to at least one other quantitative domain the sto ck market streak is the rst rep ort generator that systematically places the facts which it summarizes in their historical p ersp ective It is more concise than previous systems thanks to its ability to generate more complex sentences and to opp ortunistically convey facts by adding a few words to carefully chosen draft constituents The revision op erations on which streak is based constitute the rst set of corpusbased linguistic knowledge geared towards incremental generation The evaluation presented in this thesis is also the rst attempt to quantitatively assess the robustness of a new generation mo del and the p ortability of a new typ e of linguistic knowledge Contents Intro duction Starting p oints The need for summary rep ort generation Summarization issues Asp ects of the research An opp ortunistic generation mo del A corpus analysis to acquire revision rules An implemented revisionbased generator A quantitative evaluation of the opp ortunistic mo del and revision rules Contributions Guide to the rest of the thesis Corpus analysis of newswire rep orts Motivation and metho dology Fo cusing on a subcorpus Rep ort structure and discursive fo cus Content typ es and semantic fo cus Domain ontology Corpus sentence structures Toplevel structure concept clustering Internal cluster structure realization patterns Revision to ols Dierential analysis of realization patterns identifying revision to ols Classifying revision op erations Summary A new generation architecture Text generation subtasks Motivation for a new architecture From corpus observations to needed abilities i From needed abilities to design principles A revisionbased architecture for incremental generation Internal utterance representation Pro cessing comp onents and overall control Generation subtasks revisited Summary Implementation prototyp e the STREAK generator Goals and scop e of the implementation The internal representation languages The DSS language The SSS language The draft representation The pro cessing comp onents The phrase planner The lexicalizer The reviser STREAK at work two example runs Chaining revisions to generate a complex sentence from a simple draft Choice of revision rule constrained by the surface form of the draft STREAK by the numb ers Evaluation Evaluation and language generation The diculty of the issue The rising interest in the issue The diversity of the issue The evaluation approach of this thesis Quantitative evaluation of robustness Review of acquisition corpus results Evaluation goals and test corp ora Evaluation parameters Partially automating the evaluation Results Quantitative evaluation of p ortability Starting p oint Metho dology Results ii Related work Related work in summary rep ort generation Ana Systems based on the MeaningText Theory Other summary rep ort generators Comparison with STREAK Related work in revisionbased generation Yh weiveR At what level to p erform revision Revising to satisfy what goals Related work in incremental generation Incremental generation and Tree Adjoining Grammars Incremental generation in IPF Related work in evaluation for generation Conclusion Contributions Contributions to summary rep ort generation Contributions to generation architecture Contributions to revisionbased generation and incremental generation Contributions to evaluation in generation Contributions to SURGE Limitations and future work Limitations linked to the use of FUF .