Automatic Generation of Multilingual
Total Page:16
File Type:pdf, Size:1020Kb
AUTOMATIC GENERATION OF MULTILINGUAL SPORTS SUMMARIES by Fahim Hasan B.Sc. in C.S.E., BRAC University (Bangladesh), 2007 THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE In the School of Computing Science © Fahim Hasan 2011 SIMON FRASER UNIVERSITY Summer 2011 All rights reserved. However, in accordance with the Copyright Act of Canada , this work may be reproduced, without authorization, under the conditions for Fair Dealing . Therefore, limited reproduction of this work for the purposes of private study, research, criticism, review and news reporting is likely to be in accordance with the law, particularly if cited appropriately. APPROVAL Name: Fahim Hasan Degree: Master of Science Title of Thesis: Automatic Generation of Multilingual Sports Summaries Examining Committee: Chair: Dr. Richard T. Vaughan Associate Professor – Computing Science ______________________________________ Dr. Fred Popowich Senior Supervisor Professor – Computing Science ______________________________________ Dr. Veronica Dahl Supervisor Professor – Computing Science ______________________________________ Dr. Anoop Sarkar Internal Examiner Associate Professor – Computing Science Date Defended/Approved: May 31, 2011 ii Declaration of Partial Copyright Licence The author, whose copyright is declared on the title page of this work, has granted to Simon Fraser University the right to lend this thesis, project or extended essay to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users. The author has further granted permission to Simon Fraser University to keep or make a digital copy for use in its circulating collection (currently available to the public at the “Institutional Repository” link of the SFU Library website <www.lib.sfu.ca> at: <http://ir.lib.sfu.ca/handle/1892/112>) and, without changing the content, to translate the thesis/project or extended essays, if technically possible, to any medium or format for the purpose of preservation of the digital work. The author has further agreed that permission for multiple copying of this work for scholarly purposes may be granted by either the author or the Dean of Graduate Studies. It is understood that copying or publication of this work for financial gain shall not be allowed without the author’s written permission. Permission for public performance, or limited permission for private scholarly use, of any multimedia materials forming part of this work, may have been granted by the author. This information may be found on the separately catalogued multimedia material and in the signed Partial Copyright Licence. While licensing SFU to permit the above uses, the author retains copyright in the thesis, project or extended essays, including the right to change the work for subsequent purposes, including editing and publishing the work in whole or in part, and licensing other parties, as the author may desire. The original Partial Copyright Licence attesting to these terms, and signed by this author, may be found in the original bound copy of this work, retained in the Simon Fraser University Archive. Simon Fraser University Library Burnaby, BC, Canada Last revision: Spring 09 ABSTRACT Natural Language Generation is a subfield of Natural Language Processing, which is concerned with automatically creating human readable text from non-linguistic forms of information. A template-based approach to Natural Language Generation utilizes base formats for different types of sentences, which are subsequently transformed to create the final readable forms of the output. In this thesis, we investigate the suitability of a template-based approach to multilingual Natural Language Generation of sports summaries. We implement a system to generate English and Bangla summaries making use of a pipelined architecture to transform data in multiple stages. Additionally, we demonstrate how the automatically generated summaries differ from human generated summaries. We show that by using a template-based approach the system can generate acceptable output in multiple languages without requiring detailed grammatical knowledge, which is important for languages such as Bangla where computational resources are still scarce. Keywords: Natural Language Processing; Natural Language Generation; Bangla; Template; Pipeline. iii DEDICATION To my family iv ACKNOWLEDGEMENTS I would like to thank my supervisors, Dr. Fred Popowich and Dr. Veronica Dahl for their guidance and motivational comments on my work. I also thank my family and friends for their support. v TABLE OF CONTENTS Approval.......................................................................................................................... ii Abstract...........................................................................................................................iii Dedication ...................................................................................................................... iv Acknowledgements ......................................................................................................... v Table of Contents........................................................................................................... vi List of Figures................................................................................................................viii List of Tables.................................................................................................................. ix Chapter 1: Introduction.................................................................................................1 1.1 NLG Systems..........................................................................................................2 1.2 Motivation................................................................................................................3 1.3 Contributions...........................................................................................................5 1.4 Organization............................................................................................................6 Chapter 2: Related Work...............................................................................................8 2.1 Shallow versus In-Depth Generation Techniques....................................................8 2.2 Template Based Approaches ................................................................................10 2.3 Other Approaches.................................................................................................22 2.4 Chapter Summary.................................................................................................32 Chapter 3: The Generation System............................................................................33 3.1 Overview...............................................................................................................33 3.2 Input Conversion...................................................................................................44 3.3 Pre-Processing .....................................................................................................48 3.4 Content Selection..................................................................................................48 3.4.1 Selection Rules for Batting Items...............................................................51 3.4.2 Selection Rules for Bowling Items..............................................................52 3.4.3 Selection Rules for Catching Items............................................................53 3.5 Aggregation...........................................................................................................54 3.6 Surface Realization ...............................................................................................56 3.6.1 Lexicon......................................................................................................57 3.6.2 Templates..................................................................................................58 3.6.3 Realization.................................................................................................61 3.7 Post-Processing....................................................................................................65 3.8 Implementation Details..........................................................................................65 3.9 Application of the Gricean Maxims ........................................................................66 3.10 Chapter Summary.................................................................................................68 vi Chapter 4: Performance Demonstration....................................................................69 4.1 Methodology .........................................................................................................69 4.2 Results..................................................................................................................75 4.3 Discussion.............................................................................................................81 Chapter 5: Conclusion ................................................................................................84 5.1 Summary...............................................................................................................84