Distributional Semantics for Robust Automatic Summarization


Distributional Semantics for Robust Automatic Summarization

by Jackie Chi Kit Cheung

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy, Graduate Department of Computer Science, University of Toronto. Copyright © 2014 by Jackie Chi Kit Cheung.

Abstract

Large text collections are an important resource of information about the world, containing everything from movie reviews and research papers to news articles about current events. Yet the sheer size of such collections presents a challenge for applications to make sense of this data and present it to users. Automatic summarization is one potential solution, which aims to shorten one or more source documents while retaining the important information. Summarization is a complex task that requires inferences about the form and content of the summary using a semantic model.

This dissertation examines the feasibility of distributional semantics as the core semantic representation for automatic summarization. In distributional semantics, the meanings of words and phrases are modelled by the contexts in which they appear. These models are easy to train and have found successful applications, but they have until recently not been seriously considered as contenders to support semantic inference for complex NLP tasks such as summarization because of a lack of evaluation methods that would demonstrate their benefit. I argue that current automatic summarization systems avoid relying on semantic analysis by focusing instead on replicating the source text to be summarized, but that substantial progress will not be possible without semantic analysis and domain knowledge acquisition.
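The distributional hypothesis described above can be made concrete with a minimal sketch (illustrative Python, not code from the thesis): represent each word by counts of the words that co-occur with it in a small window, then compare words with cosine similarity. Words used in similar contexts, such as "cat" and "dog" in the toy corpus below, end up closer together than words used in different contexts.

```python
from collections import Counter
from math import sqrt

def context_vectors(corpus, window=2):
    """Build distributional vectors: each word is represented by
    counts of the words co-occurring with it within a window."""
    vectors = {}
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            ctx = vectors.setdefault(word, Counter())
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    ctx[tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

corpus = [
    "the cat chased the mouse",
    "the dog chased the mouse",
    "the cat ate the fish",
    "the dog ate the bone",
]
vecs = context_vectors(corpus)
# "cat" and "dog" occur in near-identical contexts here, so their
# similarity exceeds that of "cat" and "mouse".
print(cosine(vecs["cat"], vecs["dog"]), cosine(vecs["cat"], vecs["mouse"]))
```

Real distributional models differ in scale and in their weighting of co-occurrence counts, but the underlying idea, meaning from context, is the same.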
To overcome these problems, I propose an evaluation framework for distributional semantics based on first principles about the role of a semantic formalism in supporting inference. My experiments show that current distributional semantic approaches can support semantic inference at a phrasal level, invariant to the constituent syntactic constructions, better than a word overlap baseline.

Then, I present a novel technique to embed distributional semantic vectors into a generative probabilistic model for domain modelling. This model achieves state-of-the-art results in slot induction, which also translates into better summarization performance. Finally, I introduce a text-to-text generation technique called sentence enhancement that combines parts of heterogeneous source text sentences into a novel sentence, resulting in more informative and grammatical summary sentences than a previous sentence fusion approach. The success of this approach relies crucially on distributional semantics in order to determine which parts may be combined. These results lay the groundwork for the development of future distributional semantic models, and demonstrate their utility in determining the form and content of automatic summaries.

Acknowledgements

Graduate school has been a memorable journey, and I am very fortunate to have met some wonderful people along the way. First of all, I'd like to thank my advisor, Gerald Penn, who has been extremely patient and supportive. He let me pursue the research topics that I fancied while providing guidance to my research and career. After all this time, I am still amazed by the depth and breadth of his research vision, and his acumen in selecting meaningful research problems. I can only hope that enough of his taste in research and his sense of humour has rubbed off on me. I'd also like to thank the other members of my committee. Graeme Hirst has been an influential mentor.
His insightful questions and comments have caused me to reflect on the nature of computational linguistics research. Hector Levesque brought in a much needed outside perspective from knowledge representation. I'd like to thank Suzanne Stevenson for interesting discussions in seminars, meetings, and at my defense. My external examiner, Mirella Lapata, has been a great inspiration for my research, and provided valuable feedback on this dissertation. During my degree, I spent two highly productive terms at Microsoft Research Redmond in the Speech and Natural Language Processing groups. I'd like to thank my internship mentors Xiao Li, Hoifung Poon, and Lucy Vanderwende for fruitful discussions and collaborations, and the other members of the Speech and NLP groups for lively conversations over lunch. I am grateful to have been supported by the Natural Sciences and Engineering Research Council of Canada and by a Facebook PhD Fellowship.

Graduate student life would have been drab without the enrichment of colleagues and friends. The Department of Computer Science and the Computational Linguistics group in particular are full of smart and friendly people who filled my days with joy. Jorge Aranda got me hooked on board games. Giovanna Thron and Siavosh Benabbas threw awesome parties. Jono Lung and Aditya Bhargava taught me about finance. Elizabeth Patitsas is a good friend who owns two cats and is very fit. Andrew Perrault introduced me to climbing. Erin Delisle will do a great job organizing games. Aida Nematzadeh and Amin Tootoonchian are the kindest people that I know, and put up with my attempts to learn Farsi. George Dahl keeps me (and the CL community) honest. The lunch crew (Afsaneh Fazly, Libby Barak, Katie Fraser, Varada Kolhatkar, and Nona Naderi) accepted me into their fold. Michael Guerzhoy is a conspirator in further bouts of not-working. Yuval Filmus is brilliant and has a good perspective on life.
Timothy Fowler, Eric Corlett, and Patricia Araujo Thaine are great meeting and dinner buddies. Dai Tri Le Man and Sergey Gorbunov kept my appreciation of music alive. Dan Lister, Sam Kim, and Mischa Auerbach-Ziogas made for formidable Agricola opponents. There are many others both inside and outside of the department who I am lucky to have met. I am also lucky to have "old friends" from Vancouver like Hung Yao Tjia, Muneori Otaka, Joseph Leung, Rory Harris and Sandra Yuen to catch up with.

Finally, my family is my rock. My brother Vincent, my sister Joyce and her new family help me put things in perspective. My parents, Cecilia and David, are always there to support me through all of my hopes and dreams, whining and boasting, failures and successes. They remind me that life is more than just about research, and are the foundation that gives me confidence to lead my life.

To my parents.

Contents

1 Introduction
  1.1 Two Approaches to Semantics
  1.2 Semantics in Automatic Summarization
  1.3 Thesis Statement
  1.4 Dissertation Objectives
  1.5 Structure of the Dissertation
    1.5.1 Peer-Reviewed Publications
2 Centrality in Automatic Summarization
  2.1 The Design and Evaluation of Summarization Systems
    2.1.1 Classification of Summaries
    2.1.2 Steps in Summarization
    2.1.3 Summarization Evaluation
  2.2 The Assumption of Centrality
    2.2.1 Text Properties Important in Summarization
    2.2.2 Cognitive Determinants of Interest, Surprise, and Memorability
    2.2.3 Centrality in Content Selection
    2.2.4 Abstractive Summarization
  2.3 Summarizing Remarks on Summarization
3 A Case for Domain Knowledge in Automatic Summarization
  3.1 Related Work
  3.2 Theoretical Basis of the Analysis
    3.2.1 Caseframe Similarity
    3.2.2 An Example
  3.3 Experiments
    3.3.1 Study 1: Sentence Aggregation
    3.3.2 Study 2: Signature Caseframe Density
    3.3.3 Study 3: Summary Reconstruction
  3.4 Why Source-External Elements?
    3.4.1 Study 4: Provenance Study
    3.4.2 Study 5: Domain Study
  3.5 Summary and Discussion
4 Compositional Distributional Semantics
  4.1 Compositionality and Co-Compositionality in Distributional Semantics
    4.1.1 Several Distributional Semantic Models
    4.1.2 Other Distributional Models
  4.2 Evaluating Distributional Semantics for Inference
    4.2.1 Existing Evaluations
    4.2.2 An Evaluation Framework
    4.2.3 Task 1: Relation Classification
    4.2.4 Task 2: Restricted QA
  4.3 Experiments
    4.3.1 Task 1
    4.3.2 Task 2
  4.4 Conclusions
5 Distributional Semantic Hidden Markov Models
  5.1 Related Work
  5.2 Distributional Semantic Hidden Markov Models
    5.2.1 Contextualization
    5.2.2 Training and Inference
    5.2.3 Summary and Generative Process
  5.3 Experiments
  5.4 Guided Summarization Slot Induction
  5.5 Multi-document Summarization: An Extrinsic Evaluation
    5.5.1 A KL-based Criterion
    5.5.2 Supervised Learning
    5.5.3 Method and Results
  5.6 Discussion
6 Sentence Enhancement for Automatic Summarization
  6.1 Sentence Revision for Abstractive Summarization
  6.2 Related Work
  6.3 A Sentence Enhancement Algorithm
    6.3.1 Sentence Graph Creation
    6.3.2 Sentence Graph Expansion
    6.3.3 Tree Generation
    6.3.4 Linearization
  6.4 Experiments
    6.4.1 Method
    6.4.2 Results and Discussion
  6.5 Discussion
  6.6 Appendix: ILP Encoding
    6.6.1 Informativeness Score
    6.6.2 Objective Function
    6.6.3 Syntactic Constraints
    6.6.4 Semantic Constraints
7 Conclusion
  7.1 Summary of Contributions
    7.1.1 Distributional Semantics
    7.1.2 Abstractive Summarization
  7.2 Limitations and Future Work
    7.2.1 Distributional Semantics in Probabilistic Models
    7.2.2 Abstractive Summarization
Bibliography

List of Tables

3.1 A sentence decomposed into its dependency edges, and the caseframes derived from those edges that are considered.
Recommended publications
  • Automatic Summarization of Student Course Feedback
    Automatic Summarization of Student Course Feedback
    Wencan Luo†, Fei Liu‡, Zitao Liu†, Diane Litman†
    †University of Pittsburgh, Pittsburgh, PA 15260
    ‡University of Central Florida, Orlando, FL 32716
    {wencan, ztliu, litman}@cs.pitt.edu, [email protected]

    Abstract: Student course feedback is generated daily in both classrooms and online course discussion forums. Traditionally, instructors manually analyze these responses in a costly manner. In this work, we propose a new approach to summarizing student course feedback based on the integer linear programming (ILP) framework. Our approach allows different student responses to share co-occurrence statistics and alleviates sparsity issues. Experimental results on a student feedback corpus show that our approach outperforms a range of baselines in terms of both ROUGE scores and human evaluation.

    Table 1: Example student responses and a reference summary created by the teaching assistant. 'S1'–'S8' are student IDs.
    Prompt: Describe what you found most interesting in today's class
    Student Responses:
    S1: The main topics of this course seem interesting and correspond with my major (Chemical engineering)
    S2: I found the group activity most interesting
    S3: Process that make materials
    S4: I found the properties of bike elements to be most interesting
    S5: How materials are manufactured
    S6: Finding out what we will learn in this class was interesting to me
    S7: The activity with the bicycle parts
    S8: "part of a bike" activity
    ... (rest omitted, 53 responses in total.)
    Reference Summary:
    - group activity of analyzing bicycle's parts
    - materials processing
    - the main topic of this course
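ILP-based summarization of the kind mentioned above is commonly framed as covering the most highly weighted "concepts" within a length budget. The sketch below is a minimal illustration of that objective, not the paper's actual model: bigrams stand in for concepts, concept weights are occurrence counts, and exhaustive search over subsets stands in for an ILP solver (which computes the same optimum far more efficiently on real inputs).

```python
from itertools import combinations

def bigrams(text):
    toks = text.lower().split()
    return set(zip(toks, toks[1:]))

def ilp_style_summary(responses, budget=12):
    # Weight each concept (bigram) by how many responses mention it.
    weights = {}
    for r in responses:
        for b in bigrams(r):
            weights[b] = weights.get(b, 0) + 1
    # Choose the subset of responses maximizing total weight of covered
    # concepts, subject to a word budget.  A real system hands this
    # objective to an ILP solver; brute force suffices for a toy input.
    best, best_score = [], -1
    for k in range(1, len(responses) + 1):
        for subset in combinations(responses, k):
            if sum(len(r.split()) for r in subset) > budget:
                continue
            covered = set().union(*(bigrams(r) for r in subset))
            score = sum(weights[b] for b in covered)
            if score > best_score:
                best, best_score = list(subset), score
    return best

responses = [
    "the group activity was interesting",
    "i liked the group activity",
    "how materials are manufactured",
]
summary = ilp_style_summary(responses, budget=9)
```

Because the two "group activity" responses cover largely the same concepts, the optimizer keeps only one of them and spends the remaining budget on the distinct "materials" response, which is exactly the redundancy-avoiding behaviour the ILP objective is designed to produce.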
  • Using N-Grams to Understand the Nature of Summaries
    Using N-Grams to Understand the Nature of Summaries
    Michele Banko and Lucy Vanderwende
    One Microsoft Way, Redmond, WA 98052
    {mbanko, lucyv}@microsoft.com

    Abstract: Although single-document summarization is a well-studied task, the nature of multi-document summarization is only beginning to be studied in detail. While close attention has been paid to what technologies are necessary when moving from single to multi-document summarization, the properties of human-written multi-document summaries have not been quantified. In this paper, we empirically characterize human-written summaries provided in a widely used summarization corpus by attempting to answer the questions: Can multi-document summaries that are written by humans be characterized as extractive or generative? Are multi-document summaries less extractive than single-document summaries?

    From the introduction: … views of the event being described over different documents, or present a high-level view of an event that is not explicitly reflected in any single document. A multi-document summary may also indicate the presence of new or distinct information contained within a set of documents describing the same topic (McKeown et al., 1999; Mani and Bloedorn, 1999). To meet these expectations, a multi-document summary is required to generalize, condense and merge information coming from multiple sources. Although single-document summarization is a well-studied task (see Mani and Maybury, 1999 for an overview), multi-document summarization is only recently being studied closely (Marcu & Gerber 2001). While close attention has been paid to multi-document summarization technologies (Barzilay et al. 2002; Goldstein et al. 2000), the inherent properties of human-written multi-document summaries have not yet been quantified.
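The extractive-versus-generative question above is often operationalized by measuring how many of a summary's n-grams are copied verbatim from the source. A minimal sketch of that measurement (illustrative, not the paper's exact methodology):

```python
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def extractiveness(summary, source, n=2):
    # Fraction of the summary's n-grams that also occur in the source:
    # 1.0 means pure cut-and-paste; lower values indicate novel phrasing.
    s = ngrams(summary.lower().split(), n)
    d = ngrams(source.lower().split(), n)
    return len(s & d) / len(s) if s else 0.0

source = "the storm hit the coast on monday causing severe flooding"
extract = "the storm hit the coast"            # copied verbatim
abstract = "a monday storm flooded the coast"  # reworded by a human
print(extractiveness(extract, source))   # 1.0 for the verbatim extract
print(extractiveness(abstract, source))  # much lower for the rewording
```

Averaging such scores over a corpus of human summaries is one way to quantify how extractive multi-document summaries are relative to single-document ones.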
  • Automatic Summarization of Medical Conversations, a Review Jessica Lopez
    Automatic Summarization of Medical Conversations, a Review
    Jessica López Espejel 1, 2
    (1) CEA, LIST, DIASI, F-91191 Gif-sur-Yvette, France.
    (2) Paris 13 University, LIPN, 93430 Villetaneuse, France.
    [email protected]

    To cite this version: Jessica Lopez. Automatic summarization of medical conversations, a review. TALN-RECITAL 2019 - PFIA 2019, Jul 2019, Toulouse, France. pp. 487-498. HAL Id: hal-02611210, https://hal.archives-ouvertes.fr/hal-02611210, submitted on 30 May 2020.

    Abstract (translated from the French): Conversation analysis plays an important role in the development of simulation devices for the training of health professionals (doctors, nurses). Our objective is to develop an original automatic summarization method for medical conversations between a patient and a health professional, based on recent advances in summarization using convolutional and recurrent neural networks. The proposed method must be adapted to the specific problems of dialogue summarization. This article presents a review of the different methods for extractive and abstractive summarization, and for dialogue analysis.
  • Exploring Sentence Vector Spaces Through Automatic Summarization
    Exploring Sentence Vector Spaces through Automatic Summarization
    Anonymous authors. Under review as a conference paper at ICLR 2018 (double-blind).

    Abstract: Vector semantics, especially sentence vectors, have recently been used successfully in many areas of natural language processing. However, relatively little work has explored the internal structure and properties of spaces of sentence vectors. In this paper, we will explore the properties of sentence vectors by studying a particular real-world application: automatic summarization. In particular, we show that cosine similarity between sentence vectors and document vectors is strongly correlated with sentence importance and that vector semantics can identify and correct gaps between the sentences chosen so far and the document. In addition, we identify specific dimensions which are linked to effective summaries. To our knowledge, this is the first time specific dimensions of sentence embeddings have been connected to sentence properties. We also compare the features of different methods of sentence embeddings. Many of these insights have applications in uses of sentence embeddings far beyond summarization.

    1 Introduction: Vector semantics has been growing in popularity for many natural language processing applications. Vector semantics attempts to represent words as vectors in a high-dimensional space, where vectors which are close to each other have similar meanings. Various models of vector semantics have been proposed, such as LSA (Landauer & Dumais, 1997), word2vec (Mikolov et al., 2013), and GloVe (Pennington et al., 2014), and these models have proved to be successful in other natural language processing applications. While these models work well for individual words, producing equivalent vectors for sentences or documents has proven to be more difficult.
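The paper's central observation, that cosine similarity between a sentence vector and the document vector tracks sentence importance, can be sketched in a few lines. In this minimal illustration, simple bag-of-words count vectors stand in for the learned sentence embeddings the paper studies; the ranking logic is the same.

```python
from collections import Counter
from math import sqrt

def vec(text):
    """Bag-of-words count vector (a stand-in for a learned embedding)."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = sqrt(sum(c * c for c in u.values()))
    nv = sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_by_centrality(sentences):
    """Score each sentence by cosine similarity to the whole-document
    vector; higher scores mark more central (important) sentences."""
    doc = vec(" ".join(sentences))
    return sorted(sentences, key=lambda s: cosine(vec(s), doc), reverse=True)

sentences = [
    "the senate passed the budget bill on tuesday",
    "the budget bill cuts spending on several programs",
    "reporters gathered outside in the rain",
]
ranked = rank_by_centrality(sentences)
# The off-topic third sentence ranks last: it shares little vocabulary
# with the document as a whole.
```

With real sentence embeddings the same ranking procedure applies; only `vec` changes.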
  • Keyphrase Based Evaluation of Automatic Text Summarization
    Keyphrase Based Evaluation of Automatic Text Summarization
    Fatma Elghannam (Electronics Research Institute, Cairo, Egypt), Tarek El-Shishtawy (Faculty of Computers and Information, Benha University, Benha, Egypt)
    International Journal of Computer Applications (0975 – 8887), Volume 117 – No. 7, May 2015

    Abstract: The development of methods to deal with the informative contents of the text units in the matching process is a major challenge in automatic summary evaluation systems that use fixed n-gram matching. This limitation causes inaccurate matching between units in a peer and reference summaries. The present study introduces a new Keyphrase based Summary Evaluator (KpEval) for evaluating automatic summaries. KpEval relies on keyphrases since they convey the most important concepts of a text. In the evaluation process, the keyphrases are used in their lemma form as the matching text unit. The system was applied to evaluate different summaries of an Arabic multi-document data set presented at TAC2011.

    The idea of KpEval is to count the matches between the peer summary and reference summaries for the essential parts of the summary text. KpEval has three main modules: i) a lemma extractor that breaks the text into words and extracts their lemma forms and the associated lexical and syntactic features; ii) a keyphrase extractor that extracts important keyphrases in their lemma forms; and iii) the evaluator, which scores the summary by counting the matched keyphrases occurring between the peer summary and one or more reference summaries. The remainder of this paper is organized as follows: Section 2 reviews previous work; Section 3 describes the proposed keyphrase based summary evaluator; Section 4 discusses the performance evaluation; and Section 5 …
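The evaluator module described above can be sketched as a keyphrase-level recall: the fraction of reference keyphrases (compared in lemma form) that the peer summary contains. This is a minimal illustration with made-up keyphrases; KpEval's actual scoring and lemmatization pipeline may differ.

```python
def kp_score(peer_keyphrases, reference_keyphrases):
    """Keyphrase-level recall: fraction of reference keyphrases
    (already in lemma form) matched by the peer summary."""
    ref = set(reference_keyphrases)
    matched = set(peer_keyphrases) & ref
    return len(matched) / len(ref) if ref else 0.0

# Hypothetical keyphrases extracted from a reference and a peer summary.
reference = ["budget bill", "spending cut", "senate vote"]
peer = ["budget bill", "senate vote", "press conference"]
score = kp_score(peer, reference)  # 2 of the 3 reference keyphrases match
```

Unlike fixed n-gram matching, a summary is credited here only for matching the concept-bearing units, which is the motivation the abstract gives for keyphrase-based evaluation.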
  • Latent Semantic Analysis and the Construction of Coherent Extracts
    Latent Semantic Analysis and the Construction of Coherent Extracts
    Tristan Miller
    German Research Center for Artificial Intelligence
    Erwin-Schrödinger-Straße 57, D-67663 Kaiserslautern
    [email protected]

    Keywords: automatic summarization, latent semantic analysis, LSA, coherence, extracts

    Abstract: We describe a language-neutral automatic summarization system which aims to produce coherent extracts. It builds an initial extract composed solely of topic sentences, and then recursively fills in the topical lacunae by providing linking material between semantically dissimilar sentences. While experiments with human judges did not prove a statistically significant increase in textual coherence with the use of a latent semantic analysis module, we found a strong positive correlation between coherence and overall summary quality.

    1 Introduction: A major problem with automatically-produced summaries in general, and extracts in particular, is that the output text often lacks fluency and organization. […] Many of these techniques are tied to a particular language or require resources such as a list of discourse keywords and a manually marked-up corpus; others are constrained in the type of summary they can generate (e.g., general-purpose vs. query-focussed). In this paper, we present a new, recursive method for automatic text summarization which aims to preserve both the topic coverage and the coherence of the source document, yet has minimal reliance on language-specific NLP tools. Only word- and sentence-boundary detection routines are required. The system produces general-purpose extracts of single documents, though it should not be difficult to adapt the technique to query-focussed summarization, and may also be of use in improving the coherence of multi-document summaries.

    2 Latent semantic analysis: Our system fits within the general category of IR-based systems, but rather than comparing text with the standard vector-space model, we employ latent semantic analysis (LSA) [Deerwester et al., …]
  • Leveraging Word Embeddings for Spoken Document Summarization
    Leveraging Word Embeddings for Spoken Document Summarization
    Kuan-Yu Chen*†, Shih-Hung Liu*, Hsin-Min Wang*, Berlin Chen#, Hsin-Hsi Chen†
    *Institute of Information Science, Academia Sinica, Taiwan
    #National Taiwan Normal University, Taiwan
    †National Taiwan University, Taiwan
    {kychen, journey, whm}@iis.sinica.edu.tw, [email protected], [email protected]

    Abstract: Owing to the rapidly growing multimedia content available on the Internet, extractive spoken document summarization, with the purpose of automatically selecting a set of representative sentences from a spoken document to concisely express the most important theme of the document, has been an active area of research and experimentation. On the other hand, word embedding has emerged as a newly favorite research subject because of its excellent performance in many natural language processing (NLP)-related tasks. However, as far as we are aware, there are relatively few studies investigating its use in extractive text or speech summarization.

    From the introduction: … without human annotations involved. Popular methods include the vector space model (VSM) [9], the latent semantic analysis (LSA) method [9], the Markov random walk (MRW) method [10], the maximum marginal relevance (MMR) method [11], the sentence significant score method [12], the unigram language model-based (ULM) method [4], the LexRank method [13], the submodularity-based method [14], and the integer linear programming (ILP) method [15]. Statistical features may include the term (word) frequency, linguistic score, recognition confidence measure, and prosodic information. In contrast, supervised sentence classification methods, such as the Gaussian mixture model (GMM) [9], the Bayesian classifier (BC) [16], the support vector machine (SVM) [17], and the conditional random fields (CRFs) [18], usually formulate sentence selection as a binary classification problem. A common thread of leveraging word embeddings in the summarization process is to represent the document (or sentence) by averaging the word embeddings of the words occurring in the document (or sentence).
  • Automatic Summarization and Readability
    Automatic Summarization and Readability
    Cognitive Science Master Thesis, LIU-IDA/KOGVET-A–11/004–SE
    Author: Christian Smith ([email protected])
    Supervisor: Arne Jönsson ([email protected])

    List of Figures:
    2.1 A simplified graph where sentences are linked and weighted according to the cosine values between them.
    3.1 Evaluation of summaries on different dimensionalities. The X-axis denotes different dimensionalities; the Y-axis plots the mean value from evaluations on several seeds at each dimensionality.
    3.2 The iterations of PageRank. The figure depicts the ranks of the sentences plotted on the Y-axis and the iterations on the X-axis. Each series represents a sentence.
    3.3 The iterations of PageRank in different dimensionalities of the RI-space. The figure depicts 4 different graphs, each representing the trial of a specific setting of the dimensionality. From the left, the dimensionalities of 10, 100, 300 and 1000 were used. The ranks of the sentences are plotted on the Y-axis and the iterations on the X-axis. Each series represents a sentence.
    3.4 Figures of sentence ranks on different damping factors in PageRank.
    3.5 Effect of randomness: same text and settings on ten different random seeds. The final ranks of the sentences are plotted on the Y-axis, with the different seeds on the X-axis. The graph depicts 9 trials at the following dimensionalities, from left: 10, 20, 50, 100, 300, 500, 1000, 2000, 10000.
    3.6 Values sometimes do not converge on smaller texts. The left graph depicts the text at d=100 and the right graph at d=20.
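The figures above track PageRank iterations and damping factors over a sentence-similarity graph. The following is a minimal power-iteration sketch of that standard weighted-graph formulation (an illustrative assumption, not code from the thesis): sentences are nodes, edge weights are similarities such as cosine values, and each sentence's rank is repeatedly redistributed along its outgoing edges.

```python
def pagerank(weights, damping=0.85, iterations=50):
    """Power-iteration PageRank over a weighted sentence graph.
    weights[i][j] is the similarity between sentences i and j
    (e.g., the cosine between their vectors)."""
    n = len(weights)
    ranks = [1.0 / n] * n
    out = [sum(row) for row in weights]  # total outgoing weight per node
    for _ in range(iterations):
        new = []
        for i in range(n):
            incoming = sum(
                ranks[j] * weights[j][i] / out[j]
                for j in range(n) if out[j] > 0
            )
            new.append((1 - damping) / n + damping * incoming)
        ranks = new
    return ranks

# A 3-sentence graph: sentences 0 and 1 are highly similar to each
# other, sentence 2 is only loosely connected.
sim = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.1],
    [0.1, 0.1, 0.0],
]
ranks = pagerank(sim)
# The loosely connected sentence 2 receives the lowest rank.
```

The damping factor plays the role explored in Figure 3.4: closer to 1, ranks are dominated by graph structure and converge more slowly; closer to 0, they flatten toward the uniform distribution.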
  • Question-Based Text Summarization
    Question-based Text Summarization
    A Thesis Submitted to the Faculty of Drexel University by Mengwen Liu in partial fulfillment of the requirements for the degree of Doctor of Philosophy, December 2017. © Copyright 2017 Mengwen Liu. All Rights Reserved.

    Dedications: This thesis is dedicated to my family.

    Acknowledgments: I would like to thank my advisor, Prof. Xiaohua Tony Hu, for his generous guidance and support in the past five and a half years. Prof. Hu provided a flexible environment for me to explore different research fields. He encouraged me to work with intellectual peers and attend professional activities, and gave immediate suggestions on my studies and career development. I am also grateful to Prof. Yuan An, with whom I worked closely in the first two years of my doctoral study. I thank him for offering guidance about research, paper review, collaboration with professionals from other fields, and scientific writing. I would like to thank my dissertation committee of Prof. Weimao Ke, Prof. Erjia Yan, Prof. Bei Yu, and Prof. Yi Fang. Prof. Ke provided many suggestions towards my research projects that relate to information retrieval and my thesis topic. Prof. Yan offered me lots of valuable suggestions for my prospectus and thesis and encouraged me to conduct long-term research. Prof. Yu inspired my interest in natural language processing and text mining, and taught me how to do research from scratch with great patience.
  • Can AI Accelerate Precision Medicine?
    Can AI Accelerate Precision Medicine?
    Harvard Medical School, June 18, 2019
    Department of Biomedical Informatics, 10 Shattuck Street, Suite 514, Boston, MA 02115
    dbmi.hms.harvard.edu | Twitter: @HarvardDBMI, #pm19hms | Link: bit.ly/pm19hms

    Welcome. Recent progress in artificial intelligence has been touted as the solution for many of the challenges facing clinicians and patients in diagnosing and treating disease. Concurrently, there is a growing concern about the biases and lapses caused by the use of algorithms that are incompletely understood. Yet precision medicine requires sifting through millions of individuals measured with thousands of variables: an analysis that defeats human cognitive capabilities. In this conference, we ask the question whether AI can be used effectively to accelerate precision medicine in ways that are safe, non-discriminatory, affordable and in the patient's best interest. To answer the question, we have three panels and three keynote speakers.

    The first panel addresses the question of Policy and the Patient: Who's in Control? We will review the intersection of consent, transparency and regulatory oversight in this very dynamic ethical landscape. The second panel, Is There Value in Prediction?, addresses a widely-shared challenge in medicine, which is to predict the patient's future, and most importantly, their response to therapy. Can AI actually make an important contribution here? The third panel, on Hyperindividualized Treatments, asks the question of how to think about a future that is fast becoming the present, where therapy is "hyper-individualized," so it is the very uniqueness of your genome that determines your therapy. As always, we have a patient representative for our opening keynote, Matt Might, who returns to this annual conference to discuss the development of an algorithm for conducting precision medicine, through the lens of a personal story: discovering that his child was the first case of a new, ultra-rare genetic disorder.
  • Sentence Fusion for Multidocument News Summarization
    Sentence Fusion for Multidocument News Summarization
    Regina Barzilay (Massachusetts Institute of Technology), Kathleen R. McKeown (Columbia University)

    A system that can produce informative summaries, highlighting common information found in many online documents, will help Web users to pinpoint information that they need without extensive reading. In this article, we introduce sentence fusion, a novel text-to-text generation technique for synthesizing common information across documents. Sentence fusion involves bottom-up local multisequence alignment to identify phrases conveying similar information and statistical generation to combine common phrases into a sentence. Sentence fusion moves the summarization field from the use of purely extractive methods to the generation of abstracts that contain sentences not found in any of the input documents and can synthesize information across sources.

    1 Introduction: Redundancy in large text collections, such as the Web, creates both problems and opportunities for natural language systems. On the one hand, the presence of numerous sources conveying the same information causes difficulties for end users of search engines and news providers; they must read the same information over and over again. On the other hand, redundancy can be exploited to identify important and accurate information for applications such as summarization and question answering (Mani and Bloedorn 1997; Radev and McKeown 1998; Radev, Prager, and Samn 2000; Clarke, Cormack, and Lynam 2001; Dumais et al. 2002; Chu-Carroll et al. 2003). Clearly, it would be highly desirable to have a mechanism that could identify common information among multiple related documents and fuse it into a coherent text. In this article, we present a method for sentence fusion that exploits redundancy to achieve this task in the context of multidocument summarization.
  • Artificial Intelligence: Implications for Business Strategy Online Short Course
    MIT Sloan School of Management and MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
    Artificial Intelligence: Implications for Business Strategy (Online Short Course)
    Gain the knowledge and confidence to support the integration of AI into your organization.
    Certificate Track: Management and Leadership

    ABOUT THIS COURSE: What is artificial intelligence (AI)? What does it mean for business? And how can your company take advantage of it? This online program, designed by the MIT Sloan School of Management and the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), will help you answer these questions. Through an engaging mix of introductions to key technologies, business insights, case examples, and your own business-focused project, your learning journey will bring into sharp focus the reality of central AI technologies today and how they can be harnessed to support your business needs. Focusing on key AI technologies, such as machine learning, natural language processing, and robotics, the course will help you understand the implications of these new technologies for business strategy, as well as the economic and societal issues they raise. MIT expert instructors examine how artificial intelligence will complement and strengthen our workforce rather than just eliminate jobs. Additionally, the program will emphasize how the collective intelligence of people and computers together can solve business problems that not long ago were considered impossible.

    WHAT THE PROGRAM COVERS: This 6-week online program presents you with a foundational understanding of where we are today with AI and how we got here. Upon completion of the program, you'll be ready to apply your knowledge to support informed strategic decision making around the use of key AI …