Summarization with a Joint Model for Sentence Extraction and Compression Andre´ F

Total Page:16

File Type:pdf, Size:1020Kb

Summarization with a Joint Model for Sentence Extraction and Compression Andre´ F Summarization with a Joint Model for Sentence Extraction and Compression Andre´ F. T. Martins∗† and Noah A. Smith∗ ∗School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA †Instituto de Telecomunicac¸oes,˜ Instituto Superior Tecnico,´ Lisboa, Portugal {afm,nasmith}@cs.cmu.edu Abstract a promising framework with applications, for exam- ple, in headline generation (Dorr et al., 2003; Jin, Text summarization is one of the oldest prob- 2003), little work has been done to include it as a lems in natural language processing. Popu- module in document summarization systems. Most lar approaches rely on extracting relevant sen- existing approaches (with some exceptions, like the tences from the original documents. As a side vine-growth model of Daume,´ 2006) use a two-stage effect, sentences that are too long but partly architecture, either by first extracting a certain num- relevant are doomed to either not appear in the final summary, or prevent inclusion of other ber of salient sentences and then feeding them into relevant sentences. Sentence compression is a a sentence compressor, or by first compressing all recent framework that aims to select the short- sentences and extracting later. However, regardless est subsequence of words that yields an infor- of which operation is performed first—compression mative and grammatical sentence. This work or extraction—two-step “pipeline” approaches may proposes a one-step approach for document fail to find overall-optimal solutions; often the sum- summarization that jointly performs sentence maries are not better that the ones produced by ex- extraction and compression by solving an in- teger linear program. We report favorable ex- tractive summarization. On the other hand, a pilot perimental results on newswire data. study carried out by Lin (2003) suggests that sum- marization systems that perform sentence compres- sion have the potential to beat pure extractive sys- 1 Introduction tems if they model cross-sentence effects. In this work, we address this issue by merging the Automatic text summarization dates back to the tasks of sentence extraction and sentence compres- 1950s and 1960s (Luhn, 1958; Baxendale, 1958; Ed- sion into a global optimization problem. A careful mundson, 1969). Today, the proliferation of digital design of the objective function encourages “sparse information makes research on summarization tech- solutions,” i.e., solutions that involve only a small nologies more important than ever before. In the last number of sentences whose compressions are to be two decades, machine learning techniques have been included in the summary. Our contributions are: employed in extractive summarization of single documents (Kupiec et al., 1995; Aone et al., 1999; • We cast joint sentence extraction and compression Osborne, 2002) and multiple documents (Radev and as an integer linear program (ILP); McKeown, 1998; Carbonell and Goldstein, 1998; Radev et al., 2000). Most of this work aims only • We provide a new formulation of sentence com- to extract relevant sentences from the original doc- pression using dependency parsing information uments and present them as the summary; this sim- that only requires a linear number of variables, plification of the problem yields scalable solutions. and combine it with a bigram model; Some attention has been devoted by the NLP • We show how the full model can be trained in a community to the related problem of sentence com- max-margin framework. Since a dataset of sum- pression (Knight and Marcu, 2000): given a long maries comprised of extracted, compressed sen- sentence, how to maximally compress it into a gram- tences is unavailable, we present a procedure that matical sentence that still preserves all the rele- trains the compression and extraction models sep- vant information? While sentence compression is arately and tunes a parameter to interpolate the two models. Maximal Marginal Relevance (MMR). For long documents or large collections, it becomes impor- The compression model and the full system are tant to penalize the redundancy among the extracted compared with state-of-the-art baselines in standard sentences. Carbonell and Goldstein (1998) proposed newswire datasets. This paper is organized as fol- greedily adding sentences to the summary S to max- lows: §2–3 provide an overview of our two building imize, at each step, a score of the form blocks, sentence extraction and sentence compres- sion. §4 describes our method to perform one-step λ · scorerel(ti) − (1 − λ) · scorered(ti,S), (2) sentence compression and extraction. §5 shows ex- periments in newswire data. Finally, §6 concludes where scorerel(ti) is as in Eq. 1 and scorered(ti,S) the paper and suggests future work. accounts for the redundancy between ti and the cur- rent summary S. In our experiments, redundancy is 2 Extractive summarization the 1-gram cosine similarity between the sentence Extractive summarization builds a summary by ex- ti and the current summary S. The trade-off be- tracting a few informative sentences from the docu- tween relevance and redundancy is controlled by λ ∈ [0, 1], which is tuned on development data. ments. Let D , {t1, . , tM } be a set of sentences, contained in a single or in multiple related docu- McDonald (2007) proposed a non-greedy variant ments.1 The goal is to extract the best sequence of of MMR that takes into account the redundancy be- tween each pair of candidate sentences. This is cast sentences hti1 , . , tiK i that summarizes D whose total length does not exceed a fixed budget of J as a global optimization problem: words. We describe some well-known approaches ˆ P S = arg max λ · t ∈S scorerel(ti) − that wil serve as our experimental baselines. S i P (1 − λ) · scorered(ti, tj), (3) Extract the leading sentences (Lead). For ti,tj ∈S single-document summarization, the simplest where score (t ) θ> f (t ), score (t , t ) method consists of greedily extracting the leading rel i , rel rel i red i j , θ> f (t , t ), and f (t ) and f (t , t ) are feature sentences while they fit into the summary. A sen- red red i j rel i red i j vectors with corresponding learned weight vectors tence is skipped if its inclusion exceeds the budget, θ and θ . He has shown how the relevance-based and the next is examined. This performs extremely rel red method and the MMR framework (in the non-greedy well in newswire articles, due to the journalistic form of Eq. 3) can be cast as an ILP. By introducing convention of summarizing the article first. indicator variables hµiii=1,...,M and hµijii,j=1,...,M Rank by relevance (Rel). This method ranks sen- with the meanings tences by a relevance score, and then extracts the top 1 if t is to be extracted ones that can fit into the summary. The score is typ- µ = i i 0 otherwise ically a linear function of feature values: 1 if ti and tj are both to be extracted > PD µij = scorerel(ti) , θ f(ti) = d=1 θdfd(ti), (1) 0 otherwise (4) 2 Here, each fd(ti) is a feature extracted from sen- one can reformulate Eq. 3 as an ILP with O(M ) tence ti, and θd is the corresponding weight. In our variables and constraints: experiments, relevance features include (i) the recip- PM rocal position in the document, (ii) a binary feature max λ · i=1 µiscorerel(ti) − (5) hµ i,hµ i indicating whether the sentence is the first one, and i ij PM PM (iii) the 1-gram and 2-gram cosine similarity with (1 − λ) · i=1 j=1 µijscorered(ti, tj), the headline and with the full document. subject to binary constraints µi, µij ∈ {0, 1}, the 1 For simplicity, we describe a unified framework for single length constraint PM µ N ≤ J (where N is the and multi-document summarization, although they may require i=1 i i i specialized strategies. Here we experiment only with single- number of words of the ith sentence), and the fol- document summarization and assume t1, ..., tM are ordered. lowing “agreement constraints” for i, j = 1,...,M (that impose the logical relation µij = µi ∧ µj): 2000; Daume´ and Marcu, 2002), heuristic methods that parse the sentence and then trim constituents ac- µij ≤ µi, µij ≤ µj, µij ≥ µi + µj − 1 (6) cording to linguistic criteria (Dorr et al., 2003; Zajic et al., 2006), a pure discriminative model (McDon- Let us provide a compact representation of the pro- ald, 2006), and an ILP formulation (Clarke and La- gram in Eq. 5 that will be used later. Define our vec- pata, 2008). We next give an overview of the two tor of parameters as θ [λθ , −(1−λ)θ ]. Pack- , rel red latter approaches. ing all the feature vectors (one for each sentence, and McDonald (2006) uses the outputs of two parsers one for each pair of sentences) into a matrix F, (a phrase-based and a dependency parser) as fea- Frel 0 tures in a discriminative model that decomposes F , , (7) 0 Fred over pairs of consecutive words. Formally, given a sentence t = hw1, . , wN i, the score of a compres- with Frel , [frel(ti)]1≤i≤M and Fred , sion c = hwj1 , . , wjL i decomposes as: [fred(ti, tj)]1≤i<j≤M , and packing all the variables PL > µi and µij into a vector µ, the program in Eq. 5 can score(c; t) = l=2 θ f(t, jl−1, jl) (9) be compactly written as where f(t, jl−1, jl) are feature vectors that depend > max θ Fµ, (8) on the original sentence t and consecutive positions µ jl−1 and jl, and θ is a learned weight vector. The subject to binary and linear constraints on µ.
Recommended publications
  • Shakespeare in the Eighteenth Century: Algorithm for Quotation Identification
    University of Arkansas, Fayetteville ScholarWorks@UARK Theses and Dissertations 5-2020 Shakespeare in the Eighteenth Century: Algorithm for Quotation Identification Marion Pauline Chiariglione University of Arkansas, Fayetteville Follow this and additional works at: https://scholarworks.uark.edu/etd Part of the Numerical Analysis and Scientific Computing Commons, and the Theory and Algorithms Commons Citation Chiariglione, M. P. (2020). Shakespeare in the Eighteenth Century: Algorithm for Quotation Identification. Theses and Dissertations Retrieved from https://scholarworks.uark.edu/etd/3580 This Thesis is brought to you for free and open access by ScholarWorks@UARK. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of ScholarWorks@UARK. For more information, please contact [email protected]. Shakespeare in the Eighteenth Century: Algorithm for Quotation Identification A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science by Marion Pauline Chiariglione IUT Dijon, University of Burgundy Bachelor of Science in Computer Science, 2017 May 2020 University of Arkansas This thesis is approved for recommendation to the Graduate Council Susan Gauch, Ph.D. Thesis Director Qinghua Li, Ph.D. Committee member Khoa Luu, Ph.D. Committee member Abstract Quoting a borrowed excerpt of text within another literary work was infrequently done prior to the beginning of the eighteenth century. However, quoting other texts, particularly Shakespeare, became quite common after that. Our work develops automatic approaches to identify that trend. Initial work focuses on identifying exact and modified sections of texts taken from works of Shakespeare in novels spanning the eighteenth century. We then introduce a novel approach to identifying modified quotes by adapting the Edit Distance metric, which is character based, to a word based approach.
    [Show full text]
  • Using N-Grams to Understand the Nature of Summaries
    Using N-Grams to Understand the Nature of Summaries Michele Banko and Lucy Vanderwende One Microsoft Way Redmond, WA 98052 {mbanko, lucyv}@microsoft.com views of the event being described over different Abstract documents, or present a high-level view of an event that is not explicitly reflected in any single document. A Although single-document summarization is a useful multi-document summary may also indicate the well-studied task, the nature of multi- presence of new or distinct information contained within document summarization is only beginning to a set of documents describing the same topic (McKeown be studied in detail. While close attention has et. al., 1999, Mani and Bloedorn, 1999). To meet these been paid to what technologies are necessary expectations, a multi-document summary is required to when moving from single to multi-document generalize, condense and merge information coming summarization, the properties of human- from multiple sources. written multi-document summaries have not Although single-document summarization is a well- been quantified. In this paper, we empirically studied task (see Mani and Maybury, 1999 for an characterize human-written summaries overview), multi-document summarization is only provided in a widely used summarization recently being studied closely (Marcu & Gerber 2001). corpus by attempting to answer the questions: While close attention has been paid to multi-document Can multi-document summaries that are summarization technologies (Barzilay et al. 2002, written by humans be characterized as Goldstein et al 2000), the inherent properties of human- extractive or generative? Are multi-document written multi-document summaries have not yet been summaries less extractive than single- quantified.
    [Show full text]
  • Multi-Document Biography Summarization
    Multi-document Biography Summarization Liang Zhou, Miruna Ticrea, Eduard Hovy University of Southern California Information Sciences Institute 4676 Admiralty Way Marina del Rey, CA 90292-6695 {liangz, miruna, hovy} @isi.edu Abstract In this paper we describe a biography summarization system using sentence classification and ideas from information retrieval. Although the individual techniques are not new, assembling and applying them to generate multi-document biographies is new. Our system was evaluated in DUC2004. It is among the top performers in task 5–short summaries focused by person questions. 1 Introduction Automatic text summarization is one form of information management. It is described as selecting a subset of sentences from a document that is in size a small percentage of the original and Figure 1. Overall design of the biography yet is just as informative. Summaries can serve as summarization system. surrogates of the full texts in the context of To determine what and how sentences are Information Retrieval (IR). Summaries are created selected and ranked, a simple IR method and from two types of text sources, a single document experimental classification methods both or a set of documents. Multi-document contributed. The set of top-scoring sentences, after summarization (MDS) is a natural and more redundancy removal, is the resulting biography. elaborative extension of the single-document As yet, the system contains no inter-sentence summarization, and poses additional difficulties on ‘smoothing’ stage. algorithm design. Various kinds of summaries fall In this paper, work in related areas is discussed into two broad categories: generic summaries are in Section 2; a description of our biography corpus the direct derivatives of the source texts; special- used for training and testing the classification interest summaries are generated in response to component is in Section 3; Section 4 explains the queries or topic-oriented questions.
    [Show full text]
  • Quesgen Using Nlp 01
    Quesgen Using Nlp 01 International Journal of Latest Trends in Engineering and Technology Vol.(13)Issue(2), pp.009-014 DOI: http://dx.doi.org/10.21172/1.132.02 e-ISSN:2278-621X QUESGEN USING NLP Pawan NGP1, Pooja Bahuguni2, Pooja Dattatri3, Shilpi Kumari4, Vikranth B.M5 Abstract— when people read for long hours, they seldom are able to grasp concepts and it gives them false sense of understanding it. The aim of this project is to tackle this problem by processing given text and generating applicable questions and answer. The steps followed are: 1. Candidate key sentences are selected (using Text Rank). 2. Candidate key words are selected from candidate key sentences (RAKE). 3. These selected key sentences and words are stored in the database (MongoDB) and presented to the user through chatbot interface. Keywords— NLP, NLP toolkit, Sentence extraction, Keyword extraction, ChatBot, RAKE, TextRank 1. INTRODUCTION Humans are the most curious by nature. Asking Questions to meet their never-ending quest for information and knowledge. For Example,teachers ask students, questions to evaluate performance of the students, pupils learn by asking questions to teachers,and even our normal life conversation consists of asking questions. Questions are the major part of countless learning interactions. However, with the advent of technology, attention spans of individuals have significantly gone down and they are not able to ask good questions. It has been noticed that when people try to read for long hours, they seldom are able to grasp concepts. But having spent some time reading gives people a false sense of understanding it.
    [Show full text]
  • An Automatic Text Summarization for Malayalam Using Sentence Extraction
    International Journal of Advanced Computational Engineering and Networking, ISSN: 2320-2106, Volume-3, Issue-8, Aug.-2015 AN AUTOMATIC TEXT SUMMARIZATION FOR MALAYALAM USING SENTENCE EXTRACTION 1RENJITH S R, 2SONY P 1M.Tech Computer and Information Science, Dept.of Computer Science, College of Engineering Cherthala Kerala, India-688541 2Assistant Professor, Dept. of Computer Science, College of Engineering Cherthala, Kerala, India-688541 Abstract—Text Summarization is the process of generating a short summary for the document that contains the significant portion of information. In an automatic text summarization process, a text is given to the computer and the computer returns a shorter less redundant extract of the original text. The proposed method is a sentence extraction based single document text summarization which produces a generic summary for a Malayalam document. Sentences are ranked based on feature scores and Googles PageRank formula. Top k ranked sentences will be included in summary where k depends on the compression ratio between original text and summary. Performance evaluation will be done by comparing the summarization outputs with manual summaries generated by human evaluators. Keywords—Text summarization, Sentence Extraction, Stemming, TF-ISF score, Sentence similarity, PageRank formula, Summary generation. I. INTRODUCTION a summary, which represents the subject matter of an article by understanding the whole meaning, which With enormous growth of information on cyberspace, are generated by reformulating the salient unit conventional Information Retrieval techniques have selected from an input sentences. It may contain some become inefficient for finding relevant information text units which are not present in the input text. An effectively. When we give a keyword to be searched extract is a summary consisting of a number of on the internet, it returns thousands of documents sentences selected from the input text.Sentence overwhelming the user.
    [Show full text]
  • Automatic Document Summarization by Sentence Extraction
    Вычислительные технологии Том 12, № 5, 2007 AUTOMATIC DOCUMENT SUMMARIZATION BY SENTENCE EXTRACTION R. M. Aliguliyev Institute of Information Technology of the National Academy of Sciences of Azerbaijan, Baku e-mail: [email protected] Представлен метод автоматического реферирования документов, который ге- нерирует резюме документа путем кластеризации и извлечения предложений из исходного документа. Преимущество предложенного подхода в том, что сгенери- рованное резюме документа может включать основное содержание практически всех тем, представленных в документе. Для определения оптимального числа кла- стеров введен критерий оценки качества кластеризации. Introduction Automatic document processing is a research field that is currently extremely active. One important task in this field is automatic document summarization, which preserves its information content [1]. With a large volume of text documents, presenting the user with a summary of each document greatly facilitates the task of finding the desired documents. Text search and summarization are the two essential technologies that complement each other. Text search engines return a set of documents that seem to be relevant to the user’s query, and text enable quick examinations through the returned documents. In general, automatic document summarization takes a source document (or source documents) as input, extracts the essence of the source(s), and presents a well-formed summary to the user. Mani and Maybury [1] formally defined automatic document summari- zation as the process of distilling the most important information from a source(s) to produce an abridged version for a particular user (or users) and task (or tasks). The process can be decomposed into three phases: analysis, transformation and synthesis. The analysis phase analyzes the input document and selects a few salient features.
    [Show full text]
  • The Elements of Automatic Summarization
    The Elements of Automatic Summarization by Daniel Jacob Gillick A dissertation submitted in partial satisfaction of the requirements for the degree of in Computer Science in the GRADUATE DIVISION of the UNIVERSITY OF CALIFORNIA, BERKELEY Committee in charge: Professor Nelson Morgan, Chair Professor Daniel Klein Professor Thomas Griffiths Spring 2011 The Elements of Automatic Summarization Copyright © 2011 by Daniel Jacob Gillick Abstract The Elements of Automatic Summarization by Daniel Jacob Gillick Doctor of Philosophy in Computer Science University of California, Berkeley Professor Nelson Morgan, Chair This thesis is about automatic summarization, with experimental results on multi- document news topics: how to choose a series of sentences that best represents a col- lection of articles about one topic. I describe prior work and my own improvements on each component of a summarization system, including preprocessing, sentence valuation, sentence selection and compression, sentence ordering, and evaluation of summaries. The centerpiece of this work is an objective function for summariza- tion that I call "maximum coverage". The intuition is that a good summary covers as many possible important facts or concepts in the original documents. It turns out that this objective, while computationally intractable in general, can be solved efficiently for medium-sized problems and has reasonably good fast approximate so- lutions. Most importantly, the use of an objective function marks a departure from previous algorithmic approaches to summarization. 1 Acknowledgements Getting a Ph.D. is hard. Not really hard in the day-to-day sense, but more because I spent a lot of time working on a few small problems, and at the end of six years, I have only made a few small contributions to the world.
    [Show full text]
  • A Query Focused Multi Document Automatic Summarization
    PACLIC 24 Proceedings 545 A Query Focused Multi Document Automatic Summarization Pinaki Bhaskar and Sivaji Bandyopadhyay Department of Computer Science & Engineering, Jadavpur University, Kolkata – 700032, India [email protected], [email protected] Abstract. The present paper describes the development of a query focused multi-document automatic summarization. A graph is constructed, where the nodes are sentences of the documents and edge scores reflect the correlation measure between the nodes. The system clusters similar texts having related topical features from the graph using edge scores. Next, query dependent weights for each sentence are added to the edge score of the sentence and accumulated with the corresponding cluster score. Top ranked sentence of each cluster is identified and compressed using a dependency parser. The compressed sentences are included in the output summary. The inter-document cluster is revisited in order until the length of the summary is less than the maximum limit. The summarizer has been tested on the standard TAC 2008 test data sets of the Update Summarization Track. Evaluation of the summarizer yielded accuracy scores of 0.10317 (ROUGE-2) and 0.13998 (ROUGE–SU-4). Keywords: Multi Document Summarizer, Query Focused, Cluster based approach, Parsed and Compressed Sentences, ROUGE Evaluation. 1 Introduction Text Summarization, as the process of identifying the most salient information in a document or set of documents (for multi document summarization) and conveying it in less space, became an active field of research in both Information Retrieval (IR) and Natural Language Processing (NLP) communities. Summarization shares some basic techniques with indexing as both are concerned with identification of the essence of a document.
    [Show full text]
  • Automatic Summarization of Medical Conversations, a Review
    Jessica López Espejel Automatic summarization of medical conversations, a review Jessica López Espejel 1, 2 (1) CEA, LIST, DIASI, F-91191 Gif-sur-Yvette, France. (2) Paris 13 University, LIPN, 93430 Villateneuse, France. [email protected] RÉSUMÉ L’analyse de la conversation joue un rôle important dans le développement d’appareils de simulation pour la formation des professionnels de la santé (médecins, infirmières). Notre objectif est de développer une méthode de synthèse automatique originale pour les conversations médicales entre un patient et un professionnel de la santé, basée sur les avancées récentes en matière de synthèse à l’aide de réseaux de neurones convolutionnels et récurrents. La méthode proposée doit être adaptée aux problèmes spécifiques liés à la synthèse des dialogues. Cet article présente une revue des différentes méthodes pour les résumés par extraction et par abstraction et pour l’analyse du dialogue. Nous décrivons aussi les utilisation du Traitement Automatique des Langues dans le domaine médical. ABSTRACT Conversational analysis plays an important role in the development of simulation devices for the training of health professionals (doctors, nurses). Our goal is to develop an original automatic synthesis method for medical conversations between a patient and a healthcare professional, based on recent advances in summarization using convolutional and recurrent neural networks. The proposed method must be adapted to the specific problems related to the synthesis of dialogues. This article presents a review of the different methods for extractive and abstractive summarization, and for dialogue analysis. We also describe the use of Natural Language Processing in the medical field. MOTS-CLÉS : Résumé automatique, domaine médical, dialogues, revue.
    [Show full text]
  • A Hybrid Approach for Multi-Document Text Summarization
    San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Fall 12-11-2019 A Hybrid Approach for Multi-document Text Summarization Rashmi Varma Follow this and additional works at: https://scholarworks.sjsu.edu/etd_projects Part of the Artificial Intelligence and Robotics Commons, and the Databases and Information Systems Commons A Hybrid Approach for Multi-document Text Summarization A Project Presented to The Faculty of the Department of Computer Science San Jose State University In Partial Fulfillment of the Requirements for the Degree Master of Science by Rashmi Varma December 2019 ○c 2019 Rashmi Varma ALL RIGHTS RESERVED The Designated Project Committee Approves the Project Titled A Hybrid Approach for Multi-document Text Summarization by Rashmi Varma APPROVED FOR THE DEPARTMENT OF COMPUTER SCIENCE SAN JOSE STATE UNIVERSITY December 2019 Dr. Robert Chun Department of Computer Science Dr. Katerina Potika Department of Computer Science Ms. Manasi Thakur Tuutkia Inc. ABSTRACT A Hybrid Approach for Multi-document Text Summarization by Rashmi Varma Text summarization has been a long studied topic in the field of natural language processing. There have been various approaches for both extractive text summarization as well as abstractive text summarization. Summarizing texts for a single document is a methodical task. But summarizing multiple documents poses as a greater challenge. This thesis explores the application of Latent Semantic Analysis, Text-Rank, Lex-Rank and Reduction algorithms for single document text summarization and compares it with the proposed approach of creating a hybrid system combining each of the above algorithms, individually, with Restricted Boltzmann Machines for multi-document text summarization and analyzing how all the approaches perform.
    [Show full text]
  • A Chinese Chatter Robot for Online Shopping Guide
    Ch2R: A Chinese Chatter Robot for Online Shopping Guide Peijie Huang, Xianmao Lin, Zeqi Lian, De Yang, Xiaoling Tang, Li Huang, Qi- ang Huang, Xiupeng Wu, Guisheng Wu and Xinrui Zhang College of Informatics, South China Agricultural University, Guangzhou 510642, Guangdong, China [email protected] tive recommendations, thus behaving more like Abstract human agents (Liu et al., 2010). For example, in the scenario that we envision, on online e- In this paper we present a conversational commerce site, an intelligent dialogue system dialogue system, Ch2R (Chinese Chatter which roles play a conversational shopping guide Robot) for online shopping guide, which may suggest which digital camera is a better allows users to inquire about information choice, considering brand, price, pixel, etc.; or of mobile phone in Chinese. The purpose suggest which mobile phone is the most popular of this paper is to describe our develop- among teenagers or highest rated by users. ment effort in terms of the underlying hu- In this paper, we describe our development ef- man language technologies (HLTs) as fort on a Chinese chatter robot, named Ch2R well as other system issues. We focus on a (Chinese Chatter Robot) for shopping guide mixed-initiative conversation mechanism with both intelligent ability and professional for interactive shopping guide combining knowledge. The challenges of developing such a initiative guiding and question under- information guiding dialogue system in Chinese standing. We also present some evalua- includes: 1) how to provide active assistance and tion on the system in mobile phone shop- subjective recommendations; 2) how to deal with ping guide domain.
    [Show full text]
  • Relation Extraction Using Distant Supervision: a Survey
    Relation Extraction Using Distant Supervision: a Survey ALISA SMIRNOVA, eXascale Infolab, U. of Fribourg—Switzerland, Switzerland PHILIPPE CUDRÉ-MAUROUX, eXascale Infolab, U. of Fribourg—Switzerland, Switzerland Relation extraction is a subtask of information extraction where semantic relationships are extracted from nat- ural language text and then classified. In essence, it allows to acquire structured knowledge from unstructured text. In this work, we present a survey of relation extraction methods that leverage pre-existing structured or semi-structured data to guide the extraction process. We introduce a taxonomy of existing methods and describe distant supervision approaches in detail. We describe in addition the evaluation methodologies and the datasets commonly used for quality assessment. Finally, we give a high-level outlook on the field, highlighting open problems as well as the most promising research directions. CCS Concepts: • Information systems → Content analysis and feature selection; Information extrac- tion; Additional Key Words and Phrases: relation extraction, distant supervision, knowledge graph ACM Reference Format: Alisa Smirnova and Philippe Cudré-Mauroux. 2019. Relation Extraction Using Distant Supervision: a Survey. ACM Comput. Surv. 1, 1 (March 2019), 35 pages. https://doi.org/10.1145/3241741 1 INTRODUCTION Despite the rise of semi-structured and structured data, text is still the most widespread content in companies and on the Web. As understanding text is deemed AI-complete, however, a popular idea is to extract structured data from text in order to make it machine-readable or processable. One of the main subtasks in this context is extracting semantic relations between various entities in free-form text, also known as relation extraction.
    [Show full text]