IADIS International Journal on Computer Science and Information Systems Vol. 5, No.1, pp. 45-57 ISSN: 1646-3692

AUTOMATIC SUMMARIZATION OF NEWS USING WORDNET CONCEPT GRAPHS

Laura Plaza , Facultad de Informática, Universidad Complutense de Madrid, C/ Prof. José García Santesmases, s/n. 28040 Madrid (Spain) [email protected]

Alberto Díaz, Facultad de Informática, Universidad Complutense de Madrid, C/ Prof. José García Santesmases, s/n. 28040 Madrid (Spain) [email protected]

Pablo Gervás, Instituto de Tecnología del Conocimiento, Universidad Complutense de Madrid , C/ Prof. José García Santesmases, s/n. 28040 Madrid (Spain) [email protected]

ABSTRACT

One of the main handicaps in research on automatic summarization is the shallow semantic understanding of the source, which is reflected in the poor quality of the resulting summaries. Using additional knowledge, such as that provided by ontologies, to construct a richer semantic representation of the text can considerably alleviate this problem. In this paper, we introduce an ontology-based extractive method for summarization. It is based on mapping the text to concepts and representing the document and its sentences as graphs. We have applied our approach to news articles, taking advantage of freely available resources such as WordNet. Preliminary empirical results are presented and pending problems are identified.

KEYWORDS

Automatic Summarization, Graph Theory, Ontology, Natural Language Processing

1. INTRODUCTION

Nowadays, Internet access to news sites has become a day-to-day practice. But the huge amount of news generated every day makes exhaustive reading unfeasible. In order to tackle this information overload, automatic summarization can undoubtedly play a role,


allowing users to get a proper idea of what an article is about in just a few lines, without having to read the complete item. Some news delivery services already provide summarization tools to support users in selecting relevant information in the news items. Nevertheless, there is much room for improvement.

In recent years, a large number of resources, such as ontologies like WordNet, have emerged. As they provide the particular meanings of terms as they apply to the domain in hand, they can definitely benefit the development of NLP systems and, in particular, when used in automatic summarization, they can increase the quality of the resulting summaries.

Automatic document summarization has been an important subject of study since the pioneering works by Luhn and Edmundson in the 50s and 60s. While these early approaches were based on simple heuristic features, such as the position of sentences in the document (Brandow et al., 1995) or the frequency of the words they contain (Luhn, 1958; Edmundson, 1969), several graph-based methods have recently been proposed to rank sentences for extraction (Erkan and Radev, 2004). Several authors (Erkan and Radev, 2004; Mihalcea and Tarau, 2004) have applied graph theory to text summarization in order to construct a shallow representation of the documents from text units. However, few approaches explore more complex representations based on concepts connected by semantic relations (synonymy, hypernymy, and similarity relations). One of the main arguments for the use of shallow representations is their language independence, while semantic representations provide additional knowledge that can benefit the quality of the resulting summary.

In this paper, we introduce a graph-based approach to extractive summarization for domain-independent documents, which uses ontologies to identify concepts and the semantic relations between them and allows a richer text representation. The proposed method uses a semantic graph-based representation of the documents, where the vertices are the WordNet concepts associated with the terms, and the edges indicate different relations between them. This representation makes it possible to combine the desired domain independence with the use of complex semantic relations.

The paper is organized as follows. Related work is discussed in Section 2. Section 3 introduces the lexical database WordNet. Section 4 presents our semantic graph-based method for extractive summarization. Section 5 shows the experimental results of the preliminary evaluation carried out. Finally, we draw some conclusions and outline future lines of work in Section 6.

2. RELATED WORK

Automatic summarization is the process through which the relevant information from one or several sources is identified in order to produce a briefer version intended for a particular user - or group of users - or a particular task (Maña, 1999). Under this definition, the various types of summaries can be classified according to their purpose, their scope and their focus (Mani, 2001). Regarding the scope, a summary may be restricted to a single document or cover a set of documents about the same topic. Regarding their purpose, that is, the use or task for which they are intended, summaries are classified as:


• Indicative, if the aim is to anticipate for the user the content of the text and to help them decide on the relevance of the original document.
• Informative, if they aim to substitute the original text by incorporating all the new or relevant information.
• Critical, if they incorporate opinions or comments that do not appear in the original text.

Finally, attending to their focus, we can distinguish between:

• Generic, if they gather the main topics of the document and are addressed to a wide group of readers.
• User adapted, if the summary is constructed according to the interests - i.e. previous knowledge, areas of interest, or information needs - of the particular reader or group of readers it is addressed to (Díaz et al., 2007).

Sparck-Jones (1999) defined a summary as a reductive transformation of source text to text through content reduction by selection and/or generalization on what is important in the source. This definition may seem obvious, but the truth is that automatic summarization still exhibits important deficiencies and continues to attract a considerable body of work. The definition by Sparck-Jones suggests that there exist two generic groups of summarization methods: those which generate extracts and those which generate abstracts. Extractive methods construct summaries basically by selecting salient sentences from documents, and therefore they are integrally composed of material that is explicitly present in the source. Although human summaries are typically abstracts, most existing systems produce extracts.

Sentence extraction methods typically build summaries based on a superficial analysis of the source. Early summarization systems were based on simple heuristic features, such as the position of sentences in the document (Brandow et al., 1995), the frequency of the words they contain (Luhn, 1958; Edmundson, 1969), or the presence of certain cue words or indicative phrases (Edmundson, 1969). Some more advanced approaches also employ machine learning techniques to determine the best set of attributes for extraction (Kupiec et al., 1995).

A second group of summarization systems are those frequently known as entity-level approaches (Mani and Maybury, 1999). These approaches build an internal representation of the text, modeling the text entities (i.e. terms, phrases, sentences or even paragraphs) and their relationships. The metrics for determining the relatedness of the entities include the proximity or distance between text units, co-occurrence, and so on.

Recently, several graph-based methods have been proposed to rank sentences for extraction. LexRank (Erkan and Radev, 2004) is an example of a graph-based method for automatic multi-document summarization that assesses sentence importance based on the concept of eigenvector centrality. It assumes a fully connected, undirected graph with sentences as nodes and the similarities between them as edges. It represents the sentences in each document by their tf*idf vectors and computes sentence connectivity using the cosine similarity. A very similar system is TextRank (Mihalcea and Tarau, 2004), which has been applied to single-document summarization as well as to other NLP tasks. More recently, Litvak and Last (2008) proposed a novel approach that makes use of a graph-based syntactic representation of text documents for keyword extraction, to be used as a first step in summarization.
Even if the results are promising, these graph-based approaches exhibit important deficiencies which are a consequence of not capturing the semantic relations between terms (synonymy,


hypernymy, homonymy, co-occurrence relations, and so on). The following two sentences illustrate these problems.

1. "Hurricanes are useful to the climate machine. Their primary role is to transport heat from the lower to the upper atmosphere," he said.
2. He explained that cyclones are part of the atmospheric circulation mechanism, as they move heat from the inferior to the superior atmosphere.

As both sentences use different terms, approaches based on term frequencies do not succeed in determining that both have exactly the same meaning. However, methods based on semantic representations can indeed capture this equivalence.
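To make the point concrete, the following sketch (ours, not part of the system described later, and assuming NLTK with its WordNet data installed) shows that "hurricane" and "cyclone" map to closely related concepts in the WordNet hierarchy, even though the surface terms never match.

```python
# Minimal illustration: a concept-level view relates "hurricane" and "cyclone",
# which a term-frequency representation treats as completely different words.
from nltk.corpus import wordnet as wn

hurricane = wn.synset("hurricane.n.01")
# In WordNet, hurricane sits under the windstorm branch; its direct hypernym is a
# sense of "cyclone", so both example sentences activate the same part of the hierarchy.
print(hurricane.hypernyms())

# Best path similarity over all noun senses of the two words (exact sense numbers
# may differ across WordNet versions; a direct parent-child pair scores 0.5).
best = max(
    (s1.path_similarity(s2) or 0.0)
    for s1 in wn.synsets("hurricane", pos=wn.NOUN)
    for s2 in wn.synsets("cyclone", pos=wn.NOUN)
)
print(best)
```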

3. WORDNET

WordNet is an electronic lexical database developed at Princeton University (Miller et al., 1993). WordNet 1 structures lexical information in terms of meanings. Words of the same syntactic category that can be used to express the same concept are grouped into a single synonym set, called a synset. Each synset has a unique identifier and a gloss that defines it. Most synsets are connected to other synsets via a number of semantic relations. These relations vary with the type of word and include, among others:

• Synonymy: WordNet's basic relation. It is a symmetric relation between two words that share at least one sense in common (dog and canine are synonyms).
• Antonymy: Also a symmetric relation, between word forms that have opposite meanings (beautiful and ugly are antonyms).
• Hypernymy and Hyponymy: Y is a hypernym of X if every X is a Y (feline is a hypernym of cat). Y is a hyponym of X if every Y is an X (cat is a hyponym of feline).
• Holonymy and Meronymy: Y is a holonym of X if X is a part of Y (vehicle is a holonym of wheel). Y is a meronym of X if Y is a part of X (wheel is a meronym of vehicle).
• Troponymy: the verb Y is a troponym of the verb X if the activity Y is doing X in some manner (to lisp is a troponym of to talk).
• Entailment: the verb Y is entailed by X if by doing X you must be doing Y (to sleep is entailed by to snore).
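The relations listed above can be explored directly through NLTK's WordNet interface. The sketch below is ours (the exact sense numbers may vary across WordNet versions) and assumes NLTK with its WordNet corpus installed.

```python
# Exploring WordNet synsets and the relations described above with NLTK.
from nltk.corpus import wordnet as wn

dog = wn.synset("dog.n.01")
print(dog.lemma_names())        # synonyms grouped in the same synset
print(dog.definition())         # the gloss that defines the synset
print(dog.hypernyms())          # more general concepts (e.g. a canine sense)
print(dog.hyponyms())           # more specific concepts
print(dog.member_holonyms())    # wholes this concept is a member of

# Antonymy is a relation between lemmas rather than synsets.
beautiful = wn.synset("beautiful.a.01").lemmas()[0]
print(beautiful.antonyms())

# Verb-specific relations: snoring entails sleeping.
snore = wn.synset("snore.v.01")
print(snore.entailments())
```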

4. SUMMARIZATION METHOD

The method proposed consists of three steps: document representation, concept clustering and sentence selection. Each step is discussed in detail below. A preliminary system has been implemented and tested on several documents from the Document Understanding Conferences 2002 2 collection. In order to clarify how the algorithm works, each step is illustrated with a worked example document from the DUC collection.

1 WordNet: http://wordnet.princeton.edu
2 Document Understanding Conferences (DUC): http://www-nlpir.nist.gov/projects/duc/data.html


4.1 Document Representation

In order to construct the concept graph that represents the document, a preliminary step is undertaken to split the text into sentences and to remove generic and high-frequency terms. This preprocessing has been carried out using the Tokenizer, Part-of-Speech Tagger and Sentence Splitter modules in GATE 3. Once the sentences have been isolated, we use WordNet Sense Relate 4 to translate the terms in each sentence into the appropriate concepts in WordNet. WordNet Sense Relate is a Perl package that implements different measures of similarity and relatedness to perform word sense disambiguation. The "all words" version assigns a sense or meaning (as found in WordNet) to each word in a text. It carries out Word Sense Disambiguation (WSD) by measuring the semantic similarity between a word and its neighbors (Patwardhan et al., 2005). Figure 1 shows the result of applying WordNet Sense Relate to an example sentence.

Hurricane Gilbert swept toward the Dominican Republic Sunday, and the Civil Defense alerted its heavily populated south coast to prepare for high winds, heavy rains and high seas.

Term                 WN Sense   Term       WN Sense   Term      WN Sense
hurricane            1          defense    9          prepare   4
Gilbert              2          alert      1          high      2
sweep                1          heavily    2          wind      1
Dominican Republic   1          populate   2          heavy     1
Sunday               1          south      1          rain      1
civil                1          coast      1          sea       1

Figure 1. WordNet senses in an example sentence

The term defense clearly illustrates the importance of using a disambiguation algorithm. The noun defense presents 9 different senses in WordNet and, to be precise, the first sense refers to the role of certain players in some sports. Obviously, without a WSD algorithm the wrong sense would be considered.

The resulting WordNet concepts derived from common nouns are extended with their hypernyms, building a hierarchical representation for each sentence in the document, where the edges represent semantic relations and are temporarily unlabeled, and only a single vertex is created for each distinct concept in the text. This means that if two different terms in a sentence stand for the same concept, only one vertex is created in the graph to represent both terms. Verbs, adverbs, adjectives and proper nouns are not taken into account in this phase. Figure 2 shows the semantic representation for the example sentence.
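The original implementation relies on GATE and the Perl WordNet::SenseRelate package. As a rough, hedged analogue of this step, the following sketch (ours; it uses NLTK's simplified Lesk disambiguation, so the senses it picks will not necessarily match those in Figure 1, and the networkx representation and helper name are our own choices) disambiguates the common nouns of a sentence and extends each concept with its hypernym chain.

```python
# Build a small sentence graph: disambiguate common nouns and add unlabeled
# "is-a" edges up the WordNet hypernym chain, one vertex per distinct concept.
# Assumes NLTK (with WordNet, punkt and the POS tagger data) and networkx.
import networkx as nx
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

def sentence_graph(sentence):
    graph = nx.Graph()
    tokens = word_tokenize(sentence)
    for word, tag in pos_tag(tokens):
        if not tag.startswith("NN") or tag.startswith("NNP"):
            continue                                 # only common nouns are kept
        synset = lesk(tokens, word, pos=wn.NOUN)     # simplified Lesk WSD
        if synset is None:
            continue
        current = synset
        while True:                                  # walk up the hypernym chain
            hypernyms = current.hypernyms()
            if not hypernyms:
                break
            parent = hypernyms[0]
            graph.add_edge(current.name(), parent.name())
            current = parent
    return graph

g = sentence_graph("Hurricane Gilbert swept toward the Dominican Republic Sunday, "
                   "and the Civil Defense alerted its heavily populated south coast "
                   "to prepare for high winds, heavy rains and high seas.")
print(sorted(g.nodes()))
```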

3 GATE (Generic Architecture for Text Engineering): http://gate.ac.uk/
4 WordNet Sense Relate: http://www.d.umn.edu/~tpederse/senserelate.html


Figure 2. Sentence graph

After that, the sentence graphs are merged into a single graph that represents the whole document. This graph can be extended with different and more specific semantic relations between the nodes. We have conducted several experiments using a "semantic similarity relation" in addition to the is-a relation previously mentioned. In order to compute this similarity between every pair of leaf concepts in the graph, we use the WordNet Similarity package (Pedersen et al., 2004). This package is a Perl module that implements a variety of semantic similarity and relatedness measures based on the information found in the lexical database WordNet. In particular, we have used the Lesk algorithm, which computes the semantic relatedness of word senses using gloss overlaps (Banerjee and Pedersen, 2002). A new edge is added between two nodes if the underlying concepts are more similar than a given similarity threshold.

Tests have been conducted in two settings: using both types of relations together (the hypernymy relation and the similarity relation), and using the hypernymy relation on its own, in order to determine the best graph-based representation. It is important to note that it is not feasible to use only the similarity relation, as this would lead to a very disconnected graph and would make the clustering algorithm inadequate.
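The system itself relies on the Perl WordNet::Similarity implementation of the Lesk measure. Purely as a hedged stand-in, the sketch below scores the gloss overlap of two WordNet synsets directly and adds a similarity edge when the score exceeds a threshold; the overlap function and the default threshold are illustrative assumptions, not the exact measure used in the paper.

```python
# Crude gloss-overlap (Lesk-style) relatedness between WordNet synsets, used to
# add "similarity" edges between leaf concepts of the document graph.
# Assumes NLTK with the WordNet and stopwords data.
from nltk.corpus import stopwords, wordnet as wn

STOP = set(stopwords.words("english"))

def gloss_overlap(s1, s2):
    """Fraction of shared content words between the two glosses (a crude Lesk proxy)."""
    g1 = {w for w in s1.definition().lower().split() if w not in STOP}
    g2 = {w for w in s2.definition().lower().split() if w not in STOP}
    if not g1 or not g2:
        return 0.0
    return len(g1 & g2) / len(g1 | g2)

def add_similarity_edges(graph, leaf_synsets, threshold=0.2):
    """Add an extra edge between leaf concepts whose relatedness exceeds the threshold."""
    for i, s1 in enumerate(leaf_synsets):
        for s2 in leaf_synsets[i + 1:]:
            if gloss_overlap(s1, s2) >= threshold:
                graph.add_edge(s1.name(), s2.name(), relation="similarity")

# Example: relatedness between two leaf concepts of the example sentence.
print(gloss_overlap(wn.synset("hurricane.n.01"), wn.synset("wind.n.01")))
```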


Figure 3. Example of edge weight assignment

Finally, each edge is labeled with a weight, which is directly proportional to the depth in the hierarchy at which the connected concepts lie (Figure 3). That is to say, the more specific the connected concepts are, the higher the weight assigned to the edge. Edge weights are calculated using a taxonomy similarity measure (Rada et al., 1989):

weight = |α ∩ β| / |α ∪ β|

where α is the set of all the parents of a concept, including the concept itself, and β is the set of all the parents of its immediate higher-level concept, including that concept. In the branch shown in Figure 3, the weights grow from 1/2 for the edge leaving the root concept (entity) to 6/7 for the deepest edge in the example (territorial division to country).

4.2 Concept Clustering and Subtheme Identification

The following step consists of clustering the WordNet concepts in the document graph. The aim is to construct sets of concepts that are closely related in meaning. We presume that each set represents a subtheme in the document and that the most central concepts in the cluster give the necessary information related to its subtheme. We hypothesize that the document graph is an instance of a scale-free network (Barabasi and Albert, 1999). These networks present a particular type of node that is highly connected to other nodes in the network (hub nodes), while the remaining nodes are sparsely connected. Following (Yoo et al., 2007), we introduce the salience of a vertex (vi) as the sum of the weights of the edges that have the given vertex as source or target (1).

salience(vi) = ∑ weight(ej), for every edge ej such that ∃ vk with ej connecting (vi, vk)    (1)

Within the set of vertices, the n vertices with the highest salience (hub vertices) are selected and iteratively grouped into Hub Vertex Sets (HVS). The HVS are sets of vertices which are strongly related to one another, and constitute the centroids of the clusters to be constructed. The remaining vertices (those not included in any HVS) are assigned to the cluster to which they are most connected. This is again an iterative process that progressively adjusts the HVS and the assigned vertices.
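The sketch below gives a much simplified, single-pass rendering of this step over a weighted networkx graph: the salience follows (1), but the hub grouping and vertex assignment are our own non-iterative simplification of the procedure described above.

```python
# Simplified concept clustering: compute vertex salience, take the top hub_ratio
# most salient vertices as hubs, group connected hubs into Hub Vertex Sets (HVS)
# and attach every remaining vertex to the cluster it shares the most weight with.
import networkx as nx

def cluster_concepts(graph, hub_ratio=0.02):
    salience = {v: sum(d.get("weight", 1.0) for _, _, d in graph.edges(v, data=True))
                for v in graph.nodes()}
    n_hubs = max(1, int(hub_ratio * graph.number_of_nodes()))
    hubs = set(sorted(salience, key=salience.get, reverse=True)[:n_hubs])

    # Connected hub vertices form the Hub Vertex Sets (centroids of the clusters).
    hvs = [set(c) for c in nx.connected_components(graph.subgraph(hubs))]

    clusters = [set(h) for h in hvs]
    for v in graph.nodes():
        if v in hubs:
            continue
        def connectivity(cluster):
            return sum(graph[v][u].get("weight", 1.0) for u in cluster if graph.has_edge(v, u))
        best = max(range(len(clusters)), key=lambda i: connectivity(clusters[i]))
        clusters[best].add(v)
    return hvs, clusters
```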


4.3 Sentence Selection

Once the concept clusters have been created, each sentence is assigned to a cluster. Thus, it is necessary to define a similarity measure between a cluster and a sentence graph. As the two representations are quite different in size, traditional graph similarity metrics (e.g. the edit distance) are not suitable, and therefore a vote mechanism, adapted from (Yoo et al., 2007), is used (2). Each vertex (vk) of a sentence (Sj) gives each cluster (Ci) a different number of votes (wk,j) depending on whether the vertex belongs to the HVS of the cluster or not.

similarity(Ci, Sj) = ∑ wk,j, summing over all vertices vk ∈ Sj    (2)

where:
  vk ∉ Ci ⇒ wk,j = 0
  vk ∈ HVS(Ci) ⇒ wk,j = 1.0
  vk ∈ Ci ∧ vk ∉ HVS(Ci) ⇒ wk,j = 0.5

Finally, the last step involves the selection of the most significant sentences for the summary, based on the similarity between sentences and clusters as defined in expression (2). The number of sentences to be selected (N) depends on the desired compression rate. Three different heuristics have been investigated:

• Heuristic 1: For each cluster, the top ni sentences are selected, where ni is proportional to the size of the cluster.
• Heuristic 2: We accept the hypothesis that the cluster with the most concepts represents the main theme in the document, and select the top N sentences from this cluster.
• Heuristic 3: We compute a single score for each sentence (3), as the sum of its similarity to each cluster adjusted by the cluster sizes, and select the N sentences with the highest scores.

score(Sj) = ∑ similarity(Ci, Sj) / |Ci|, summing over all clusters Ci    (3)
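A hedged sketch of the vote mechanism (2) and the Heuristic 3 score (3); the data structures (sets of WordNet concept names per sentence, and the hvs/clusters pairs produced by the clustering sketch above) are our own choices.

```python
# Vote-based cluster/sentence similarity (expression 2) and the per-sentence
# score used by Heuristic 3 (expression 3).
def similarity(cluster, hub_set, sentence_concepts):
    score = 0.0
    for concept in sentence_concepts:
        if concept in hub_set:
            score += 1.0          # vertex belongs to the HVS of the cluster
        elif concept in cluster:
            score += 0.5          # vertex belongs to the cluster but not to its HVS
    return score

def heuristic3_scores(clusters, hvs, sentences):
    """One score per sentence: sum of its cluster similarities, adjusted by cluster size."""
    return [sum(similarity(c, h, s) / len(c) for c, h in zip(clusters, hvs))
            for s in sentences]

def select_sentences(sentences, scores, n):
    """Pick the n highest-scoring sentences, keeping their original document order."""
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]
```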

5. EVALUATION

Our main objective is to determine the optimal values for the different parameters involved in our summarization algorithm. To that end, we have studied the following research questions:

1. Which of the three heuristics above produces the best summaries?
2. What percentage of vertices should be considered as hub vertices by the clustering method?
3. Is it better to consider both types of relations between concepts (hypernymy and similarity) or just the hypernymy relation?
4. If the similarity relation is taken into account, what similarity threshold should be used?


5.1 Experimental Setup

As regards the evaluation process, since ROUGE (Lin, 2004) is the most common metric for the automatic evaluation of summarization, we have performed the evaluation by computing the ROUGE-1, ROUGE-2, ROUGE-L and ROUGE-W-1.2 recall metrics. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) includes several measures to automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans. The measures count the number of overlapping units, such as n-grams, word sequences and word pairs, between the computer-generated summary to be evaluated and the ideal summaries created by humans. In particular, ROUGE-1 evaluates term co-occurrence between the peer and model summaries, ROUGE-2 evaluates bi-gram co-occurrence, and ROUGE-L computes the union of the longest common subsequences (LCS) between the candidate and the model summary. Finally, ROUGE-W-1.2 is a refined version of ROUGE-L that takes into account the presence of consecutive matches.

In order to answer the questions raised, experiments have been performed on the collection of news items supplied by the National Institute of Standards and Technology (NIST) for the evaluation task of the Document Understanding Conferences 2002. The collection is composed of 567 news articles in English. Each document comes with one or more model summaries manually created by humans. Since the news items have been selected from different sections of different newspapers (e.g. the Financial Times and the Associated Press), the topics covered in the collection are diverse. The length of the documents varies considerably, and so does the compression ratio of the model summaries. For the parameter evaluation at hand, only ten news items have been considered. The compression rate has been set to 30%.
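The experiments reported below were run with the ROUGE toolkit itself. Purely as an illustration of the kind of comparison involved, the following sketch uses the third-party rouge-score Python package, which reimplements ROUGE-1, ROUGE-2 and ROUGE-L (it does not provide ROUGE-W-1.2, and its scores are not guaranteed to match the original implementation).

```python
# Illustrative recall-oriented ROUGE comparison between a system summary and a
# human model summary (pip install rouge-score). The summary strings are dummies.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

model_summary = "Hurricane Gilbert swept toward the Dominican Republic Sunday ..."
system_summary = "The Civil Defense alerted the south coast as Gilbert approached ..."

scores = scorer.score(model_summary, system_summary)
for name, result in scores.items():
    print(name, "recall =", round(result.recall, 5))
```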

5.2 First Experiment

The first group of experiments is directed at determining the best of the three heuristics for sentence selection proposed in Section 4.3, along with the percentage of hub vertices for the clustering method, as explained in Section 4.2. For these experiments, only the hypernymy relation has been used. As shown in Table 1, the third heuristic presents slightly better results than the other two when the percentage of hub vertices is set to 2 percent. Nonetheless, the differences between the three heuristics are not significant, so, a priori, we cannot make a decision on the best heuristic. As far as the number of hub vertices is concerned, the experiments have shown that the best setting for this parameter is 2 percent for all heuristics.


Table 1. Heuristic and hub vertices percentage evaluation

                           Average R-1   Average R-2   Average R-L   Average R-W-1.2
Heuristic 1   1 percent    0,69324       0,33723       0,65202       0,24847
              2 percent    0,71474       0,34520       0,69115       0,26225
              5 percent    0,67814       0,28308       0,63836       0,23646
              10 percent   0,67201       0,27093       0,62717       0,22283
Heuristic 2   1 percent    0,71446       0,33185       0,67367       0,25384
              2 percent    0,72487       0,34438       0,68040       0,25810
              5 percent    0,70358       0,30756       0,65924       0,24488
              10 percent   0,72449       0,32887       0,67966       0,25547
Heuristic 3   1 percent    0,72056       0,34105       0,68058       0,25727
              2 percent    0,72755       0,34438       0,68308       0,25886
              5 percent    0,70560       0,31164       0,66377       0,24612
              10 percent   0,71273       0,31980       0,66888       0,25162

5.3 Second Experiment

The aim of the second group of tests is to find out whether it is better to construct the document graph using just the hypernymy relation between concepts or the hypernymy and similarity relations together (see Section 4.1). For these experiments, the percentage of hub vertices has been set to 2 percent, and the similarity threshold has been temporarily set to 0,2. Table 2 shows that using both relations improves the summary evaluation and corroborates that the third heuristic is the most effective.

Table 2. Semantic relations evaluation

                            Average R-1   Average R-2   Average R-L   Average R-W-1.2
Heuristic 1   Hypernymy     0,73474       0,34520       0,69115       0,26225
              Hyp. & Sim.   0,72736       0,34230       0,68558       0,25681
Heuristic 2   Hypernymy     0,72487       0,34438       0,68040       0,25810
              Hyp. & Sim.   0,72920       0,33949       0,68463       0,25664
Heuristic 3   Hypernymy     0,72755       0,34438       0,68308       0,25886
              Hyp. & Sim.   0,73118       0,32941       0,67838       0,25323

5.4 Third Experiment

Next, the similarity threshold for the WordNet Similarity algorithm must be determined. Table 3 shows the comparison for different thresholds when the third heuristic is used. According to these results, a value of 0,2 yields the best outcome.


Table 3. Similarity threshold evaluation

Threshold   Average R-1   Average R-2   Average R-L   Average R-W-1.2
0.01        0,71145       0,31381       0,66314       0,24872
0.05        0,71470       0,32736       0,67565       0,25250
0.1         0,71953       0,32573       0,67578       0,25477
0.2         0,73118       0,32941       0,67838       0,25323
0.5         0,71058       0,31786       0,66690       0,24892

5.5 Fourth Experiment

Finally, once the best parametrization has been established, in order to evaluate our method we compute a lower bound using a baseline summary constructed by including the first sentences of the document (also known in the literature as the lead baseline). The best summary has been constructed by running our method with the best parameter configuration as determined in the experiments above.

Table 4. Comparison with a lead baseline

                     Average R-1   Average R-2   Average R-L   Average R-W-1.2
Lead Baseline        0,59436       0,18826       0,55522       0,20488
Best Configuration   0,73118       0,32941       0,67838       0,25323

Table 4 shows that the performance of this method is clearly better than the baseline. It must be taken into account that the positional heuristic used in the baseline is quite a pertinent heuristic for the summarization of news articles, where the most important information is usually concentrated in the first one or two sentences. Nonetheless, when dealing with very long documents (such as scientific papers) this heuristic is not as appropriate, and the difference with more sophisticated methods becomes more evident.
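As a small illustration (our own sketch, not the evaluation code used in the paper), a lead baseline at a given compression rate can be built as follows; the sentence splitter is assumed to be NLTK's.

```python
# Lead baseline: keep the first sentences of the document up to the desired
# compression rate (30% in the experiments above). Assumes NLTK with punkt data.
from nltk.tokenize import sent_tokenize

def lead_baseline(document_text, compression_rate=0.30):
    sentences = sent_tokenize(document_text)
    n = max(1, round(compression_rate * len(sentences)))
    return " ".join(sentences[:n])
```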

6. CONCLUSIONS AND FUTURE WORK

In this paper we introduce a method for summarizing text documents. Even if the method is domain-independent, it has been applied to a specific type of documents: news. We represent the document as an ontology-enriched graph, using WordNet concepts and relations. In this way we obtain a richer representation than the one provided by traditional term-based models, which results in a considerable improvement in the evaluation and in the quality of the resulting summaries.

A preliminary evaluation of the method has been presented, over a subset of 10 news items from the Document Understanding Conferences. This evaluation has allowed us to determine the optimal values for the parameters involved in the summarization process (i.e. the set of semantic relations and the similarity threshold used to construct the document graph, and the percentage of hub vertices employed in the clustering method). Besides, the preliminary


results seem to indicate that the method proposed here is a good approach to text summarization. Another important contribution of the proposed method is the possibility of applying it to documents from different domains. It has been specifically designed to work in any domain with minor changes, as it only requires modifying the ontology and the disambiguation algorithm. The authors have previously tested this method in a completely different domain, the automatic summarization of biomedical scientific articles (Plaza et al., 2008), with promising results as well.

Nonetheless, we have identified several problems and some possible improvements. First, as our method extracts whole sentences, long ones have a higher probability of being selected, because they contain more concepts. An alternative would be to normalize the sentence scores by the number of concepts. Second, in order to evaluate the method more formally, a large-scale evaluation is under way on the 566 news articles from DUC 2002. As future work, we plan to compare these results with those reported by similar systems (e.g. LexRank). This will allow us to determine whether statistically significant differences exist between the algorithm presented in this paper and the most accepted methods in the area. Finally, we are working on an extension of the method to accomplish multi-document summarization, which will permit the generation of a single summary from a set of documents on the same topic. In addition, we are using a similar document representation for the retrieval of similar electronic health records using UMLS concept graphs (Plaza et al., 2010).

ACKNOWLEDGEMENTS

This research is funded by the Spanish Ministry of Science and Innovation (TIN2009-14659-C03-01). It has also been partially funded by UCM and CAM through the IVERNAO project (contract CCG08-UCM/TIC-4300), and by the Spanish Ministry of Science and Innovation through the FPU program.

REFERENCES

Banerjee, S. & Pedersen, T., 2002. An Adapted Lesk Algorithm for Word Sense Disambiguation using WordNet. In Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics, pp. 136-145.

Brandow, R. et al., 1995. Automatic Condensation of Electronic Publications by Sentence Selection. In Information Processing and Management, Vol. 31, No. 5, pp. 675-685.

Díaz, A. & Gervás, P., 2007. User-model based personalized summarization. In Information Processing and Management, Vol. 43, pp. 1715-1734.

Edmundson, H.P., 1969. New Methods in Automatic Extracting. In Journal of the Association for Computing Machinery, Vol. 16, No. 2, pp. 264-285.

Erkan, G. & Radev, D.R., 2004. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. In Journal of Artificial Intelligence Research (JAIR), Vol. 22, pp. 457-479.

Lin, C-Y., 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out, Post-Conference Workshop of ACL.


Luhn, H.P., 1958. The Automatic Creation of Literature Abstracts. In IBM Journal of Research and Development, Vol. 2, No. 2, pp. 159-165.

Mani, I. & Maybury, M.T., 1999. Advances in Automatic Text Summarization. The MIT Press, Cambridge, Massachusetts.

Mani, I., 2001. Automatic Summarization. John Benjamins Publishing Company, Amsterdam/Philadelphia.

Maña, M.J., Buenaga, M. & Gómez, J.M., 1999. Using and Evaluating User Directed Summaries to Improve Information Access. In Proceedings of the Third European Conference on Research and Advanced Technology for Digital Libraries (ECDL'99), Lecture Notes in Computer Science, Vol. 1696, pp. 198-214, Springer-Verlag.

McKeown, K.R. et al., 2002. Tracking and summarizing news on a daily basis with Columbia's Newsblaster. In Proceedings of the Second International Conference on Human Language Technology Research, pp. 280-285.

Mihalcea, R. & Tarau, P., 2004. TextRank: Bringing order into texts. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Barcelona, Spain, pp. 404-411.

Miller, G.A. et al., 1993. Introduction to WordNet: An On-Line Lexical Database. http://www.cogsci.princeton.edu/~wn/5papers.pdf, last accessed April 2009.

Patwardhan, S. et al., 2005. SenseRelate::TargetWord - A Generalized Framework for Word Sense Disambiguation. In Proceedings of the Twentieth National Conference on Artificial Intelligence (Intelligent Systems Demonstrations). Pittsburgh, PA, pp. 1692-1693.

Pedersen, T. et al., 2004. WordNet::Similarity - Measuring the Relatedness of Concepts. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations). Boston, MA, pp. 38-41.

Plaza, L. et al., 2008. Concept-graph based Biomedical Automatic Summarization using Ontologies. In TextGraphs-3: Graph-based Algorithms for Natural Language Processing, Coling 2008. Manchester, UK.

Plaza, L. & Díaz, A., 2010. Retrieval of Similar Electronic Health Records Using UMLS Concept Graphs. In Proceedings of the 15th International Conference on Applications of Natural Language to Information Systems. Cardiff, Wales, UK.

Rada, R. et al., 1989. Development and Application of a Metric on Semantic Nets. In IEEE Transactions on Systems, Man and Cybernetics, Vol. 19, No. 1, pp. 17-30.

Sparck-Jones, K., 1999. Automatic Summarizing: Factors and Directions. In Advances in Automatic Text Summarization, I. Mani and M.T. Maybury (Eds.). The MIT Press.

Yoo, I. et al., 2007. A coherent graph-based semantic clustering and summarization approach for biomedical literature and a new summarization evaluation method. In BMC Bioinformatics, Vol. 8, No. 9.
