Using Semantic Graphs and Word Sense Disambiguation Techniques to Improve Text Summarization

Uso de Grafos Semánticos y de Técnicas de Desambiguación en la Generación Automática de Resúmenes

Laura Plaza, Alberto Díaz
Universidad Complutense de Madrid
Prof. José García Santesmases, s/n, 28040 Madrid

Plaza, Laura; Díaz, Alberto. Using Semantic Graphs and Word Sense Disambiguation Techniques to Improve Text Summarization. Procesamiento del Lenguaje Natural, núm. 47, septiembre, 2011, pp. 97-105. ISSN: 1135-5948. Sociedad Española para el Procesamiento del Lenguaje Natural, Jaén, España. Received 23-04-2011, accepted 24-05-2011.
Available in: http://www.redalyc.org/articulo.oa?id=515751747010

Resumen (translated from Spanish): This paper presents a method for automatic summarization based on semantic graphs. The system uses WordNet concepts and relations to build a graph that represents the document, together with a connectivity-based clustering algorithm that discovers the different topics covered in it. Sentences are selected for the summary according to the presence in them of the most representative concepts of the document. The experiments show that the proposed approach obtains significantly better results than other systems evaluated under the same experimental conditions. Moreover, the system can easily be adapted to documents from different domains, simply by modifying the knowledge base and the method for identifying concepts in the text. Finally, this work also studies the effect of lexical ambiguity on summarization.

Palabras clave (translated from Spanish): automatic summarization, semantic graphs, lexical and semantic disambiguation, concept clustering

Abstract: This paper presents a semantic graph-based method for extractive summarization. The summarizer uses WordNet concepts and relations to produce a semantic graph that represents the document, and a degree-based clustering algorithm is used to discover different themes or topics within the text. The selection of sentences for the summary is based on the presence in them of the most representative concepts for each topic. The method has proven to be an efficient approach to the identification of salient concepts and topics in free text. In a test on the DUC data for single document summarization, our system achieves significantly better results than previous approaches based on terms and mere syntactic information. Besides, the system can be easily ported to other domains, as it only requires modifying the knowledge base and the method for concept annotation. In addition, we address the problem of word ambiguity in semantic approaches to automatic summarization.

Keywords: Automatic summarization, semantic graphs, word sense disambiguation, concept clustering

1. Introduction

The problem of summarizing textual documents has been extensively studied during the past half century. Common approaches include training different machine learning models; computing some simple heuristic rules (such as sentence position or cue words); or counting the frequency of the words in the document to identify central terms. However, these approaches treat words as independent entities that do not interact with other words in their context (the sentence, or even the whole document), which is not the way a human thinks when writing a summary.

Recently, graph-based methods have attracted the attention of the NLP community. These methods have been applied to a wide range of tasks, such as word sense disambiguation (Agirre and Soroa, 2009) or question answering (Celikyilmaz, Thint, and Huang, 2009). Regarding summarization, graph-based methods have typically tried to find salient sentences in the text according to their similarity to other sentences, computing this similarity as the cosine similarity between their term vectors (Erkan and Radev, 2004). However, few approaches have dealt with the text at the semantic level, and they rarely explore more complex representations based on concepts and semantic relations.

In this paper, we examine the use and strength of concept graphs to identify the central topics covered in a text, as a preliminary step to ranking the sentences for the summary. To this aim, we construct a graph where each sentence is represented by the concepts in WordNet that are found in it, and where the different concepts are interconnected by a number of semantic relations. We identify salient concepts in this graph, based on the detection of hub or core vertices. These concepts constitute the centroids of the clusters that delimit the different topics in the document. The ranking is based on the presence in the sentences of the most representative concepts for each topic.

Our graph-based method has been evaluated on the Document Understanding Conferences (DUC) 2002 data (http://duc.nist.gov/). We show that our method performs significantly better than previously published approaches. This work also deals with the problem of word ambiguity, which inevitably arises when trying to map the text to WordNet concepts, and shows that applying a word sense disambiguation algorithm benefits text summarization.

2. Related Work

Text summarization is the process of automatically creating a compacted version of a given text. Content reduction can be addressed by selection and/or by generalization of what is important in the source (Sparck-Jones, 1999). This definition suggests that two generic groups of summarization methods exist: those which generate extracts and those which generate abstracts. In this paper, we focus on extractive methods; that is, those which select sentences from the original document to produce the summary.

Traditional summarization systems typically rank the sentences using simple heuristic features such as the sentence position and the presence of certain cue words or terms that are also found in the headings of the document (Edmundson, 1969; Brandow, Mitze, and Rau, 1995). These attributes are usually weighted and combined using a linear function that assigns a single score to each sentence in the document. More advanced techniques use graph-based methods to rank textual units for extraction. This section mainly reviews previous work related to these techniques, because the method proposed here clearly falls under this category.

Graph-based methods usually represent the documents as graphs, where the nodes correspond to text units (such as words, phrases, sentences or even paragraphs), and the edges represent cohesion relationships between these units, or even similarity measures between them (e.g. the Euclidean distance). Once the graph for the document is created, the salient nodes are located in the graph and used to extract the corresponding units for the summary.

LexRank (Erkan and Radev, 2004) is a well-known example of a centroid-based method for multi-document summarization. It assumes a fully connected and undirected graph, with sentences as nodes and similarity between them as edges. It represents the sentences in each document by their TF-IDF vectors and computes the sentence connectivity using the cosine similarity. A very similar method is proposed by Mihalcea and Tarau (2004) to perform mono-document summarization. As in LexRank, the nodes represent sentences and the edges represent the similarity between them, measured as a function of their content overlap. More recently, Litvak and Last (2008) proposed an approach that uses a graph-based syntactic representation for keyword extraction, which can be used as a first step in summarization. However, most of these systems ignore the latent semantic associations that exist between the words, both intra- and inter-sentence (e.g. synonymy, hypernymy or co-occurrence relations).

Consider the paragraph shown in Figure 1. Approaches based on term frequencies and mere syntactic representations do not succeed in determining that the terms hurricane and cyclone are synonyms, and that both of them are very close in meaning to the noun phrase tropical storm. They do not detect that Puerto Rico, Virgin Islands and Dominican Republic are hyponyms of the broader concept country, and that wind, rain and high sea are types of atmospheric conditions usually produced by hurricanes.

Figure 1: A snippet of a news item that illustrates the need to identify semantic relations between terms

This problem can be partially solved by dealing with concepts instead of terms, and semantic relations instead of lexical or syn[...]

[...] order to prepare the document for the subsequent steps. Irrelevant sections in the document (such as authors, source or publication date) are removed. Generic and high frequency terms are also removed, using a stop list and the inverse document frequency (Sparck-Jones, 1972). The headline/title and body sections in the document are separated. Finally, the text in the body section is split into sentences and the terms are tagged with their part of speech.

Next, each sentence is translated to the appropriate concepts in WordNet, using the WordNet::SenseRelate (WNSR) package (Patwardhan, Banerjee, and Pedersen, 2005).
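As an illustration only, the term-filtering and sentence-splitting part of the preprocessing described above might be sketched in Python as follows. This is not the authors' implementation: the stop list, the smoothed IDF formula, the `min_idf` threshold, the toy background corpus and all function names are assumptions made for the example. Title/body separation, part-of-speech tagging and the mapping to WordNet concepts are left out.

```python
import math
import re

# Hypothetical stop list; the paper does not give the list it uses.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "by"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", sentence.lower())


def smoothed_idf(term, corpus):
    """Smoothed inverse document frequency: log((N + 1) / (df + 1)).

    `corpus` is a list of token sets, one per background document.
    The smoothing is an assumption; it keeps unseen terms well defined.
    """
    df = sum(1 for doc in corpus if term in doc)
    return math.log((len(corpus) + 1) / (df + 1))


def preprocess(body, corpus, min_idf):
    """Split `body` into sentences and drop generic terms from each one.

    A term is dropped if it is in the stop list or if its smoothed IDF
    over `corpus` is below `min_idf` (a hypothetical threshold).
    """
    result = []
    for sentence in split_sentences(body):
        kept = [
            tok
            for tok in tokenize(sentence)
            if tok not in STOP_WORDS and smoothed_idf(tok, corpus) >= min_idf
        ]
        result.append(kept)
    return result


# Toy background corpus (invented for the example): "said" occurs in every
# document, so its smoothed IDF is 0 and it is filtered out.
EXAMPLE_CORPUS = [
    {"hurricane", "hit", "puerto", "rico", "said"},
    {"officials", "said", "rain", "continue"},
    {"cyclone", "said", "wind", "sea"},
]
EXAMPLE_BODY = "The hurricane hit Puerto Rico. Officials said rain is expected."

if __name__ == "__main__":
    for terms in preprocess(EXAMPLE_BODY, EXAMPLE_CORPUS, min_idf=0.1):
        print(terms)
```

In a fuller pipeline, the surviving terms of each sentence would then be tagged with their part of speech and passed to a word sense disambiguation step before being mapped to WordNet concepts.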
