Improving Summarization of Biomedical Documents Using Word Sense Disambiguation
Total Page:16
File Type:pdf, Size:1020Kb
Improving Summarization of Biomedical Documents using Word Sense Disambiguation Laura Plaza† Mark Stevenson∗ Alberto D´ıaz† [email protected] [email protected] [email protected] † Universidad Complutense de Madrid, C/Prof. Jose´ Garc´ıa Santesmases, 28040 Madrid, Spain ∗ University of Sheffield, Regent Court, 211 Portobello St., Sheffield, S1 4DP, UK Abstract al., 2004). These approaches can represent seman- tic associations between the words and terms in the We describe a concept-based summariza- document (i.e. synonymy, hypernymy, homonymy tion system for biomedical documents and or co-occurrence) and use this information to im- show that its performance can be improved prove the quality of the summaries. In the biomed- using Word Sense Disambiguation. The ical domain the Unified Medical Language Sys- system represents the documents as graphs tem (UMLS) (Nelson et al., 2002) has proved to formed from concepts and relations from be a useful knowledge source for summarization the UMLS. A degree-based clustering al- (Fiszman et al., 2004; Reeve et al., 2007; Plaza et gorithm is applied to these graphs to dis- al., 2008). In order to access the information con- cover different themes or topics within tained in the UMLS, the vocabulary of the doc- the document. To create the graphs, the ument being summarized has to be mapped onto MetaMap program is used to map the it. However, ambiguity is common in biomedi- text onto concepts in the UMLS Metathe- cal documents (Weeber et al., 2001). For exam- saurus. This paper shows that applying a ple, the string “cold” is associated with seven pos- graph-based Word Sense Disambiguation sible meanings in the UMLS Metathesuarus in- algorithm to the output of MetaMap im- cluding “common cold”, “cold sensation” , “cold proves the quality of the summaries that temperature” and “Chronic Obstructive Airway are generated. Disease”. The majority of summarization sys- tems in the biomedical domain rely on MetaMap 1 Introduction (Aronson, 2001) to map the text onto concepts Extractive text summarization can be defined as from the UMLS Metathesaurus (Fiszman et al., the process of determining salient sentences in a 2004; Reeve et al., 2007). However, MetaMap fre- text. These sentences are expected to condense quently fails to identify a unique mapping and, as the relevant information regarding the main topic a result, various concepts with the same score are covered in the text. Automatic summarization of returned. For instance, for the phrase “tissues are biomedical texts may benefit both health-care ser- often cold” MetaMap returns three equally scored vices and biomedical research (Reeve et al., 2007; concepts for the word ‘‘cold”: “common cold”, Hunter and Cohen, 2006). Providing physicians “cold sensation” and ”cold temperature”. with summaries of their patient records can help The purpose of this paper is to study the ef- to reduce the diagnosis time. Researchers can use fect of lexical ambiguity in the knowledge source summaries to quickly determine whether a docu- on semantic approaches to biomedical summariza- ment is of interest without having to read it all. tion. To this end, the paper describes a concept- Summarization systems usually work with a based summarization system for biomedical doc- representation of the document consisting of in- uments that uses the UMLS as an external knowl- formation that can be directly extracted from the edge source. To address the word ambiguity prob- document itself (Erkan and Radev, 2004; Mihalcea lem, we have adapted an existing WSD system and Tarau, 2004). However, recent studies have (Agirre and Soroa, 2009) to assign concepts from demonstrated the benefit of summarization based the UMLS. The system is applied to the summa- on richer representations that make use of external rization of 150 biomedical scientific articles from knowledge sources (Plaza et al., 2008; Fiszman et the BioMed Central corpus and it is found that 55 Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, ACL 2010, pages 55–63, Uppsala, Sweden, 15 July 2010. c 2010 Association for Computational Linguistics WSD improves the quality of the summaries. This to represent the document at a conceptual level. In paper is, to our knowledge, the first to apply WSD particular, in the biomedical domain Reeve et al. to the summarization of biomedical documents (2007) adapt the lexical chaining approach (Barzi- and also demonstrates that this leads to an im- lay and Elhadad, 1997) to work with UMLS con- provement in performance. cepts, using the MetaMap Transfer Tool to anno- The next section describes related work on sum- tate these concepts. Yoo et al. (2007) represent a marization and WSD. Section 3 introduces the corpus of documents as a graph, where the nodes UMLS resources used in the WSD and sum- are the MeSH descriptors found in the corpus, and marization systems. Section 4 describes our the edges represent hypernymy and co-occurrence concept-based summarization algorithm. Section relations between them. They cluster the MeSH 5 presents a graph-based WSD algorithm which concepts in the corpus to identify sets of docu- has been adapted to assign concepts from the ments dealing with the same topic and then gen- UMLS. Section 6 describes the experiments car- erate a summary from each document cluster. ried out to evaluate the impact of WSD and dis- Word sense disambiguation attempts to solve cusses the results. The final section provides lexical ambiguities by identifying the correct concluding remarks and suggests future lines of meaning of a word based on its context. Super- work. vised approaches have been shown to perform bet- ter than unsupervised ones (Agirre and Edmonds, 2 Related work 2006) but need large amounts of manually-tagged data, which are often unavailable or impractical to Summarization has been an active area within create. Knowledge-based approaches are a good NLP research since the 1950s and a variety of ap- alternative that do not require manually-tagged proaches have been proposed (Mani, 2001; Afan- data. tenos et al., 2005). Our focus is on graph-based summarization methods. Graph-based approaches Graph-based methods have recently been shown typically represent the document as a graph, where to be an effective approach for knowledge-based the nodes represent text units (i.e. words, sen- WSD. They typically build a graph for the text in tences or paragraphs), and the links represent co- which the nodes represent all possible senses of hesion relations or similarity measures between the words and the edges represent different kinds these units. The best-known work in the area is of relations between them (e.g. lexico-semantic, LexRank (Erkan and Radev, 2004). It assumes a co-occurrence). Some algorithm for analyzing fully connected and undirected graph, where each these graphs is then applied from which a rank- node corresponds to a sentence, represented by ing of the senses of each word in the context is its TF-IDF vector, and the edges are labeled with obtained and the highest-ranking one is chosen the cosine similarity between the sentences. Mi- (Mihalcea and Tarau, 2004; Navigli and Velardi, halcea and Tarau (2004) present a similar method 2005; Agirre and Soroa, 2009). These methods where the similarity among sentences is measured find globally optimal solutions and are suitable for in terms of word overlaps. disambiguating all words in a text. However, methods based on term frequencies One such method is Personalized PageRank and syntactic representations do not exploit the se- (Agirre and Soroa, 2009) which makes use of mantic relations among the words in the text (i.e. the PageRank algorithm used by internet search synonymy, homonymy or co-occurrence). They engines (Brin and Page, 1998). PageRank as- cannot realize, for instance, that the phrases my- signs weight to each node in a graph by analyz- ocardial infarction and heart attack refer to the ing its structure and prefers ones that are linked to same concepts, or that pneumococcal pneumonia by other nodes that are highly weighted. Agirre and mycoplasma pneumonia are two similar dis- and Soroa (2009) used WordNet as the lexical eases that differ in the type of bacteria that causes knowledge base and creates graphs using the en- them. This problem can be partially solved by tire WordNet hierarchy. The ambiguous words in dealing with concepts and semantic relations from the document are added as nodes to this graph and domain-specific resources, rather than terms and directed links are created from them to each of lexical or syntactic relations. Consequently, some their possible meanings. These nodes are assigned recent approaches have adapted existing methods weight in the graph and the PageRank algorithm is 56 applied to distribute this information through the ple, the MRREL table states that C0009443 ‘Com- graph. The meaning of each word with the high- mon Cold’ and C0027442 ‘Nasopharynx’ are con- est weight is chosen. We refer to this approach nected via the RO relation. as ppr. It is efficient since it allows all ambigu- The MRHIER table in the Metathesaurus lists ous words in a document to be disambiguated si- the hierarchies in which each CUI appears, and multaneously using the whole lexical knowledge presents the whole path to the top or root of base, but can be misled when two of the possible each hierarchy for the CUI. For example, the senses for an ambiguous word are related to each MRHIER table states that C0035243 ‘Respiratory other in WordNet since the PageRank algorithm Tract Infections’ is a parent of C0009443 ‘Com- assigns weight to these senses rather than transfer- mon Cold’. ring it to related words. Agirre and Soroa (2009) The Semantic Network consists of a set of cat- also describe a variant of the approach, referred egories (or semantic types) that provides a consis- to as “word to word” (ppr w2w), in which a sep- tent categorization of the concepts in the Metathe- arate graph is created for each ambiguous word.