
UNIVERSITY OF BIRMINGHAM

Automatic Documents Summarization Using Ontology-based Methodologies

Abdullah Bawakid

Thesis submitted for the degree of Doctor of Philosophy
School of Electronic, Electrical and Computer Engineering
College of Engineering and Physical Sciences
University of Birmingham

University of Birmingham Research Archive
e-theses repository

This unpublished thesis/dissertation is copyright of the author and/or third parties. The intellectual property rights of the author or third parties in respect of this work are as defined by the Copyright, Designs and Patents Act 1988 or as modified by any successor legislation. Any use made of information contained in this thesis/dissertation must be in accordance with that legislation and must be properly acknowledged. Further distribution or reproduction in any format is prohibited without the permission of the copyright holder.

ABSTRACT

Automatic summarization is the process of creating a condensed form of one or more documents. When humans summarize a document, they usually read the text first, understand it, and then attempt to write a summary. These processes require at least a basic level of background knowledge on the part of the reader; at the very least, the human must understand the natural language the text is written in. In this thesis, an attempt is made to bridge this gap in machine understanding by proposing a framework backed by knowledge repositories that were constructed by humans and contain real human concepts.

I use WordNet, a hierarchically structured repository created by linguistic experts and rich in explicitly defined lexical relations. With WordNet, algorithms for computing the semantic similarity between terms are proposed and implemented. The relationships between terms, and between composites of terms, are quantified and weighted through new algorithms that allow terms, phrases and sentences to be grouped according to the semantic meaning they carry. The evaluation results show that these algorithms are especially useful when applied to Automatic Documents Summarization. Several novel methods are also introduced to enhance diversity and reduce redundancy in the generated summaries.

I also use Wikipedia, the largest encyclopaedia to date. Because of its openness and structure, three problems had to be handled: extracting knowledge and features from Wikipedia, enriching the representation of text documents with the extracted features, and using them in the application of Automatic Summarization. First, I show how the structure and content of Wikipedia can be used to build vectors representing human concepts. Second, I illustrate how these vectors can be mapped to text documents and how the semantic relatedness between text fragments is computed. Third, I describe a summarizer I built which utilizes the features extracted from Wikipedia and present its performance. I also demonstrate how the Wikipedia-extracted features can be adapted to applications other than Automatic Summarization, such as Word Sense Disambiguation and Automatic Classification. A description of the implemented system and the algorithms used is provided in this thesis, along with an evaluation.
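As an illustration of the kind of WordNet-based term similarity the thesis builds on, the following is a minimal sketch using the NLTK interface to WordNet. It is not the algorithm proposed in this thesis: the choice of the Wu-Palmer measure, the restriction to noun senses, and the example terms are assumptions made purely for demonstration.

# Illustrative sketch only, not the thesis's algorithm.
# Requires NLTK and the WordNet corpus:
#   pip install nltk
#   python -m nltk.downloader wordnet
from nltk.corpus import wordnet as wn

def term_similarity(term1, term2):
    """Return the highest Wu-Palmer similarity over all noun-sense pairs,
    or 0.0 when either term is not covered by WordNet."""
    best = 0.0
    for s1 in wn.synsets(term1, pos=wn.NOUN):
        for s2 in wn.synsets(term2, pos=wn.NOUN):
            score = s1.wup_similarity(s2)  # based on depth of the senses' common ancestor
            if score is not None and score > best:
                best = score
    return best

print(term_similarity("car", "automobile"))  # synonyms share a synset -> 1.0
print(term_similarity("car", "banana"))      # semantically distant -> noticeably lower

Pairwise term scores of this kind can in principle be aggregated into phrase- and sentence-level relatedness measures, which is the general direction the thesis pursues in Chapter 3.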
DECLARATION

I confirm that this is my own work and that the use of all material from other sources has been properly and fully acknowledged.

Abdullah Bawakid

ACKNOWLEDGMENTS

I am extremely grateful to my supervisor, Dr. Mourad Oussalah, for investing so much time in me and for the guidance he afforded me during my research. He gave me plenty of freedom to explore the directions I was interested in, and he has continued to be encouraging, patient and insightful. I also thank my backup supervisor, Dr. Chris Baber, for his support and commitment. I have received partial funding from the University of Birmingham; thank you for this generous support. I am very grateful to my parents, wife and family for their faith, support and patience.

TABLE OF CONTENTS

Chapter 1 ......................................................................... 1
Introduction ..................................................................... 1
  1.1 Motivation ................................................................ 1
  1.2 Aims and Contributions of the Thesis ..................... 10
  1.3 Structure of the Document ...................................... 14
Chapter 2 ....................................................................... 19
Background and Related Work ...................................... 19
  2.1 Text Summarization ................................................ 19
  2.2 Automatic Documents Summarization Systems ..... 22
  2.3 Related Work .......................................................... 24
    2.3.1 Surface-level Features ........................................ 24
      2.3.1.1 Word Frequency ............................................ 25
      2.3.1.2 Position .......................................................... 26
      2.3.1.3 Cue words and phrases .................................. 26
      2.3.1.4 Overlap with title or query ............................ 27
    2.3.2 Machine Learning Approaches .......................... 27
      2.3.2.1 Naïve-Bayes Methods ................................... 27
      2.3.2.2 Decision Trees ............................................... 28
      2.3.2.3 Hidden Markov Models ................................ 28
      2.3.2.4 Log-Linear Models ........................................ 29
      2.3.2.5 Neural Networks ........................................... 29
    2.3.3 Natural Language Analysis Methods ................. 30
      2.3.3.1 Entity Level ................................................... 30
      2.3.3.2 Discourse Level ............................................. 32
    2.3.4 Abstraction ......................................................... 33
    2.3.5 Topic-driven Summarization .............................. 35
    2.3.6 Graph-based Theories ........................................ 36
    2.3.7 LSA Methods ..................................................... 37
    2.3.8 Task-specific Approaches .................................. 37
  2.4 Examples of Automatic Summarizers ..................... 40
    2.4.1 MEAD ............................................................... 41
    2.4.2 Newsblaster ........................................................ 41
    2.4.3 QCS ................................................................... 42
    2.4.4 MASC ................................................................ 42
    2.4.5 Condensr ............................................................ 43
    2.4.6 Open Text Summarizer ...................................... 43
    2.4.7 Commercial Summarizers .................................. 43
  2.5 Limitations of Current Approaches ......................... 44
  2.6 Text Summaries Evaluation ..................................... 47
    2.6.1 Intrinsic Evaluations ........................................... 47
    2.6.2 Extrinsic Evaluations .......................................... 50
  2.7 Conclusion ............................................................... 51
Chapter 3 ....................................................................... 53
Features Generation and Selection ................................. 53
  3.1 Overview ................................................................. 53
  3.2 The Need for a Suitable Repository ........................ 55
  3.3 Using a Hierarchically Structured Repository ......... 59
    3.3.1 WordNet ............................................................ 59
    3.3.2 Semantic Similarity ............................................ 61
  3.4 Using Open-World Knowledge ............................... 64
    3.4.1 Wikipedia ..............................................