
LEXICAL COHESION ANALYSIS FOR TOPIC SEGMENTATION, SUMMARIZATION AND KEYPHRASE EXTRACTION

A dissertation submitted to the Department of Computer Engineering and the Graduate School of Engineering and Science of Bilkent University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

By Gönenç Ercan
December, 2012

I certify that I have read this thesis and that in my opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy.

Prof. Dr. Fazlı Can (Supervisor)

I certify that I have read this thesis and that in my opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy.

Prof. Dr. İlyas Çiçekli (Co-supervisor)

I certify that I have read this thesis and that in my opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy.

Prof. Dr. Özgür Ulusoy

I certify that I have read this thesis and that in my opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy.

Prof. Dr. Ferda Nur Alpaslan

I certify that I have read this thesis and that in my opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy.

Assoc. Prof. Dr. Hakan Ferhatosmanoğlu

I certify that I have read this thesis and that in my opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy.

Assist. Prof. Dr. Seyit Koçberber

Approved for the Graduate School of Engineering and Science:

Prof. Dr. Levent Onural
Director of the Graduate School

ABSTRACT

LEXICAL COHESION ANALYSIS FOR TOPIC SEGMENTATION, SUMMARIZATION AND KEYPHRASE EXTRACTION

Gönenç Ercan
Ph.D. in Computer Engineering
Supervisor: Prof. Dr. Fazlı Can
December, 2012

When we express an idea or story, it is inevitable to use words that are semantically related to each other. When this phenomenon is exploited from the aspect of the words in the language, it is possible to infer the level of semantic relationship between words by observing their distribution and use in discourse. From the aspect of discourse, it is possible to model the structure of the document by observing the changes in lexical cohesion in order to attack high-level natural language processing tasks. In this research lexical cohesion is investigated from both of these aspects, by first building methods for measuring the semantic relatedness of word pairs and then using these methods in the tasks of topic segmentation, summarization and keyphrase extraction.

Measuring the semantic relatedness of words requires prior knowledge about the words. Two different knowledge bases are investigated in this research. The first knowledge base is a manually built network of semantic relationships, while the second relies on the distributional patterns observed in raw text corpora. In order to discover which method is effective in lexical cohesion analysis, a comprehensive comparison of state-of-the-art methods in semantic relatedness is made.

For topic segmentation, different methods using some form of lexical cohesion are present in the literature. While some of these confine the relationships to word repetition or to strong semantic relationships like synonymy, no other work uses semantic relatedness measures that can be calculated for any two word pairs in the vocabulary. Our experiments suggest that topic segmentation performance improves over methods using only classical relationships and word repetition. Furthermore, the experiments compare the performance of different semantic relatedness methods in a high-level task. The detected topic segments are used in summarization, and achieve better results compared to a lexical chains based method that uses WordNet.

Finally, the use of lexical cohesion analysis in keyphrase extraction is investigated. Previous research shows that keyphrases are useful tools in document retrieval and navigation. While these point to a relation between keyphrases and document retrieval performance, no other work uses this relationship to identify the keyphrases of a given document. We aim to establish a link between the problems of query performance prediction (QPP) and keyphrase extraction. To this end, features used in QPP are evaluated in keyphrase extraction using a Naive Bayes classifier. Our experiments indicate that these features improve the effectiveness of keyphrase extraction in documents of different length. More importantly, the commonly used features of frequency and first position in text perform poorly on shorter documents, whereas QPP features are more robust and achieve better results.

Keywords: Lexical Cohesion, Semantic Relatedness, Topic Segmentation, Summarization, Keyphrase Extraction.

ÖZET

LEXICAL COHESION ANALYSIS FOR TOPIC SEGMENTATION, SUMMARIZATION AND KEYPHRASE EXTRACTION

Gönenç Ercan
Ph.D. in Computer Engineering
Supervisor: Prof. Dr. Fazlı Can
December, 2012

When people express an idea or tell a story, they cannot avoid using words that are semantically related to each other. This phenomenon can be exploited from two different perspectives. From the perspective of words, it is possible to identify semantically related words by observing their statistical distribution and their use in discourse. From the perspective of discourse, it is possible to model the structure of a text by observing the changes in the semantic relationships between words, and to use this model in different natural language processing problems. In this research, lexical cohesion is investigated from both of these perspectives. First, lexical cohesion is used to measure the semantic relatedness of words; these word relationships are then used in the topic segmentation, summarization and keyphrase extraction problems.

Measuring the semantic relatedness of words requires a knowledge base. Two different knowledge bases are exploited in this research. The first is a semantic network in which the relationships between words are entered manually. The second method relies on the distributional patterns of words in a raw text corpus. Within the scope of this research, the relative performance of these methods is measured and a comprehensive analysis is carried out.

Different methods that use lexical cohesion for the topic segmentation problem exist in the literature. Some of these rely only on word repetition, while others exploit strong semantic relationships such as synonymy. However, the much more comprehensive semantic relatedness methods have not been used for this problem so far. Our experiments show that topic segmentation performance can be improved by using semantic relatedness. Furthermore, the experiments can be used to compare different semantic relatedness measures. The topic-segmented texts are used in automated summarization and achieve better results than lexical chain based methods.

Finally, the use of lexical cohesion analysis in keyphrase extraction is investigated. Previous research shows that keyphrases are useful tools for document retrieval and navigation. Although this research points to a relationship between keyphrases and document retrieval, no other work has used retrieval performance prediction to identify the keyphrases of a given document. In this research, the use of query performance prediction methods in keyphrase extraction is investigated. To this end, features used in query performance prediction are combined with a Naive Bayes classifier for keyphrase extraction. The experiments show that these features improve effectiveness on documents of different sizes. More importantly, in contrast to the commonly used phrase frequency and first occurrence position features, they achieve better results on short documents.

Keywords: Lexical cohesion, Semantic relatedness, Topic segmentation, Summarization, Keyphrase extraction.

Acknowledgement

I would like to thank Prof. İlyas Çiçekli, who has supervised and supported me since the day I started my graduate studies. I am indebted to Prof. Fazlı Can for all the encouragement and guidance he has provided me. Their support, expertise and invaluable recommendations were crucial in this long academic journey.

I would like to thank all my committee members for spending their time and effort to read and comment on my dissertation.

Many thanks to the former and current members of the Computer Engineering Department. It was both a pleasure and an honour to get to know and work with you all.

I would like to express my deepest gratitude to my family, Nurhayat-Sadık Ercan, İlkay-Görkem Ercan and Berrin Gözen, for their patience and support throughout this long journey. Finally, I thank and dedicate this dissertation to my wife Pınar, who has given me both the strength and the courage to take on this challenge.

Contents

1 Introduction
  1.1 Goals and Contributions
  1.2 Outline

2 Related Work
  2.1 Related Linguistic Theories
    2.1.1 Semiotics and Meaning
    2.1.2 Analytic View
    2.1.3 Generative Grammar and Meaning
  2.2 Linguistic Background
    2.2.1 Morphology
    2.2.2 Syntax
    2.2.3 Coherence
    2.2.4 Cohesion
    2.2.5 Lexical Cohesion
  2.3 Literature Survey
    2.3.1 Semantic Relatedness
    2.3.2 Topic Segmentation
    2.3.3 Summarization
    2.3.4 Keyphrase Extraction

3 Semantic Relatedness
  3.1 Semantic Relatedness Methods
    3.1.1 WordNet Based Semantic Relatedness Measures
    3.1.2 WordNet based Semantic Relatedness Functions
    3.1.3 Corpora Based Semantic Relatedness Measures
    3.1.4 Building the Vocabulary
    3.1.5 Building the Co-occurrence Matrix
    3.1.6 Dimension Reduction
    3.1.7 Similarity of Term Vectors
    3.1.8 Raw Text Corpora
  3.2 Intrinsic Evaluation of Semantic Relatedness Measures
    3.2.1 Word Pair Judgements
    3.2.2 Semantic Space Neighbors to WordNet Mapping
    3.2.3 Near Synonymy Questions
  3.3 Results and Model Selection
    3.3.1 Effect of Pruning the Vocabulary
    3.3.2 Effect of Dimension Reduction and Weighting Function
    3.3.3 Effect of Co-occurrence Window Size
    3.3.4 Effect of the Number of Dimensions Retained
    3.3.5 Comparison of Semantic Space and WordNet based Measures
    3.3.6 Comparison to the State-of-the-art Methods
    3.3.7 Semantic Spaces and Classical Relationship Types
  3.4 Discussion of the Results

4 Topic Segmentation
  4.1 Semantic Relatedness Based Topic Segmentation Algorithm
    4.1.1 Number of Topics is Known
    4.1.2 Number of Topics is Unknown
  4.2 Dataset and Evaluation
  4.3 Results
    4.3.1 Effect of Semantic Space Parameters
    4.3.2 Comparison with WordNet based Semantic Relatedness Functions
    4.3.3 Effectiveness of Semantic Relatedness based Topic Segmentation Algorithms

5 Automated Text Summarization
  5.1 Segment Salience and Sentence Extraction
  5.2 Corpus and Evaluation
  5.3 Results

6 Keyphrase Extraction
  6.1 Keyphrase Extraction using QPP Features
    6.1.1 Candidate Phrase List Creation
    6.1.2 Information Retrieval from Wikipedia
    6.1.3 QPP Measures
    6.1.4 Learning to Classify Keyphrases
  6.2 Corpus and Evaluation Metrics
  6.3 Results

7 Conclusion
  7.1 Future Work

List of Figures

1.1 Exemplary text with content words highlighted
1.2 The overview of all components developed in this dissertation, their relations and performed evaluations
2.1 Morphological analysis of the word sağlamlaştırmak
2.2 Syntax tree of the sentence "John hit the ball"
2.3 Two sentences having the same syntax tree, where the latter is not meaningful
2.4 Examples of coherence and lack of coherence
2.5 Lexical cohesion example in a discourse, where the content bearing words are highlighted
2.6 Example of Osgood's concepts and polar words differentiating the concepts
2.7 Text fragment to demonstrate coherence based techniques
2.8 Example discourse structure for the text in Figure 2.7
3.1 The excerpt of WordNet related to the concepts of colors
3.2 The excerpt of WordNet related to the concepts of country, Greece and Turkey
3.3 Filtered word stream and sliding window centered on the word ring
3.4 An example question from TOEFL synonymy questions where the correct answer is "tremendously"
3.5 Plot of the Word Association correlation values and WordNet Synonym Mapping recall value in Turkish language when ψ is varied
3.6 Plot of the Word Association correlation values and WordNet Synonym Mapping recall value in English language when ψ is varied
3.7 Plot of the Word Association correlation values and WordNet Synonym Mapping for English language, when window size is varied
3.8 Plot of the Word Association correlation values and WordNet Synonym Mapping recall values for Turkish language, when window size is varied
3.9 The effect of SVD reduction factor in Turkish language
3.10 The effect of SVD reduction factor in English language
3.11 Recall of WordNet Relationships versus k-nearest neighbours in Semantic Space
3.12 Comparison of Random Selection with Semantic Space Neighbours
3.13 Distribution of WordNet Relationships for the Nearest Neighbours in Semantic Space
4.1 Model of three consequent contexts
4.2 Algorithm calculating the sentence context scores based on a voting scheme
4.3 Visualization of semantic relatedness: the set of sentences is represented by rectangles, and the boundaries are denoted by a line. Sentences with high semantic relatedness to Li are denoted by dark colors whereas high relatedness to Ri is colored with bright colors.
4.4 Plot of score'_i for a document set from the Reuters corpus, where the y-axis denotes the value of f'(x) and the x-axis is the sentence index in the text. The dashed horizontal line drawn at y=5.22 is the cut-off point used for this corpus. Star-marked data points denote a true article boundary; circles mark false positive classifications.
4.5 Plot of Word Error Rate Pk and the size of the Co-occurrence Window
4.6 Plot of Word Error Rate Pk and the number of SVD dimensions
4.7 Recall, precision and f-measure of DLCA, along with the recall of a hypothetical perfect segmenter, are shown on the y-axis. The x-axis shows the number of sentences selected as a portion of total topic boundaries, and the corresponding DLCA scores are given below.
6.1 Components of the keyphrase extraction system
6.2 Flowchart of feature extractor

List of Tables

2.1 List of open classes
2.2 List of closed classes
2.3 Some English derivational suffixes
2.4 Some Turkish derivational suffixes
3.1 Different noun senses for the word "mouth" defined in WordNet
3.2 Relationship types in WordNet and the categories they are defined in
3.3 Comparison of nouns in Turkish and English
3.4 Comparison of the complexity of WordNet based SR methods
3.5 Comparison of Turkish and English Wikipedia corpora
3.6 Number of unique words in Turkish Wikipedia with different morphology analysis and filtering levels
3.7 Percentage of Turkish WordNet nouns covered by Wikipedia corpora
3.8 Number of unique words in English Wikipedia with different morphology analysis and filtering levels
3.9 Percentage of English WordNet nouns covered by Wikipedia corpora
3.10 Example word pairs used in the Word Association task with their corresponding average human scores for both Turkish and English
3.11 Results of Word Association and WordNet Mapping for Turkish language when ψ value is varied
3.12 Results of Word Association and WordNet Synonym Mapping for English language when ψ value is varied
3.13 Comparison of using different term weighting functions and dimension reduction in English language
3.14 Comparison of using different term weighting functions and dimension reduction in Turkish language
3.15 Results of Word Association and WordNet Synonym Mapping for English language when co-occurrence window size W is varied
3.16 Results of Word Association and WordNet Synonym Mapping for Turkish language when co-occurrence window size W is varied
3.17 Results of Word Association experiment as correlation coefficients and WordNet synonym mapping for English language under the variation of SVD reduction factor
3.18 Results of Word Association experiment as correlation coefficients and WordNet synonym mapping for Turkish language under the variation of SVD reduction factor
3.19 Comparison of WordNet based methods with the Semantic Space model in English language
3.20 Comparison of Wikipedia SVD Truncated to state-of-the-art algorithms
4.1 Comparison of different Semantic Relatedness Functions used in DLCA for Reuters news articles
4.2 Topic segmentation scores for Reuters news articles
4.3 Topic segmentation scores for Turkish news articles
5.1 ROUGE scores of the summarization experiments for Turkish language. Results on two different datasets are presented.
5.2 DUC 2002 English summarization results
6.1 Features used in Keyphrase Extraction
6.2 Corpus of journal articles and its attributes
6.3 Keyphrase Extraction full-text experiment results
6.4 Keyphrase Extraction abstracts experiment results

Chapter 1

Introduction

Acknowledged as the information age by many, the 21st century is characterized by the availability of large amounts of data to the masses. The digital revolution enabled this by improving almost all aspects of information creation and distribution. Information can now be created in electronic format by anyone through the use of personal computers. The internet not only accommodates massive amounts of data, but also functions as a gateway to share this information.

Information overload is a side-effect of the digital revolution that must be addressed. As the Internet became virtually boundless, a user with an information need is exposed to a large set of data and is not able to utilize this knowledge base effectively. Information retrieval techniques aim to increase this effectiveness by helping the user find the most significant information related to his/her needs. One example of such a tool is the full-text search engine, which tries to resolve the problem by limiting the focus of the user to documents that contain the queried phrase. Unfortunately, natural language is complex and involves ambiguity at different levels. An ambiguous query phrase that has multiple meanings in different contexts can retrieve too many unrelated documents. It is not always easy or possible to clearly express the information need by using query phrases, and this method may fail to narrow the information presented to the user to a manageable level. Summaries and keyphrases are particularly useful in such cases as they concisely indicate the relevance and content of a text document. A user can quickly browse through the documents using keyphrases and summaries, find documents with relevant information, and eliminate others easily. However, most electronically available content lacks summaries or keyphrases, and for this reason creating them automatically is an important task with different applications.

While they are useful, it is a difficult task to automatically create summaries or keyphrases. The difficulty is inherited from the problems of natural language understanding and generation. The Turing test [1] states that in order to decide whether a machine is "intelligent", it should be able to fool a human by imitating another human in a natural language conversation. Even after more than 60 years, it is not possible to confidently argue whether a machine will ever pass this test. In order for a machine to convincingly participate in a natural language conversation, it must successfully perform both natural language understanding and generation. Ideally, a summarization system requires these components in order to perfectly mimic the summarization capabilities of a human. The Natural Language Understanding (NLU) component tries to map a discourse (text or speech) to a computational model, and the Natural Language Generation (NLG) component maps the computational model to natural language. For both NLG and NLU, we humans resolve ambiguities and relate the discourse to our prior knowledge. A machine, on the other hand, is not by itself capable of resolving ambiguities efficiently and effectively. Furthermore, it is possible to argue that the problem of organizing and storing prior knowledge and relating new content to this knowledge is equivalent to the problem of building a general artificial intelligence.

Even though it is tempting to attack the NLU and NLG problems directly, this is not fruitful because of the mentioned difficulties. Simplified models of natural language are usually adopted instead. One such simplified model is based on the lexical cohesion phenomenon seen in discourse. Lexical cohesion states that in a discourse, the words used are related to each other. For example, in a text about the transportation system in a city, the terms bus, train, rail and boat are used repetitively (i.e., there are multiple instances of each term). Lexical cohesion implies that the terms that make up a text should be related to each other in some way, either in the local context or universally. Since the seminal work of Halliday and Hasan [2], lexical cohesion has been widely used in natural language processing tasks such as automated text summarization, malapropism detection and topic segmentation. For a machine, focusing on words instead of sentences or the discourse as a whole simplifies the task greatly. Instead of dealing with all the ambiguities at all levels of natural language, only the terms and their meanings are considered. Organizing prior knowledge in a semantic space of words and their relationships is a simpler but effective strategy for modeling the prior knowledge.

An Australian historian proposed that the key to understanding Australia was "the tyranny of distance". Australians were far removed from their British ancestors, far from the centres of power in Europe and North America and far from each other, with the major cities separated by distances of some 800 km. Time, however, has broken down that sense of distance. Australians today do not see London or New York as the centre of the world. The proximity to Asian economies like China is an economic strength. Transportation and communications links have taken away the sense of remoteness felt by past generations. However, the technology that truly promises to end the tyranny of distance is high speed broadband, whose benefits we are still only beginning to understand though it has already been a decade since the frenzied dotcom era. That is why the Australian government is rolling out the world's most ambitious broadband project: a national network that brings fibre to homes in more than 1,000 cities and towns covering 93% of residences. Next generation wireless and satellite technologies will cover the other 7%. The network will operate at lightning speeds and involve an estimated investment of $40 billion through an independent state-owned enterprise in partnership with the private sector.

Figure 1.1: Exemplary text with content words highlighted

Figure 1.1 shows an example text where the content words are highlighted. Even when only the words in bold are considered, it is possible to see the topical changes and to guess what the text is about in general. In this research, the described methods are based on this observation and use the words, their meanings and their relationships with each other.

Semantic Relatedness

In automated lexical cohesion analysis the first question to be addressed is how to determine the semantic relatedness between term pairs. One approach is to use relationships between term pairs coded manually in a network, as in WordNet. WordNet models the semantic information between words by predefined classical relationships: synonymy (same meaning), antonymy (opposite meaning), hyponymy/hypernymy (specification/generalization) and meronymy/holonymy (member of/has a member). In our previous research utilizing WordNet classical relationships [3, 4, 5], we became aware of shortcomings and limitations of thesaurus and ontology based solutions. Knowledge bases like thesauri and ontologies require arduous manual work by humans to create and maintain the database. This challenges research on languages with limited resources, such as Turkish. Research on semantic relatedness functions [6, 7] empirically reveals that, in direct applications like multiple-choice synonymy and analogy questions, corpus based functions achieve significantly superior accuracy compared to approaches based on manually built knowledge bases. Considering all these reasons, we decided to work on methods that measure the semantic relatedness of term pairs from raw text corpora instead of manually built knowledge bases. Although this knowledge is not as refined as a manually built ontology, it is more extensive and more effective on real-life data. In order to further investigate this, different semantic relatedness functions are evaluated in the topic segmentation, summarization and keyphrase extraction problems. These experiments give us an opportunity to compare corpus statistics and WordNet based functions.
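As an illustration of the WordNet side of this comparison, the minimal Python sketch below scores a word pair with NLTK's path similarity, used here as one representative taxonomy measure for illustration only and not necessarily one of the specific functions evaluated in Chapter 3.

    from itertools import product
    from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

    def wordnet_relatedness(word1, word2):
        # Best path-based similarity over all sense pairs of the two words;
        # path_similarity returns None for incomparable senses, treated as 0.
        scores = [s1.path_similarity(s2) or 0.0
                  for s1, s2 in product(wn.synsets(word1), wn.synsets(word2))]
        return max(scores, default=0.0)

    print(wordnet_relatedness("bus", "train"))   # related transport terms
    print(wordnet_relatedness("bus", "poetry"))  # unrelated, lower score

A corpus based alternative would instead compare co-occurrence vectors of the two words built from raw text, as described in Chapter 3.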

There is a large body of research on explicitly modeling the semantic relatedness between words [8, 7, 6]. Even though WordNet is used for summarization [4, 9], topic segmentation [10, 11] and keyphrase extraction [5], to the best of our knowledge explicit semantic relatedness functions have not been used or evaluated in such tasks.

In the literature, the evaluation of semantic relatedness (SR) functions is usually intrinsic: the relatedness scores calculated by automated methods are evaluated against human judgments. Chapter 3 performs an intrinsic evaluation of both English and Turkish SR methods. Having performed an intrinsic evaluation of SR methods, their performance in different applications is investigated. This is called extrinsic evaluation, where the semantic space is used in a high-level NLP task.
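As a hedged sketch of the intrinsic word pair evaluation mentioned above, the scores produced by an SR function can be rank-correlated with averaged human judgments; the word pairs and numbers below are illustrative placeholders, not data from this research, and SciPy is assumed to be available.

    from scipy.stats import spearmanr

    pairs = [("car", "automobile"), ("coast", "shore"), ("noon", "string")]
    human_scores = [3.9, 3.6, 0.1]      # averaged human ratings (made up)
    system_scores = [0.92, 0.85, 0.07]  # scores from the SR function under test

    rho, _ = spearmanr(human_scores, system_scores)
    print(f"Spearman rank correlation: {rho:.3f}")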

Topic Segmentation

Topic segmentation aims to decompose a discourse into different segments, where each segment discusses a different topic. In other words, it finds the topical changes in the text. The relation of topic segmentation to this research is twofold. First of all, artificial datasets for the evaluation of topic segmentation can be built without much effort by concatenating different text documents. In such a dataset, algorithms are expected to detect the original document boundaries. Furthermore, it is possible to prepare a dataset formed by concatenating different sections of a single document or book. In Chapter 4 the performance of SR methods in the topic segmentation task is evaluated for both languages.

The second reason tying the topic segmentation problem to this research is the relation between topic segmentation and summarization. Topic segmentation can be considered an important preprocessing step for text summarization. The lexical chaining approaches of Ercan [3] and Barzilay et al. [9] use WordNet classical relationships to model topics. Topics are identified by observing the disruptions in lexical cohesion. Especially in some genres, the salient sentences are usually positioned at the start of a topic. Empirical results show that news articles are one such genre [4, 9]. The relationship between segmentation and summarization is addressed in the early work of Salton et al. [12].

In this research topic segmentation is used to divide a given text into segments of subtopics that contribute to a more general topic. While these segments could be intentionally structured by the author, they could also be a consequence of coherence. In Chapter 4 a new feature called differential lexical cohesion analysis (DLCA), which detects points of lexical cohesion change, is introduced. DLCA is a measure that can use any SR method for topic segmentation. These segments are utilized in the summarization method described in Chapter 5.
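The following sketch illustrates the general idea of detecting lexical cohesion change, not the DLCA algorithm itself: each candidate boundary is scored by the similarity of the word windows to its left and right, and boundaries are hypothesised where that similarity drops.

    import math
    from collections import Counter

    def cosine(a, b):
        num = sum(a[w] * b[w] for w in a if w in b)
        den = (math.sqrt(sum(v * v for v in a.values())) *
               math.sqrt(sum(v * v for v in b.values())))
        return num / den if den else 0.0

    def boundary_scores(sentences, window=3):
        # sentences: list of token lists; a lower score means weaker cohesion,
        # so topic boundaries are placed at local minima of this sequence.
        scores = []
        for i in range(window, len(sentences) - window):
            left = Counter(t for s in sentences[i - window:i] for t in s)
            right = Counter(t for s in sentences[i:i + window] for t in s)
            scores.append((i, cosine(left, right)))
        return scores

Replacing the exact word matches above with a semantic relatedness function is what distinguishes the approach of Chapter 4 from repetition-only segmenters.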

Summarization

The characteristics of a summarization system are highly affected by the characteristics of its input and output. Jones [13] emphasizes that no single criterion exists for summarization, and different summaries can be considered "good" with respect to different context factors. She classifies these factors in three main groups: the input factors, the purpose of generating the summary and the output factors. A summarization system needs to be defined, developed and evaluated considering these factors.

Naturally, the type of the input is the first factor that should be considered. The quantity of the input determines whether the problem is an instance of multi-document summarization or single-document summarization. In multi-document summarization a co-related set of documents is summarized, while in single-document summarization a single document is processed. A common example of multi-document summarization is in news portals, where news articles gathered from different sources are aggregated into a single summary representing an event. Different genres have different properties. News articles are short texts of a few paragraphs, which usually narrate a single incident or event. Novels are long text documents formed of loosely coupled parts. Research articles are medium-sized text documents with a distinguished structure imposed on authors by the publishers.

The purpose of forming the summary affects all aspects of summarization. The term generic summary is used for summaries that are formed to represent only the significant information in the original document, without any bias. Query-biased summaries are formed in order to answer a question or query, and thus only the significant information related to the query is included in the summary. Update type news summaries are formed from multiple news articles about a single event in order to update the knowledge of a user who has read some of the articles.

The format of the summary is an important constraint. A table of information extracted from the original content can be considered the most appropriate summary. The most common format usually associated with the term summary is a short coherent running text not more than half of the original content [14]. A summary that is formed of sentences that appear in the original document is called an extract. A summary containing generated sentences that do not appear in the original document is called an abstract. Building abstracts usually involves natural language generation from a model, or transformation or compression of the sentences of the original document. In Chapter 5 a summarization system building extracts is introduced and evaluated for both Turkish and English text documents. This summarization system is built on the topic segments found via the algorithm introduced in Chapter 4. The focus of this research is on forming extract type summaries that contain the most salient sentences of the original content, rather than on the natural language generation aspects of the problem.
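A minimal sketch of an extract style summarizer is given below, under the simplifying assumption that sentence salience is approximated by content word frequency alone; the segment based method of Chapter 5 is more elaborate.

    import re
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}

    def extract_summary(text, n_sentences=2):
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        words = [w for w in re.findall(r"[a-z']+", text.lower())
                 if w not in STOPWORDS]
        freq = Counter(words)
        # score each sentence by the document frequency of its content words
        ranked = sorted(
            range(len(sentences)),
            key=lambda i: -sum(freq[w]
                               for w in re.findall(r"[a-z']+", sentences[i].lower())))
        keep = sorted(ranked[:n_sentences])  # keep original sentence order
        return " ".join(sentences[i] for i in keep)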

Keyphrase Extraction

One of the most compact representations of a document is a list of keyphrases. Keyphrases defining their original document can be used as indicative summaries, which can be used for browsing and retrieval. The term keyphrase is preferred to the more common term keyword in order to emphasize that keyphrases can be formed of multi-word phrases, as in "machine learning". One of the most prominent uses of keyphrases is in research articles. Authors assign keyphrases that best describe their work. The assigned keyphrases do not necessarily appear in the original document. Systems that are able to find even such keyphrases are called keyphrase generation systems. If only keyphrases that appear in the document are targeted, then the system is called a keyphrase extraction system.

Chapter 6 introduces a novel keyphrase extraction method developed in this research, which is based on lexical cohesion analysis. While in both segmentation and summarization the amount of lexical cohesion within a text document is modeled, in keyphrase extraction the aim is to determine in how many different contexts a phrase occurs and how similar these contexts are to each other. The similarity of these contexts is used to measure the phrase's ambiguity.

A typical keyphrase extraction system forms a candidate keyphrase list from phrases that appear in the original document, and evaluates each of these using the observations acquired from the original document. Two of the most effective features to date are frequency and first occurrence position in the given text. Nevertheless, not all keyphrases appear in the original source text and a keyphrase generation system must be able to identify these keyphrases as well. Keyphrase generation is challenged by two difficulties. First, candidate phrases not occurring in the text must be added to candidate phrase lists from external knowledge bases without cluttering the list with irrelevant phrases. Second, the features used in the state-of-the-art systems depend on the frequencies of the phrase in the source document, which cannot be estimated for phrases not appearing in the source document. This work addresses the second problem by introducing features that can also be calculated for phrases unseen in the original document.
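For illustration, the two classical features named above can be computed as in the sketch below; the candidate phrases passed in are hypothetical, and a real system would generate them with part-of-speech patterns or n-gram filters.

    def classic_features(document, candidates):
        doc = document.lower()
        feats = {}
        for phrase in candidates:
            p = phrase.lower()
            first = doc.find(p)
            feats[phrase] = {
                "frequency": doc.count(p),
                # first occurrence position, normalised by document length
                "first_position": first / len(doc) if first >= 0 else 1.0,
            }
        return feats

    doc = ("Machine learning methods for keyphrase extraction often rely on "
           "machine learning classifiers.")
    print(classic_features(doc, ["machine learning", "keyphrase extraction"]))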

Keyphrases have been used as a tool for browsing digital libraries. An example of their application is Phrasier [13], which indexes documents by only using keyphrases, and reports no negative impact on retrieval. Furthermore, Gutwin et al. [15] discuss the use of keyphrases as a tool for browsing a digital library. Although keyphrases are utilized in retrieval and browsing applications, to the best of our knowledge, no other work utilizes the retrieval performance of phrases for identifying keyphrases.

With these observations, the proposed methods are based on the assumption that when a keyphrase is searched in a general corpus, the retrieved set of documents is expected to be relatively focused on a single domain with high similarity to the original document. For example, while "machine" and "learning" are two ambiguous terms that appear in documents from different domains, the phrase "machine learning" appears in a document set that is more concentrated on a single domain.

Our method evaluates each candidate phrase by its performance in document retrieval, which is ideally measured with respect to a user's information need. In order to evaluate the performance of information retrieval systems, the documents retrieved by the queries are evaluated against document sets that are marked as relevant by human judges. The percentage of retrieved documents that are also selected as relevant by the human judges is called precision. The percentage of relevant documents that are also retrieved by the query is called recall. Precision and recall values indicate the performance of the retrieval system. However, since it is not possible to create human relevance judgment sets for each potential query, an estimation of the query performance must be used. The problem of estimating the retrieval performance is a well-studied problem in the literature, known as Query Performance Prediction (QPP) [16]. In QPP, a well-performing query is expected to retrieve a document set with a single large subset, or a few subsets, that are highly cohesive.
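In symbols, writing Ret for the set of retrieved documents and Rel for the set judged relevant, the two standard measures read:

    \mathrm{Precision} = \frac{|Ret \cap Rel|}{|Ret|},
    \qquad
    \mathrm{Recall} = \frac{|Ret \cap Rel|}{|Rel|}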

QPP has gained popularity as its applications proved to be beneficial for search engines by improving document retrieval performance. Applications of QPP include selective query expansion [17, 16], merging results in a distributed retrieval system, and missing content detection [18] to improve the efficiency and effectiveness of document retrieval systems. QPP features can be exploited in keyphrase extraction and lead to a keyphrase generation system. In most keyphrase extraction algorithms, features that depend on the intrinsic properties of the given document are used. On the contrary, the proposed QPP features are not intrinsic properties of the given document; they are properties calculated from a background corpus.
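As a hedged illustration of a feature computed from a background corpus rather than from the document itself, the sketch below approximates the cohesion of the document set retrieved for a candidate phrase by the mean pairwise cosine similarity of the retrieved documents; it assumes scikit-learn is available and is not the specific QPP feature set of Chapter 6.

    from itertools import combinations
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def result_set_cohesion(retrieved_docs):
        # retrieved_docs: texts returned for the candidate phrase, e.g. from a
        # Wikipedia index; a higher value suggests a more focused result set.
        if len(retrieved_docs) < 2:
            return 0.0
        vectors = TfidfVectorizer().fit_transform(retrieved_docs)
        sims = cosine_similarity(vectors)
        pairs = list(combinations(range(len(retrieved_docs)), 2))
        return sum(sims[i, j] for i, j in pairs) / len(pairs)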

An Overall Look at the Applications

Although this dissertation deals with four distinct tasks, namely semantic relatedness, topic segmentation, summarization and keyphrase extraction, these tasks are actually components of a larger system able to produce summaries and keyphrases for a given document. Figure 1.2 shows the overall structure of the general system. The semantic relatedness component provides the prior knowledge to the segmentation, summarization and keyphrase extraction problems. The segmentation algorithm detects topical changes in a text and passes this information to the summarization system. The keyphrase extraction and summarization systems produce the end products of the general system: summaries and keyphrases. The summarization system, enriched by SR methods and topic segments, selects the most salient sentences from the original document. The keyphrase extraction system, using the background corpora, produces keyphrases and significant phrases able to represent the original document. As can be seen from the figure, there are four different artifacts evaluated in the system. These extensive evaluations ensure a detailed analysis of lexical cohesion and of the built system.

Figure 1.2: The overview of all components developed in this dissertation, their relations and performed evaluations

1.1 Goals and Contributions

The ultimate goal of this research is to provide a summarization and keyphrase extraction system that can be adapted to different languages by observing unstructured text corpora. This is demonstrated by experiments carried out in two different languages, Turkish and English.

The contributions of this research can be outlined as follows:

• Comparison of semantic relatedness functions. This research compares WordNet based, vector space model based and dimension reduction based SR methods. Although each is evaluated separately in the literature, no work compares their performance on the same task and data in detail.

• A test-bed for the word association task for Turkish. SR methods have been evaluated using word association in English and German; however, no similar work exists in the literature for Turkish. The word association dataset built in this research aims to fill this gap.

• Construction of semantic spaces in order to calculate the SR of words, using Turkish and English Wikipedia articles.

• An investigation of the automatically built semantic spaces with respect to WordNet classical relationships, determining which types of classical relationships are identified as closely related by the automated methods: whether synonyms, hypernyms/hyponyms, siblings or meronyms in WordNet are also identified as closely related in the semantic spaces. In the results, it is possible to observe that this evaluation strategy enables investigating different properties of semantic relatedness methods when compared to the commonly used evaluation methods available in the literature.

• A topic segmentation algorithm using SR functions. Although a large body of research exists on semantic relatedness methods, their applications in high-level tasks are scarce and limited to malapropism detection [19]. In this work, the use of SR methods in the topic segmentation problem is investigated. This defines not only a novel topic segmentation algorithm, but also an evaluation method for SR methods.

• An investigation of the relationship between summarization and topic segmentation.

• A keyphrase extraction system using features extracted from a background corpus, which can be seen as a first step towards building a keyphrase generation system.

1.2 Outline

In Chapter 2, related work and state-of-the-art methods for each task are discussed. In order to support this discussion and the explanation of the work done, important background information is presented. Furthermore, how the ideas in this research are tied to established linguistic theories is discussed.

Following a bottom-up approach, the most atomic measure in the analysis, semantic relatedness, is discussed in Chapter 3. The two resources used in the analysis, namely WordNet and raw text Wikipedia articles, are introduced. The presentation of the methods is categorized with respect to the resources they use. The datasets used to evaluate SR measures are defined, followed by the results and their discussion.

One level higher than SR methods, the topic segmentation problem is discussed in Chapter 4. Two novel algorithms are introduced that can use any SR measure discussed in this research. Following the presentation of the method, evaluations with respect to different SR methods and state-of-the-art topic segmentation algorithms are presented. Proceeding with the ability to segment a text into topics, a summarization algorithm is defined and evaluated in Chapter 5.

The final task attempted in this research, keyphrase extraction, is introduced in Chapter 6. The features originally defined for the QPP problem and how they are used in a classifier are introduced. Again the dataset used is presented, followed by the results of the experiments performed.

Finally, Chapter 7 discusses the work done and the results obtained in this research. This chapter creates an opportunity to tie together the results of the different tasks, considering the relations between them and how they contribute to the literature. The questions and research opportunities that were raised or became more evident to us are also presented in this chapter.

Chapter 2

Related Work

In this research the primary tool used is lexical cohesion, which formulates the problems attacked as a function of the words used and their semantic relationships with each other. This is of course a simplification, as the meaning and relationships underlying a text are complex. In order to justify this simplification, this chapter first introduces some linguistic theories that model meaning. The challenges in deep analysis of natural language text, and how focusing on lexical cohesion alone avoids such problems, are discussed. Following this discussion, relevant work in the literature for each application is presented.

2.1 Related Linguistic Theories

Semantics is the study of meaning, and in linguistics it focuses on how meaning can be conveyed by language. Many theories have been proposed in the literature to explain how language can entail a meaning, as well as how to define or extract this meaning from observed text or speech. One of the primary goals of Natural Language Processing (NLP) is to create computational models that can accommodate tasks that involve understanding and generation of natural language. Ideally this can only be accomplished by considering the research in semantics, which has different aspects in Linguistics, Cognitive Science and Computer Science. While the Linguistics and Computer Science aspects of the problem are obvious, this may not be the case for Cognitive Science. Cognitive Science deals with the question "How does the human mind work?" This question is especially important for NLP, as how humans interpret a document can be used as a reference for defining algorithms.

In fact, the discipline called Neurolinguistics investigates the mechanisms of the human mind related to the interpretation of discourse and the human ability to communicate. While an important body of research is being carried out in Neurolinguistics, the mechanisms of the human mind are still not revealed and their impact on NLP is yet to be observed.

While the aim of this section is not to comprehensively cover all linguistic theories, it introduces the historical and theoretical foundations of the methods applied. Defining and surveying the theories of meaning must be based on philosophy, cognition and linguistics, and is beyond the scope of this research.

2.1.1 Semiotics and Meaning

One of the earliest theories trying to explain the semantics of natural language was introduced by Saussure [20] in 1916. This theory, forming the basis of structuralist linguistics and semiotics, is characterised by the terms signified and signifier. The signifier is the sequence of symbols in a text, or the sounds, that we can perceive. The signified is the underlying concept described by the signifier. Saussure argues that the signifiers and the relationships between them form a system known as the language. Note that this definition primarily focuses on the words used and their meanings. He further argues that the signifiers are arbitrary and differ from language to language, whereas the signifieds are common across languages.

Before the definition of generative grammar, most of the research in linguistics was focused on morphology and phonetics. With the introduction of generative grammar, a shift in attention towards grammar is evident. Later, in the late 1970s, an interest re-emerged towards the structuralist theory with the works of Cruse [21], Halliday and Hasan [2], and Miller et al. [22]. These works can be considered as the basis of WordNet based methods that try to define the concepts, the words signifying these concepts and the relationships between the concepts.

2.1.2 Analytic View

Although they can be classified as a subcategory of structuralist theory, quantitative approaches defining the distributional hypothesis were introduced as early as the 1950s. To define what words are, Wittgenstein [23] argues that their use in language is important. An argument supporting this view is that a person may not be able to recite the definition of a word, but is still able to use it in the right context to convey a meaning.

In his book, Harris [24] hypothesizes the concept of distributional similarity. When the distribution of words in a language is observed, it is possible to model the semantics of the words. While Harris argues that there is a close relation between the distribution of words and their semantic properties, he avoids defining the natural language phenomenon solely by this distribution, and instead posits that it is observable.

Another philosopher and linguist following the same idea is Firth [25], who is well known for the phrase "You shall know a word by the company it keeps." Firth argues a very similar idea to both Harris and Wittgenstein. The work of Osgood [26] also supports the use of the distributional properties of words in order to model lexical semantics. Whether or how these theories are able to explain natural language is not the main concern of this dissertation; instead, how the distributional properties of words can be exploited in real-life applications is the primary concern.

2.1.3 Generative Grammar and Meaning

Chomsky [27] introduced generative grammar in order to explain the natural language phenomenon. He argues that there is a two-level generative model for language, consisting of a deep level and a surface level. The deep level is the idea or the thought that is to be expressed. The surface level is the natural language we observe, represented by words or sounds. The surface level is transformed from the deep level by rules.

Another essential component Chomsky defines is the "universal grammar", which is a grammar that is common to all languages and is an innate ability of the human mind. This description of universal grammar explains why humans are able to learn a language while animals are not. However, it is not helpful in terms of building computational models, as it is not known what these innate abilities are.

From a practical point of view, Chomsky's theory is important as it uses Context-Free Grammar (CFG) to model the deep level and surface level of languages. The CFG defined by Chomsky is expressively strong, and has been used to define formal languages such as programming languages. Two important challenges for computers can be observed with a CFG model. First of all, a CFG can be ambiguous, i.e. there can be multiple ways of building a parse tree for the same sentence. Second, as Chomsky [27] argues, the number of sentences that can be generated is infinite.
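The ambiguity problem can be seen with a toy grammar: the sketch below, assuming NLTK is installed, produces two distinct parse trees for one sentence; the grammar is a made-up illustration, not one used in this research.

    import nltk

    grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> 'I' | Det N | Det N PP
    VP -> V NP | VP PP
    PP -> P NP
    Det -> 'an' | 'my'
    N  -> 'elephant' | 'pajamas'
    V  -> 'shot'
    P  -> 'in'
    """)

    parser = nltk.ChartParser(grammar)
    sentence = "I shot an elephant in my pajamas".split()
    trees = list(parser.parse(sentence))
    print(len(trees))  # 2 parse trees: the PP attaches to the noun or the verb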

The methods described in this dissertation can be categorised as structuralist theory based methods. Since lexical semantics is modeled by quantitative methods, i.e. the distributional properties of words, this work can be considered an attempt to combine two streams of research in Computational Linguistics.

2.2 Linguistic Background

Natural language, whether in the form of speech or text (discourse), is used as a means to express an opinion, fact or concept. It tries to convey a meaning expressed by its author or speaker. The meaning in the form of natural language is organized in terms of words, the ordering of the words, and the punctuation marks or tones in speech. The order of words, called syntax, can change the meaning, as in the examples "Man bites dog" and "Dog bites man." The placement of a comma can change the meaning of a sentence completely. Consider the change in meaning in the following examples: "Careful, children crossing," "Careful children crossing." Note that syntax is closely related to the generative grammar first proposed by Chomsky [27]. These two examples can be used to argue that even though the same set of words is used, the meaning can change depending on the grammar.

Natural language contains ambiguity at different levels. A discourse can be interpreted in different ways, which can only be resolved by considering the context of the communication. For example, "Flying planes can be dangerous" can be interpreted as either "flying a plane is dangerous" or "flying planes are dangerous and can fall on you." Thus, the meaning of the same sentence can vary in different contexts depending on the intention of the writer/speaker. This example also shows that the grammatical category of the word "flying" plays an important role in the semantics of the sentence.

Furthermore, most of the time the human mind interprets a discourse with the help of prior knowledge. For example, the sentence "Kenny is afraid of the water" makes more sense when the interpreter has the prior knowledge of "the possibility of drowning in water". For this reason a machine should be able to relate the discourse to prior knowledge, or to knowledge given previously in the same discourse.

Language is usually processed at different levels. These levels are categorized in terms of the text units they consider. For instance, morphology deals with the structure of words and the forms they take, while syntax deals with the ordering of words to form a phrase, expression or sentence.

2.2.1 Morphology

Morphology analyses and classifies morphemes in a language. A morpheme is the smallest meaningful unit in text, realized as words, suffixes, affixes or infixes. For example, the word "kids" is composed of two morphemes: the root "kid" and the suffix "-s", which adds a plural meaning to the singular form of the word.

Words in dictionaries and other resources are catalogued by a specific form of the word. For instance, the words "run", "runs" and "running" are mapped to the same form, "run". This form of the word is known as the lemma. A lemma is transformed into different forms through the addition of morphemes, namely suffixes and prefixes. Both in English and in Turkish the lemmas are the singular forms of the words. The term lexicon defines the word stock of a language, and is formed of lemmas.
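For example, mapping surface forms back to lemmas can be done with NLTK's WordNet lemmatizer, used here only as one possible tool (it requires the WordNet data to be downloaded):

    from nltk.stem import WordNetLemmatizer  # requires nltk.download('wordnet')

    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("kids"))              # -> kid (noun by default)
    print(lemmatizer.lemmatize("running", pos="v"))  # -> run
    print(lemmatizer.lemmatize("runs", pos="v"))     # -> run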

Lemmas are categorized into different classes. Some words, called open class words, are known to contribute more to the meaning of the discourse. Typically nouns, verbs, adjectives and adverbs are classified as open class words. The term open class denotes that these words can be derived to produce new words through the use of suffixes. Closed class words, on the other hand, are mostly used for grammatical purposes. Pronouns, conjunctions, determiners, prepositions and postpositions belong to this category. Since it is not possible to derive new words from these, the number of words belonging to the closed classes is smaller than that of the open class words.

Class            Description                                   Turkish                       English
Nouns (N)        Used to name a person, place, thing or idea   kişi, uçak                    person, plane
Verbs (V)        Used to refer to an action                    koşma, gelme                  run, come
Adjective (JJ)   Used to describe a noun                       güzel kalem, kötü rüya        good pencil, bad dream
Adverbs (RB)     Used to describe classes other than nouns     güzel anlamak, kötü yapılmış  well understood, badly done

Table 2.1: List of open classes

Class             Description                                  Turkish                  English
Pronoun (P)       Used to refer to nouns, by substitution      o, bu                    it, this
Conjunctions (C)  Used to connect phrases and sentences        ve, veya                 and, or
Determiners (DT)  Used to specify references to nouns          -im, -in                 the, my
Adpositions (PP)  Used to specify the relation between nouns   benim için, bana doğru   to me, towards me

Table 2.2: List of closed classes

Our research is primarily focused on the Turkish and English languages, thus the discussion will be focused on these two languages only. Table 2.1 summarizes the open classes with examples from Turkish and English. The symbols given in parentheses are the Penn Treebank [28] tags of each class. Table 2.2 summarizes the closed classes with examples from Turkish and English. As can be seen from these tables, the structures of the two languages possess some similarities and differences. While the open class words are similar, the determiners and adpositions differ in the two languages. One major difference between the two languages is that Turkish is an agglutinative language, which makes use of suffixes in order to derive new words and provide grammatical constructs. For example, for determiners, "your pencil" can be translated into Turkish as "kalemin". Another difference is in adpositions, where English uses prepositions as in "for me", while Turkish uses postpositions as in "benim için". Also, in Turkish, suffixes can be used instead of adpositions; for example "to the plane" can be translated as "uçağa".

Both of the languages use suffixes to change the meaning of a lemma. It is possible to categorize the suffixes into two distinct sets: inflectional and derivational. The main distinction between these two types is that the former does not map the word it is applied to onto another lemma. In other words, inflectional suffixes refer to the same concept and specify the relation of the word to other words in the text, as in the example "kalemim", which specifies that the pencil belongs to me. Derivational suffixes, however, change the meaning of the word, as in the example "kalemlik", which transforms the word "kalem", meaning pencil, into "kalemlik", a container holding pencils.

Some common affixes are given below, as they are used in the analysis in the following chapters. However, this list is by no means exhaustive. Readers are referred to Goksel and Kerslake [29], Lewis [30], and Istek [31] for further information on Turkish morphology, and to Carstairs [32] and Jurafsky and Martin [33] for English morphology.

Derivational suffixes transform a lemma into another lemma. For example, in English the lemma "stable" is transformed into "stabilization" by the two suffixes "-ize" and "-ation". Furthermore, some derivations transform a word in one class (such as a noun) into another class (such as a verb). For example, the verb "perform" is transformed into the noun "performance" using the "-ance" suffix.

When compared to Turkish, English is limited in its number of derivational affixes. Also, in contrast to agglutinative languages like Turkish and Finnish, the number of affixes applied to a root lemma is usually low, and affixes are seldom chained to produce a long word. For example, in Turkish the word "yaban-cı-laş-tır-ıl-ma", derived from the word "yaban" using five suffixes, is a valid word that can be encountered in text. Such chaining is possible in English, but it is rarely observed in contemporary usage.

Some common derivational affixes are given in Table 2.3 with the class they transform from and the class they transform to. Table 2.4 shows a non-exhaustive list of Turkish derivational suffixes.

Affix | From PoS | To PoS | Example
-ly | JJ | RB | hard-ly, soft-ly
-ess | N | N | wait-(er)-ress, prince-ss
-hood | N | N | mother-hood, neighbor-hood
-ist | N | N | theor(y)-ist
-ity | JJ | N | pur(e)-ity, equal-ity
-ism | JJ | N | ego-ism, conserv(e)-at-ism
-ment | V | N | commit-ment, develop-ment
-er | V | N | paint-er, sing-er

Table 2.3: Some English derivational suffixes

Affix | From PoS | To PoS | Example
-e | V | N | sür-e, diz-e
-ç | V | N | süre-ç, gül-eç
-miş | V | N | geç-miş, yem-iş
-tı | V | N | bağın-tı, doğrul-tu
-n | V | V | kaç-ın, gör-ün
-uş | V | V | uç-uş, koş-uş
-a | N | V | kan-a, tür-e
-la | N | V | tuz-la, un-la
-de | N | N | göz-de
-den | N | V | sıra-dan
-n | N | N | yaz-ın
-e | N | N | komut-a, göz-e

Table 2.4: Some Turkish derivational suffixes

When compared to English, Turkish derivational suffixes are abundant. One important consequence of this richness is the level of ambiguity in Turkish: the same suffix can be used to derive words of different classes from different classes. For example, the suffix "-ın" can transform a V to produce a V, or can transform a V to produce an N.

Inflectional suffixes are used in a sentence to define the relationships between the words in the sentence. They are typically used to express tense, person and case. In English there are only a few inflectional suffixes, while in Turkish the number of different inflectional suffixes is high. In Turkish, inflections can be

used to specify grammatical properties such as the tense of the discourse, possession or the direction of an action. It is important to note that some Turkish words, although derived through inflectional suffixes, have changed semantically over time and produced new words. For example, the words "göz-de" and "sıra-dan" have gained additional meanings beyond their inflections and are accepted as new words through repeated use. This creates an additional challenge for algorithms that model word meaning, posing the question of whether corpus statistics should treat occurrences of "gözde" as an inflected form of "göz" or as a separate entry.

Sağlam+laş(1)+tır(2)+mak(3) (sağlamlaştırmak = to strengthen)
Sağlam +Noun +A3sg +Pnon +Nom ^DB +Verb +Become(1) ^DB +Verb +Caus(2) +Pos ^DB +Noun +Inf1(3) +A3sg +Pnon +Nom

Figure 2.1: Morphological analysis of the word "sağlamlaştırmak"

Morphological analysis can be performed using Finite State Automata (FSA). The language accepted by the FSA can be decomposed into morphemes. Morphological analysis tools convert the surface representation of a word into a deeper form that explicitly expresses the morphemes in the word. For example, Figure 2.1 gives the morphemes of a word and their types, where DB stands for the derivational boundary and the suffixes that follow it are inflectional.
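As a toy illustration of the idea, the following sketch segments a word with a hand-written automaton over a hypothetical root lexicon and a handful of suffixes; real analysers (e.g. two-level morphology systems) additionally handle vowel harmony, consonant alternations and far larger lexica.

# A minimal finite-state sketch of morphological segmentation; the root
# lexicon and suffix transitions are illustrative placeholders only.
ROOTS = {"kalem", "ev"}                      # hypothetical root lexicon
TRANSITIONS = {                              # state -> {suffix: next state}
    "ROOT": {"ler": "PLURAL", "im": "POSS", "": "END"},
    "PLURAL": {"im": "POSS", "": "END"},
    "POSS": {"": "END"},
}

def segment(word):
    """Return one root + suffix segmentation accepted by the toy automaton."""
    for i in range(len(word), 0, -1):
        root, rest = word[:i], word[i:]
        if root not in ROOTS:
            continue
        path = _consume(rest, "ROOT")
        if path is not None:
            return [root] + path
    return None

def _consume(rest, state):
    if rest == "" and "" in TRANSITIONS[state]:
        return []
    for suffix, nxt in TRANSITIONS[state].items():
        if suffix and rest.startswith(suffix):
            tail = _consume(rest[len(suffix):], nxt)
            if tail is not None:
                return [suffix] + tail
    return None

print(segment("kalemlerim"))   # ['kalem', 'ler', 'im']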

For a word's surface representation there can be many different derivations, possibly with different parts of speech. This is known as ambiguity at the morphological level. In such instances, the true morphological structure of the word can be determined from the context in which it appears. Consider the two sentences "Bu senin düşün" and "Merdivenden numaradan düşün": they use the same surface representation in different parts of speech, which can be resolved by considering the syntax of the sentence.

While morphological analysis is useful for obtaining in-depth knowledge about the uses of words, it requires disambiguation and additional computational cost. In Information Retrieval, simpler methods known as stemmers are commonly used. A stemmer does not have a lexicon and simply strips common inflectional affixes from words. For example, the Porter Stemmer [34] correctly strips the plural affix of the word "boys" to yield "boy"; however, the word "boy" itself is transformed into "boi", because the stemmer rewrites the final "y", a letter that in other words acts as a suffix converting nouns to adjectives. As this example shows, stemmers do not try to be linguistically accurate.

Figure 2.2: Syntax tree of the sentence "John hit the ball"

However, they are consistent, as the same surface form of a word is always transformed into exactly the same stem. When compared to English, Turkish and other agglutinative languages are rich in terms of affixes. As a consequence, the number of ambiguous constructs and the computational cost required to resolve them increase. In our research we additionally experimented with a morphological analyser that finds the most commonly used form of Turkish words.
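To make the consistency-over-accuracy trade-off concrete, the sketch below implements a deliberately naive suffix stripper; it is not the Porter algorithm, and the suffix list is purely illustrative.

# A naive suffix-stripping stemmer: linguistically crude, but every occurrence
# of the same surface form is mapped to the same stem.
SUFFIXES = ["ies", "sses", "ing", "ed", "s", "y"]

def naive_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

print(naive_stem("boys"))   # -> 'boy'
print(naive_stem("boy"))    # -> 'bo'  (not accurate, but consistent across a corpus)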

2.2.2 Syntax

Syntax studies how words are composed to form larger text units, for example noun phrases. Chomsky [27] defines syntax trees via Context-Free Grammars, a formalism largely adopted by the linguistics and computer science communities. Practically, the end product of the syntactic analysis of a sentence is its syntax tree, which shows how the words are composed to build the sentence.

Figure 2.2 shows an example syntax tree. The noun phrase formed of the determiner "the" and the noun "ball" combines with the verb "hit" to form the verb phrase "hit the ball". However, there is ambiguity at the syntax level as well, meaning that multiple trees can be derived for a single sentence.
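For illustration, the toy context-free grammar below derives the tree in Figure 2.2 using NLTK's chart parser; the grammar fragment is written for this sentence only and is not part of the systems described later.

# A minimal CFG covering the example sentence, parsed with NLTK.
import nltk

grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    VP -> V NP
    NP -> DT N | PN
    PN -> 'John'
    V  -> 'hit'
    DT -> 'the'
    N  -> 'ball'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse(['John', 'hit', 'the', 'ball']):
    tree.pretty_print()   # prints the syntax tree of Figure 2.2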

To grow a syntax tree, morphological analysis is required to determine the possible classes each word can take. Through morphological analysis it is possible to narrow the search space for building the tree. However, since ambiguity can occur in both syntactic and morphological analysis, this is a challenging task, where clues at both levels should be considered to resolve the ambiguities.

1. The child wept all night.

2. The cheese wept all night.

Figure 2.3: Two sentences having the same syntax tree, where the latter is not meaningful

Syntax trees can be used to infer the semantic role of each word in the sentence. However, syntax alone is not enough to model the semantics of a language. Consider the two sentences in Figure 2.3: both can be derived with the same syntax tree. Nevertheless, while the former is both syntactically correct and meaningful, the latter is not meaningful, as a "weeping cheese" does not exist.

The meaning of a discourse depends on both its lexical and its syntactic structure. Natural language is filled with ambiguity and with counter-examples to any rules that might be used in semantic analysis. The computational cost of growing syntax trees is high, and for less studied languages like Turkish, syntactic parsers and treebanks are not readily available. For these reasons, simplified models such as lexical semantics and lexical cohesion are attractive options for practical applications.

2.2.3 Coherence

With the seminal work of Halliday and Hassan [2], practical applications of the structural theory of linguistics emerged in the NLP literature. In my Master's thesis [3] I reviewed and used a method related to structural theory. The structural theory in linguistics defines semantics as a system of relationships between text units. In this model, coherence is the hidden element of a discourse that defines its general meaning: the structure of ideas and the flow of the document are defined by coherence. Modelling coherence is a difficult task, as it is hard to define general patterns without actually interpreting the text.

The example in Figure 2.4 clarifies the difficulties in modelling coherence. In the first example, sentence 1 is coherent with sentence 2.

1. [John is living in a neighbourhood with a very high crime rate.1] [His house was robbed 4 times last year.2]
2. [John is living in a neighbourhood with a very high crime rate.1] [He likes spinach.3]
3. [John is living in a neighbourhood with a very high crime rate.1] [I bought a movie about a murderer.4]

Figure 2.4: Examples of coherence and lack of coherence

In the second example, it is not possible for the reader to establish a link between the two sentences. However, in the presence of a third sentence or prior knowledge such as "Spinach is easy to find in that neighbourhood", the two sentences become coherent. The third example is also not coherent, even though the words "crime" and "murder" are related to each other.

2.2.4 Cohesion

Cohesion is the term for the relationships in a text that are more concrete and observable. It focuses on relatively smaller units of text than coherence does. All cohesion relationships contribute to coherence. Halliday and Hassan [2] define five types of cohesion relationships:

• Conjunction - Usage of conjunctive structures like “and” to present two facts in a cohesive manner.

• Reference - Usage of pronouns for entities. In the example “Dr. Kenny lives in London. He is a doctor.” the pronoun “he” in the second sentence refers to “Dr. Kenny” in the first sentence. These are also known as anaphora in linguistics.

• Lexical Cohesion - Usage of related words. In the example sentence "Prince is the next leader of the kingdom", "leader" is a more general concept than "prince".

• Substitution - Using an indefinite article for a noun. In the example "As soon as John was given a vanilla ice cream cone, Mary wanted one too.", the word "one" refers to the phrase "vanilla ice cream cone".

It is easy to see that the dog's family tree has its roots from wolves. In fact, their connection is so close and recent, the position of wolves on the tree would be located somewhere on the branches. Any breed of dog can have fertile offspring with a wolf as a mate. The only physical trait found on a wolf that is not found on a domesticated dog is a scent gland located on the outside base of a wolf's tail. Every physical trait on a dog can be found on a wolf. Wolves might not have the coat pattern of a Dalmatian, but there are wolves with black fur and there are wolves with white fur.

Figure 2.5: Lexical cohesion example in a discourse, where the content bearing words are highlighted


• Ellipsis - Implying a noun without repeating it. In the exchange "Do you have a pencil? No, I don't", the word "pencil" is implied without being repeated in the second sentence.

The cohesion relationships conjunction, reference, substitution and ellipsis can only be detected through the use of syntax; a deeper level of analysis is required. There are also many ambiguous examples where identifying the true relationship may not be obvious even for humans. Lexical cohesion, on the other hand, can be analysed with simpler models using only surface-level analysis, since the elements related are simply words or phrases.

2.2.5 Lexical Cohesion

Cruse [21] discusses the issues and importance of lexical semantics in his book, covering problems related to both grammar and lexical semantics. From context it is possible to infer additional relationships and to model semantics more accurately. Especially in problems such as textual entailment, the role of these contextual clues is important. However, in the topic segmentation, summarization and keyphrase extraction tasks, a rough estimate of the density of lexical cohesion may suffice, without deeper analysis such as syntactic parsing. This observation is exploited in numerous studies [9, 10, 4, 5].

In terms of lexical semantics, a viable approach is to classify semantic relationships into two categories. The first category is local semantic relations, which are established in the context of the discourse being analysed. The second category is global, where semantic relationships commonly known to exist between

lexical items are considered. For example, in Figure 2.5, two sets of globally related words in the text are underlined: the first set is represented by italic and underlined words, and the second set by bold and underlined words. The two sets are {wolf, dog, Dalmatian, fur, coat, tail} and {family, tree, branches, roots}. A closer look at the example also shows a contextual relationship between these two sets, where 'tree' and 'wolves' are associated by the given text. While a more sophisticated analysis could identify both global and contextual semantic relationships, it is a much more difficult task. In our previous work, a simple clustering approach was used to exploit this observation [4].

Nevertheless, the global relationships can readily be used to model the density of lexical semantics in topic segmentation, summarization and keyphrase extraction. With this motivation, it is important to model the global semantic relationships common in the language. Chapter 3 deals with this question and compares two major alternatives. The first, related to the structural theory following the work of Halliday and Hassan [2] and Miller [22], uses classical relationships that can be defined and stored in thesauri. The second alternative is related to the analytic view of structural theory, and follows the distributional hypothesis of Osgood [26], Harris [24] and Firth [25].

2.3 Literature Survey

This section reviews previous work on semantic relatedness, topic segmentation, summarization and keyphrase extraction. The review is distilled to concentrate on algorithms that use some form of lexical cohesion or lexical semantics. However, since most of the research on these problems can be tied to lexical cohesion in some way, it covers the most significant state-of-the-art algorithms.

The presentation is organized in the same order as the chapters of the dissertation. First, related work on semantic relatedness is introduced. This is followed by the literature on topic segmentation, summarization and keyphrase extraction.

2.3.1 Semantic Relatedness

Semantic relatedness is concerned with building a function that correlates with the global relations between word pairs. Such a function should report high values for well-known synonym pairs, but low values for pairs that a human judge would classify as unrelated. This problem is closely related to lexical cohesion and to lexical semantics.

Algorithms for semantic relatedness can be categorized according to their source of information. They are presented here in three main categories: the first is based on co-occurrence statistics calculated from corpora; the second is based on dictionaries, taxonomies or ontologies; and the third uses links available in web-based resources such as Wikipedia.

2.3.1.1 Distributional Hypothesis Based Methods

From a theoretical point of view, different theories share the idea that there is a correlation between the semantic relatedness of terms and their distribution in a large corpus. The distributional hypothesis, which argues that semantically related words appear in similar contexts, can be attributed to Harris [24].

One of the earliest uses of the term semantic space is in psychology by Osgood et al. [26]. They describe an experiment on human subjects in which each participant assigns scores to polar word pairs to describe the semantic properties of a concept. The scores are between 0 and 7, recorded on a scale as in Figure 2.6. The semantic space is the Euclidean hyper-space in which concepts are points and the scales of polar words are the dimensions. For example, the concept "mouse" can be differentiated from "mountain" using the polar word pair "small-large". However, Osgood et al.'s methodology requires the dimensions to be defined manually, which is arduous work for human subjects.

Another attempt at applying these ideas was made by Rubenstein and Goodenough [35]. They built a dataset of word pairs and obtained judgements from 51 different subjects. Each subject assigned a score to each word pair in the list, where a score of 4 represents highly synonymous and a score of 0 represents semantically unrelated.

Figure 2.6: Example of Osgood's concepts and polar words differentiating the concepts

Rubenstein and Goodenough investigated whether there is a correlation between contextual similarity and synonymy. Building a corpus of 4.5 million words, they identified contexts for each word, defining the context as the sentence: for a word in the test set, all other words appearing in the same sentence are considered members of that word's context. Representing these contexts as sets, they investigated the correlation between the human synonymy judgements and context overlap. The context overlap is defined set-theoretically, and the number of shared context words is used for comparing two words' contexts.
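The following sketch illustrates this contextual-overlap idea on a toy corpus; the sentence-level context definition follows the description above, while the corpus, tokenization and any weighting are placeholders.

# Contexts as sets of sentence co-occurring words; overlap size compares two words.
from collections import defaultdict

def sentence_contexts(sentences):
    """sentences: list of token lists from a corpus."""
    contexts = defaultdict(set)
    for tokens in sentences:
        for w in tokens:
            contexts[w].update(t for t in tokens if t != w)
    return contexts

def context_overlap(contexts, w1, w2):
    return len(contexts[w1] & contexts[w2])

corpus = [["the", "dog", "barked", "at", "the", "wolf"],
          ["the", "wolf", "has", "grey", "fur"],
          ["the", "dog", "has", "black", "fur"]]
ctx = sentence_contexts(corpus)
print(context_overlap(ctx, "dog", "wolf"))   # shared context words include 'has', 'fur'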

In information retrieval, the distributional hypothesis is used to characterize and retrieve documents from a large collection. The bag-of-words model defines a document as a bag (multiset) of words and neglects their order [36]. In this setting, the context observed in the analysis is the document: documents are characterized by the words forming them, and relevancy to a query is formulated as the similarity of word distributions. The seminal work of Deerwester et al. [37] proposes the use of Singular Value Decomposition (SVD) on this bag-of-words model to improve the performance of document retrieval.

Landauer and Dumais [8] also use the term-by-document occurrence matrix to build a semantic space for semantic relatedness between words. Theirs is the first instance of using SVD to reduce the dimensionality of the semantic space; in their evaluation on the near-synonymy questions of TOEFL, the method achieves better results than the average non-native college applicant. Hyperspace Analogue to Language (HAL) [38] uses a term-by-term matrix to build the semantic space and does not perform any dimension reduction when inferring semantic relatedness. A comparative study investigating different parameters and co-occurrence context definitions in the full-dimensional space (without any dimension reduction) is

by Bullinaria and Levy [39]. Also, the work of Terra and Clarke [40] investigates additional information-theoretic measures in the full-dimensional semantic space.

Rapp [7] combines the two ideas, building a term-by-term matrix and reducing it with SVD. The method implemented in this research is based on Rapp's work. Until very recently, comparisons between full-dimensional semantic spaces and reduced spaces have been lacking in the literature. Bullinaria et al. [41] investigate the work of Rapp [7] in more detail and compare the performance of full-dimensional and dimension-reduced semantic spaces. While their work and the experiments in this research overlap in certain respects, there are distinguishing differences in both the evaluation tasks and the parameters investigated. In the next chapter, these models are formally defined and elaborated further.
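As a rough preview of these models, the sketch below builds a term-by-term co-occurrence matrix, reduces it with SVD and compares the resulting word vectors with cosine similarity; the window size, weighting scheme and dimensionality are illustrative placeholders rather than the settings used in this research.

# Term-by-term co-occurrence + SVD reduction + cosine comparison (toy example).
import numpy as np

def cooccurrence_matrix(tokens, vocab, window=2):
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(tokens):
        if w not in index:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in index:
                M[index[w], index[tokens[j]]] += 1.0
    return M

def reduced_space(M, k=2):
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] * S[:k]          # k-dimensional word vectors

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

vocab = ["dog", "wolf", "fur", "tree", "branch"]
tokens = "dog wolf fur dog fur wolf tree branch tree branch".split()
W = reduced_space(cooccurrence_matrix(tokens, vocab), k=2)
print(cosine(W[0], W[1]))   # dog vs. wolf (same topical cluster in this toy corpus)
print(cosine(W[0], W[3]))   # dog vs. tree (different cluster)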

The use of search engines for semantic relatedness is proposed by Cilibrasi et al. [42]. In their work, the semantic relatedness of two words is computed from the number of documents retrieved by three queries: one for each word individually, and one combining both words. It can be argued that this method is a variant of the closed-corpus methods described above that uses a significantly larger corpus, namely the index of the Google search engine.

In contrast to the bag-of-words assumption made in all of the above research, there is increasing interest in integrating syntactic information into the general distributional hypothesis. Pado and Lapata [43] use syntactic patterns to define the context of words: instead of simply using a co-occurrence window, a set of syntactic patterns defines the context. While this is an interesting idea, its applicability to languages that do not have a syntactic parser is limited, and the computational cost of building syntax trees limits the corpus size. Although integrating syntax promises to avoid the noise introduced by the loose, window-based definition of context, we have opted for using large corpora instead.

An issue challenging the evaluation of algorithms based on the distributional hypothesis is that their effectiveness depends on two factors: the algorithm applied and the corpus used. In our experiments, a common comprehensive corpus is used, making it possible to focus on the first factor, i.e. the algorithms.

2.3.1.2 Ontology Based Methods

Dictionaries built by linguists are a common source of lexical semantics. In its most simplistic form, a dictionary is composed of words and their definitions. Kozima and Forogori [44] use the Longman Dictionary of Contemporary English to build a semantic network: the words used in the definitions are transformed into links between words. This transformation can be considered as calculating the similarity between word definitions. The idea is later applied to WordNet glosses by Pedersen et al. [45].

Roget's Thesaurus is also used as a knowledge base for semantic relatedness research. It is formed of a category structure grouping words. Morris and Hirst [46] use Roget's Thesaurus to model the semantic relatedness between words, using the overlap of the categories each word is a member of. Jarmasz and Szpakowicz [47] use the hierarchy of the categories to model semantic relatedness and achieve promising results.

The WordNet project [22] is a semantic network formed of different relationships between words and their meanings. It contains the relation types commonly referred to as classical relationships [21]. The structure of WordNet and the variety of connections in the network have attracted substantial interest in this knowledge base. In the next chapter, the different semantic relatedness measures are explained in more detail.

2.3.1.3 Link Based Methods

In the search engine domain, the links between web pages are exploited extensively to improve retrieval results [48]. Links between web pages indicate both importance and coherence between documents. Due to its website nature, Wikipedia makes extensive use of links: articles link to each other and to the category hierarchy.

Strube and Ponzetto [49] use the category hierarchy of Wikipedia to measure semantic relatedness. Their method is based on the paths between articles in the hierarchy: the distance between the articles containing the words is used to calculate semantic relatedness. Gabrilovich and Markovitch [50] also exploit both

the links and the content of Wikipedia to model the semantic relatedness between words. The concepts identified by the titles of Wikipedia articles are compared to each other by observing the paths connecting their articles. Similar methods are applied by Milne and Witten [51] and Mihalcea and Csomai [52].

The effectiveness of these algorithms is adversely affected by their ability to map a word or concept to articles. While the occurrence of a word in the text of an article is not enough to associate the word with the article, using more restricted textual content such as article titles limits the coverage of the semantic space.

It should also be noted that the category structure in Wikipedia is not actually a hierarchy, but a large graph containing cycles. The category structure also contains cross-cutting categories; for example, "Events in 1980" groups otherwise unrelated articles together. Since Wikipedia is a collaboratively built encyclopedia with voluntary editors and there are no restrictions on creating links or assigning categories, there is noise in its structure that should be tackled explicitly.

2.3.2 Topic Segmentation

The topic segmentation task tries to decompose a text into segments discussing thematically different topics. Ideally, identifying a topic requires a detailed coherence analysis; however, cohesive ties can be exploited as a clue. A plausible methodology for topic segmentation is to model the lexical cohesion of the text in search of disruptions in its density.

This idea has been dominant in algorithms attacking topic segmentation, and naturally, how lexical cohesion is modelled determines their effectiveness. At the most simplified level, lexical cohesion can be modelled as a function of word repetitions. However, word repetition can be both misleading and an oversimplification of lexical semantics. Another alternative is integrating the classical relationships present in WordNet. While the spectrum of lexical semantics exploited increases with classical relationships, it is still limited, as only strongly related word pairs are considered. In my algorithm, instead of relying on these pre-defined relationships, the whole spectrum of semantic relatedness is used.

This section groups related work in topic segmentation into three categories. The first category presents algorithms based on observing word repetition in text. The second category discusses methods using classical relationships in WordNet. The third is a recent methodology that learns probabilistic topic models from training data and segments the text by associating segments with the topic models.

2.3.2.1 Word Repetition Based Methods

Youmans [53] tracks the first uses of terms in the text, and assumes that topic boundaries should lie at the concentration points of first uses of words. This approach can be criticized because the first uses of terms concentrate mostly at the beginning of the document, so the method suffers on long documents.

Hearst [54] introduces the TextTiling algorithm, which combines two scores for the gaps between adjacent text block pairs. TextTiling uses the cosine similarity between two adjacent text blocks and combines it with the average number of new terms introduced in the two blocks. A low value of this score points to a boundary.
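A minimal sketch of the block-comparison step, assuming sentences are already tokenized; the smoothing and depth-score computation of the full algorithm are omitted.

# Cosine similarity between adjacent blocks of sentences; low gap scores
# suggest candidate topic boundaries.
import math
from collections import Counter

def cosine(a, b):
    shared = set(a) & set(b)
    num = sum(a[t] * b[t] for t in shared)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def gap_scores(sentences, block=2):
    """sentences: list of token lists; returns one score per inter-sentence gap."""
    scores = []
    for gap in range(1, len(sentences)):
        left = Counter(t for s in sentences[max(0, gap - block):gap] for t in s)
        right = Counter(t for s in sentences[gap:gap + block] for t in s)
        scores.append(cosine(left, right))
    return scores   # low values point to likely topic boundaries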

Salton et al. [55] treat topic segments as text parts with strong internal relationships, disconnected from other adjacent parts. With this assumption, topic segmentation is an optimization task searching for the set of topic boundaries that maximizes the similarity within each segment, requiring strict integrity within each topic. Similarity matrices are common in work building on this idea. Choi [56] starts with a similarity matrix built using cosine similarity. He argues that cosine similarity is unreliable for short texts such as sentences, so he applies a rank-based transformation to avoid this problem. Using the rank-based scores for each sentence, he obtains a segmentation with a divisive clustering algorithm. Ji and Zha [57] apply a technique called anisotropic diffusion to the sentence similarity matrix in order to enhance the patterns of lexical cohesion by removing noise; the modified similarity matrix is then processed with a dynamic programming algorithm to produce the final segments. The distributional properties of lexical cohesion can also be exploited with statistical methods: Utiyama et al. [58] introduce a language model that defines the probability of a segmentation given the terms in each segment.

The LCSeg algorithm [11] builds on lexical chains, which are mostly utilized with WordNet classical relations [9, 4, 5]. Lexical chains are essentially sequences of related words spanning the text. In LCSeg the semantic relations are neglected and only word repetitions in the text are considered. The start and end points of lexical chains are used to model the topic segments. It is also important to note that LCSeg uses a supervised classification scheme which requires training.

Malioutov and Barzilay [59] model the text as a graph, where nodes are sentences and edges are weighted by their pairwise cosine similarities. Topic segmentation is formulated as the normalized cut problem on this graph, which maximizes the similarity within segments and minimizes the similarity between segments. Although normalized cut is NP-complete for general graphs, in topic segmentation the segments have a natural ordering, so a dynamic programming solution can be formulated.

2.3.2.2 WordNet Based Methods

The lexical chaining approach of Stokes et al. [10] can be considered an extension of Youmans' [53] research. Their work extends the idea of tracking changes in terminology by integrating the semantic relationships between words to create chains of related terms; concentrations of chain start and end positions point to a topic shift. The summarization algorithms based on lexical chains [9, 4] implicitly segment the topics in the text with WordNet relations, but these methods are not evaluated explicitly on the topic segmentation task.

Jobbins [60] models a topic change as a trough in a lexical cohesion function enriched with collocation, reiteration and lexical cohesion relations. Instead of processing only the right context, the DLCA method proposed in Chapter 4 processes both the left and right contexts of each text block. This enables DLCA to observe both a low level of lexical cohesion as the topic on the left ends, and a high level of lexical cohesion formed with the topic on the right.

2.3.2.3 Topic Model Based Methods

Eisenstein and Barzilay [61] define topic segments by probabilistic models with Dirichlet distributions. The parameters of the probabilistic models are inferred from a training corpus. Their model also integrates cue phrases that reflect the continuation of topics, such as "because" and "however"; the cue phrases are likewise inferred from the training set.

Brants et al. [62] use probabilistic latent semantic analysis [63] to define topic segments with respect to topics that are learned by fitting the word distributions in the text to a probabilistic model. The model defines the occurrence of words as a three-level probability, where the first level is the word, a hidden topic variable is in the middle, and the third level is the sentence. The probabilities for individual words and topics are learned by an expectation maximization procedure.

Misra et al. [64] further investigate topic model based segmentation through the use of Latent Dirichlet Allocation (LDA). LDA is trained on a similar document set (in fact a portion of the test data) to build the topic models observed in the domain. The topics in the LDA model are then fitted to different possible segmentations using a dynamic programming algorithm, preferring segments whose words are more probable under one LDA topic than under the others.
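As an illustration of how a fitted topic model characterizes candidate segments, the sketch below uses the gensim library on a toy corpus; the cited works use their own implementations, corpora and training regimes.

# Fit a small LDA model and inspect the topic mixture of each candidate segment.
from gensim import corpora, models

segments = [["wolf", "dog", "fur", "tail"],
            ["tree", "branch", "root", "leaf"],
            ["dog", "wolf", "breed", "coat"]]

dictionary = corpora.Dictionary(segments)
bow = [dictionary.doc2bow(seg) for seg in segments]
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=10, random_state=1)

for seg in bow:
    print(lda.get_document_topics(seg))   # topic mixture for each candidate segment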

Although it is not a probabilistic topic model, the follow-up work of Choi [65], which applies SVD-based LSA to the sentence-term matrix, is related to this category, since the reduction in LSA transforms and associates words in a semantic space similar to the notion of topics in LDA or PLSA. Although SVD is used both in Choi's algorithm and in ours, the two differ significantly: in Choi's work the SVD is applied to the sentence-term matrix to form a representation similar to topic models, whereas in this research the SVD is applied to a term-by-term matrix built from a background corpus in order to explicitly model semantic relatedness. Bestgen [66] criticizes Choi for building the semantic space from a corpus that includes the test documents, and shows that including the test data in the semantic space greatly improves the results even though the training set is relatively larger.

2.3.2.4 Evaluation of Topic Segmentation

Topic segmentation algorithms can be evaluated on a test corpus of documents whose topic segments are explicitly annotated by humans. However, the experiments of Hearst [54] show that this task involves some level of subjectivity, as disagreement between different human annotators is observed. An alternative evaluation strategy builds artificial datasets by concatenating different articles to form a longer document; in the concatenated documents, the segment boundaries are simply the start and end points of each article.

As a measure of topic segmentation quality, the recall and precision of the generated topic boundaries can be calculated with respect to the ground truth boundaries. Recall can be defined as the proportion of matched boundaries to the number of boundaries in the ground truth. However, this evaluation metric is criticized because it does not differentiate between a system that places boundaries in the middle of a topic and a system that places boundaries close to the true boundaries.

Two measures have been proposed for measuring the performance of topic segmentation algorithms: the error rate Pk [67] and WindowDiff [68]. The error rate Pk passes a sliding window of words or sentences through the document, whose size is derived from the average segment length of the ground truth segmentation. As the window passes through the text, the agreement between the ground truth and the generated segmentation is calculated. If the two ends of the window are in the same segment in both the ground truth and the generated segmentation, it is considered an agreement; likewise, if the ends are in different segments in both segmentations, it is also an agreement. All other conditions, where the two segmentations disagree, are classified as errors. The probability of disagreement is the error rate.

Pevzner and Hearst [68] criticize Pk by pointing out five different problems with the measurement. As an alternative, they define the agreement per window in terms of the number of segment boundaries falling inside the window in the ground truth and in the evaluated segmentation, penalizing windows where these counts differ. They show that this reformulation resolves the problems of Pk.
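A sketch of both metrics is given below, assuming segmentations are represented as per-sentence segment labels (e.g. [0, 0, 0, 1, 1, 2]); published implementations differ in minor edge-handling details, and k is typically derived from the average reference segment length.

# Pk and WindowDiff over label sequences; lower values mean better agreement.
def p_k(reference, hypothesis, k):
    n = len(reference)
    errors = sum((reference[i] == reference[i + k]) != (hypothesis[i] == hypothesis[i + k])
                 for i in range(n - k))
    return errors / (n - k)

def window_diff(reference, hypothesis, k):
    def boundaries(seg, i, j):
        # number of segment boundaries between positions i and j
        return sum(seg[x] != seg[x + 1] for x in range(i, j))
    n = len(reference)
    errors = sum(boundaries(reference, i, i + k) != boundaries(hypothesis, i, i + k)
                 for i in range(n - k))
    return errors / (n - k)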

2.3.3 Summarization

Summarization has been an active research area since the 1950s. The summarization task can be thought of as a two-level process: content selection / importance identification, and text generation / smoothing of extracts. Previous work on these two phases is presented separately.

2.3.3.1 Content Selection and Importance Identification

A summarization system tries to identify information that is significant enough to be included in the summary. How we write documents, how we form their content, and how we emphasize certain content follow no strict rules, but there are clues that can be exploited to identify important topics and ideas. Summarization research investigates these different clues. It is not possible to claim that any single feature used in summarization yields the best results for all text genres.

2.3.3.1.1 Methods Using Position in Text

Authors tend to follow certain patterns when positioning important content. Although this depends on the genre and domain of the text, the general belief is that important content is usually positioned in the first sentences. In fact, a very simple and surprisingly successful method for summarization is the selection of the first sentences of the text. Brandow, Mitze and Rau [69] achieved very good results on news articles by selecting the first sentences as summaries. Edmundson [70], Kupiec, Pedersen and Chen [71], and Teufel and Moens [72] all experimented with similar algorithms, and report that this simple technique gives the best results on news articles and scientific reports. As a matter of fact, in the Document Understanding Conference 2004, the baseline algorithm, which simply extracts the first sentences, was one of the best scoring algorithms when the target summary was limited to 75 characters.

Lin and Hovy [73] provide an extensive study of deriving the optimum position policy for different domains, and report that different text genres have different focus positions.

2.3.3.1.2 Methods Using Cue-Phrases and Formatting

Some phrases are used to emphasize importance in text; these are called bonus phrases. Other cue phrases indicate that a sentence is not important; these are called stigma phrases. "Significantly", "in conclusion" and "last but not least" are examples of bonus phrases, while "hardly" and "impossible" are examples of stigma phrases. Teufel et al. [72] use cue phrases on science articles, while Kupiec et al. [71] and Edmundson [70] use cue phrases to improve existing summarization systems. Exploiting formatting features like bold words and headers can also improve summarization performance: Edmundson [70] and Teufel et al. [72] have shown that simple heuristics taking advantage of formatting features improve the success of a summarizer. Overlap between sentences and the titles or bold phrases can be used as a clue for importance.

2.3.3.1.3 Methods Using Word Frequency

Luhn [74] claims that important sentences contain words that are unusually frequent in the text. This claim has not been verified conclusively; in fact, word frequency decreased the performance of some summarization systems. Edmundson [70] and Kupiec et al. [71] indicate that integrating word frequency into their systems decreased accuracy. However, word repetition by itself is a type of lexical cohesion, and there are lexical cohesion based summarization systems that report successful results. Word frequency alone has not proven to be a powerful clue. Some systems take advantage of word repetition with information retrieval techniques, but the theory behind those algorithms is more sophisticated, and we prefer to classify them as lexical cohesion based summarization systems.

2.3.3.1.4 Methods Using Coherence

Much of the research on coherence based summarization focuses on rhetorical theory. Marcu's method [75] is an example of coherence based summarization: Marcu uses rhetorical parsing to model the discourse structure of the text as a tree-like structure. From local structures up to the whole text, the relationships between clauses are determined, taking advantage of cue phrases. Figure 2.7 shows an example from Marcu [75] and Figure 2.8 shows the discourse structure for this text.

[With its distant orbit 1] [50 percent farther from the sun than Earth 2] [and slim atmospheric blanket, 3] [Mars experiences frigid weather conditions. 4] [Surface temperatures typically average about -60 degrees Celsius (-76 degrees Fahrenheit) at the equator 5] [and can dip to -123 degrees C near the poles. 6]

Figure 2.7: Text fragment to demonstrate coherence based techniques

Figure 2.8: Example Discourse structure for the text in Figure 2.7

From the discourse structure, Marcu derives a scoring function for each unit depending on the relation types and the depth of the tree below each node. Marcu's work achieved good results and is considered one of the best summarization algorithms available. However, building discourse trees is a difficult problem, and the accuracy of automatically built discourse trees is questionable; the method is therefore limited by the difficulties of modelling coherence structure.

2.3.3.1.5 Methods Using Lexical Cohesion

Radev et al. [76] attack the automated summarization problem using information retrieval techniques. They use the vector space model and clustering to find the central and salient sentences, representing sentences as weighted vectors of tf×idf values, where TF is the term frequency and IDF is the inverse document frequency, computed from the number of documents in the corpus that contain the word. Note that this approach depends on

word frequencies, so it only takes advantage of word repetition, which is one of the lexical cohesion types. Erkan [77] improved the performance of the summarizer by introducing a selection procedure similar to Google's PageRank [48]. This summarization system is part of the MEAD summarization toolkit [78] and is a milestone in the automated summarization literature.
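A sketch of this graph-based selection idea is given below: sentences are scored by a PageRank-style iteration over their cosine-similarity graph. The tf*idf weighting and similarity thresholding of the published systems are simplified to raw term counts here.

# PageRank-style centrality over a sentence cosine-similarity graph.
import math
from collections import Counter

def cosine(a, b):
    shared = set(a) & set(b)
    num = sum(a[t] * b[t] for t in shared)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def sentence_scores(sentences, damping=0.85, iters=50):
    """sentences: list of token lists; returns one centrality score per sentence."""
    vecs = [Counter(s) for s in sentences]
    n = len(vecs)
    sim = [[cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)] for i in range(n)]
    row_sums = [sum(row) or 1.0 for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iters):                      # power iteration
        scores = [(1 - damping) / n + damping *
                  sum(sim[j][i] / row_sums[j] * scores[j] for j in range(n))
                  for i in range(n)]
    return scores                               # higher score = more central sentence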

Lexical chains are structures for modelling lexical cohesion computationally; they are sets of related words. Halliday and Hassan [2] present one of the first works on lexical cohesion. Morris and Hirst [46] discuss an algorithm for building lexical chains. Hirst and St-Onge [79] present the first algorithm where lexical chains are built using WordNet, and use the chains to detect and correct malapropisms1. Barzilay [9] presents her lexical chaining algorithm and uses lexical chains to extract summaries; her algorithm achieved good results in evaluations. Usually, in algorithms using lexical chains, the text units traversed by the strongest lexical chains are selected. Following Barzilay's algorithm there have been many lexical cohesion based summarization techniques. Silber and McCoy [80] present an efficient summarizer based on lexical chains, focused on improving the running time of the chaining algorithm. Brunn et al. [81, 82] propose a different sentence selection procedure using lexical chains.

2.3.3.2 Text Generation, Text Compression and Smoothing

Ideally, a summarization system should interpret the text, transform it into a semantic representation, and generate the summary from that representation. Interpreting text is a hard problem that requires extensive domain knowledge. Some researchers have tried to fill predefined templates to create summaries, treating summarization as an information extraction problem [83]; however, this approach is too domain-specific to generalize.

Paraphrasing or reducing the sentences extracted by extractive summarization systems can produce more coherent and shorter summaries. Knight and Marcu [84] present a text compression algorithm. Their work uses probabilistic models

1Malapropism is the unintentional misuse of a word by confusion with one that sounds/spells similar

and describes an EM (expectation maximization) algorithm to reduce sentences to shorter ones using syntactic parse trees. Their algorithm is also able to fuse multiple sentences into one.

Mani et al. [85] define a summary revision system which takes an extract and produces a shorter and more readable version of it; their system tries to resolve dangling references. Carbonell and Goldstein [86] describe a system called Maximal Marginal Relevance (MMR), whose metric measures the similarity between sentences and penalizes repetition in the summary.

Barzilay and McKeown [87] describe a sentence fusion algorithm, a text-to-text generation method. This algorithm is important in the sense that it can paraphrase sentences: with such a tool it is possible to convert extracts into abstracts without fully understanding the text. Their algorithm takes similar sentences as input and outputs a fusion of these sentences.

2.3.3.3 Evaluating Summarization Systems

Evaluation of summaries is a hard task, as summaries are subjective: different people write different summaries for the same document. Evaluation of summaries is a research area in itself. Evaluation methodologies are divided into two main categories. Intrinsic evaluations try to measure the quality of the summary by defining quality metrics for the summary text. For example, an intrinsic evaluation of the selected content's importance is usually done by comparing system generated summaries to model summaries written by humans, measuring the overlap between the model and the automatically extracted summary; ROUGE [88] is such a method. The coherence of a summary is usually evaluated by human judges, as there are no automatic evaluation methods for coherence.
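The core of this overlap computation is simple; the sketch below shows ROUGE-style n-gram recall, while the official toolkit additionally handles stemming, stopword options and multiple reference summaries.

# N-gram recall of a system summary against a model (human) summary.
from collections import Counter

def rouge_n(system_tokens, model_tokens, n=1):
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    sys_ngrams, model_ngrams = ngrams(system_tokens), ngrams(model_tokens)
    overlap = sum((sys_ngrams & model_ngrams).values())   # clipped n-gram matches
    total = sum(model_ngrams.values())
    return overlap / total if total else 0.0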

Extrinsic evaluations use the summaries in other tasks. For example, human annotators use the output summaries to categorize documents, and the accuracy of the humans determines the quality of the summaries. Mani et al. [89] describe an extrinsic evaluation methodology based on the usefulness of the system summaries.

2.3.4 Keyphrase Extraction

Keyphrase extraction algorithms typically score each candidate phrase with a keyphraseness scoring function, and sort the candidates accordingly. This function is implemented by either a supervised or an unsupervised learning algorithm. Supervised keyphrase extraction algorithms train a keyphraseness function automatically from observations in a corpus, whereas unsupervised keyphrase extraction algorithms depend on scoring functions built on assumptions and observations. Our keyphrase extraction algorithm is a supervised learning algorithm that uses QPP features.

GenEx [90] is a supervised keyphrase extraction algorithm formed of two components: an extractor and a genetic algorithm. The extractor is a text-processing tool controlled by parameters and rules; for example, the aggressiveness of the stemmer and the maximum number of terms allowed in an extracted keyphrase are two of its parameters. GenEx uses a genetic algorithm and a training set to find the most suitable parameter values for a domain. The population of the genetic algorithm is formed of parameter sets, each representing a configuration of the extractor, and the fitness function is the precision of the extractor executed with that configuration. The output of the genetic algorithm is the set of rules suitable for the corpus domain and genre. GenEx uses the frequency and first occurrence position of terms to score each phrase's importance.

The Kea algorithm uses Naive Bayes to learn a keyphraseness probability from the distribution of in-document features in the training data. The outline of the Kea algorithm is identical to the outline of the system used in our experiments. Frank et al. [7] report the results of Kea and GenEx [6] to be statistically indistinguishable. I also use Naive Bayes to learn the keyphraseness probabilities of our QPP features.
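A simplified sketch of this supervised keyphraseness idea is shown below, with two hypothetical in-document features per candidate phrase; Kea itself discretizes its features, and the numbers here are invented for illustration only.

# Naive Bayes over [tf*idf, relative first-occurrence position] features.
from sklearn.naive_bayes import GaussianNB

X_train = [[0.42, 0.02], [0.31, 0.10], [0.05, 0.80],
           [0.02, 0.55], [0.28, 0.05], [0.03, 0.90]]   # hypothetical feature vectors
y_train = [1, 1, 0, 0, 1, 0]                           # 1 = keyphrase, 0 = not

model = GaussianNB().fit(X_train, y_train)

candidates = {"lexical cohesion": [0.35, 0.04], "the results": [0.04, 0.60]}
for phrase, feats in candidates.items():
    print(phrase, model.predict_proba([feats])[0][1])  # keyphraseness score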

Recent research aims to improve the effectiveness of keyphrase extraction by integrating additional features, such as those exploiting the structural patterns of a document, the syntactic patterns of phrases, semantic knowledge, and the relationships between extracted keyphrases. For instance, syntactic features are able to eliminate candidate phrases that are unlikely to be keyphrases (as in the example

of the phrase machine learning introduces, which ends with a verb and is uncommon for a keyphrase). Hulth [91] investigates the effectiveness of using part of speech (PoS) tags in keyphrase extraction. She evaluates several candidate phrase extraction strategies with and without PoS tags; using a rule induction method, PoS tag patterns for keyphrases are learned. Hulth's experiments indicate that keyphrases have common PoS tag patterns. Furthermore, Barker et al. [92] use a chunker to detect noun phrases with simple syntactic patterns, and assign a score to each phrase depending on the tf×idf value of the phrase's head noun. The in-document features are used to evaluate phrases extracted from the document by a noun phrase chunker. Recent work, including ours, uses syntactic filters when finding candidate keyphrases.

The cohesive ties between keyphrases are also exploited in keyphrase extraction, using external knowledge obtained either from thesauri or from statistical information extracted from corpora. Turney [93] notes that there should be cohesion between the extracted keyphrases. He measures the pairwise semantic relatedness between two keyphrases by mining the results of a search engine; the keyphrases produced by Kea are then reclassified with a second classifier using the cohesion features. This method depends on the output of Kea and on in-document features. For extracting keywords, Ercan and Cicekli [5] use the WordNet thesaurus [22] to integrate semantic relationships between phrases with features extracted from the lexical chains of the original document; thus their method cannot handle keyphrases of length greater than one and is limited to keyword extraction. Nguyen et al. [94] integrate both PoS tags and structural features into the feature set of Kea, where the distinguished structural properties of research articles are important cues for keyphrase extraction. They use the occurrence distribution of phrases with respect to the sections of the processed article; for example, a phrase appearing in the title or abstract is more likely to be a keyphrase than a phrase appearing only in the middle of the document. This method is designed specifically for research articles and is not a generic solution.

Mihalcea et al. [95] model the text as a graph, where the vertices are terms appearing in the given text, and an edge exists between two terms if they co-occur within a distance. Keyphrases are extracted by running a social ranking algorithm called TextRank on this network, which is based on the PageRank algorithm [16]. The phrase co-occurrence graph is usually very sparse, especially in short documents where term frequencies are low. Recently, Wan et al. [96] have enriched

this graph by integrating co-occurrence statistics observed in similar documents, that is, the nearest neighbours of the processed text (NN-TextRank). They evaluate their algorithm on news articles, which are shorter than research articles, and report improvements over TextRank and over a baseline that scores phrases using tf*idf values. Although both the QPP features and NN-TextRank use a background corpus in order to handle short documents, the two approaches differ both in their methods and in the features applied.
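A sketch of the co-occurrence graph and ranking step used by TextRank-style methods is given below; real systems additionally filter candidates by part of speech and tune the co-occurrence window.

# Terms as nodes, window co-occurrence as edges, PageRank as the score.
import networkx as nx

def textrank_keywords(tokens, window=3, top_k=5):
    graph = nx.Graph()
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[j] != w:
                graph.add_edge(w, tokens[j])
    scores = nx.pagerank(graph)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

tokens = "lexical cohesion analysis models lexical cohesion in a text".split()
print(textrank_keywords(tokens))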

Chapter 3

Semantic Relatedness

Lexical cohesion analysis can be used in a range of tasks, such as malapropism detection [19], summarization [4, 9], topic segmentation [10, 11] and information retrieval [36]. In lexical cohesion analysis, the relationships between words are used to model the discourse and the topics discussed. In its most simplistic form, word repetition can be used to model the document's content [54, 36]. This can be generalized by considering semantic relationships in a knowledge base such as a thesaurus [9, 4, 10, 11]. However, these methods mostly use strong relationships like synonymy and neglect weaker ones; in a way, the analysis is constrained to only a small portion of the available data. As an alternative, semantic relatedness measures that can be calculated for any two words in the vocabulary can be used in such tasks.

In this chapter, semantic relatedness (SR) is defined, together with a discussion of how it should behave as a function. Following this discussion, automated methods for building SR functions are introduced in two categories: in the first, WordNet/thesaurus based methods are introduced, and in the second, semantic spaces are defined and the methods used to build them are discussed. Given algorithms from both categories, their effectiveness under different variations is evaluated using the word association, WordNet classical relationship mapping, and TOEFL near-synonymy question tasks. Among these, the WordNet mapping task is first introduced in this research, and promises to be a more stable and comprehensive evaluation method than word association and TOEFL near-synonymy questions.

3.1 Semantic Relatedness Methods

Semantic relatedness is a measure that quantifies the strength of the relationship between lexical items while being neglectful of how they are related: the measure does not try to identify the type of the relationship, only its strength. Formally, semantic relatedness can be defined as a symmetric function SR(wi, wj) which returns its maximum value when the two words are maximally related in meaning and its minimum value when the two words are completely unrelated. Note that this definition is still vague, as how and to what extent words can be related to each other is not clear and is subjective. From a statistical point of view, the ground truth for relatedness can be defined as the average of human judgements, i.e. word pairs that are commonly identified as related by humans should get high scores. Most word pairs denoting similar concepts can be classified as related, but relatedness is not limited to similarity.
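Assuming, for concreteness, that scores are normalized to the interval [0, 1] (different measures use different ranges), the desired behaviour can be summarized as:

\[
  SR : V \times V \rightarrow [0, 1], \qquad SR(w_i, w_j) = SR(w_j, w_i),
\]
\[
  SR(w_i, w_j) \approx 1 \text{ for strongly related pairs}, \qquad
  SR(w_i, w_j) \approx 0 \text{ for unrelated pairs}.
\]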

The need to distinguish similarity from relatedness is observed for antonym word pairs, e.g. "black" and "white". Since these words have opposite meanings, they should be considered dissimilar concepts; however, in the context of semantic relatedness, they are considered strongly related. From this perspective, semantic relatedness should be perceived as a measure different from semantic similarity. Nevertheless, if it were possible to enumerate all the semantically related and all the similar words of a given word, the set of related words would probably be a superset of the similar ones.

A computational model for measuring semantic relatedness requires a prior knowledge base that can provide information about the use of words in different contexts. The knowledge base should somehow encapsulate the average human judgements about word pairs. This research explores two such knowledge bases: the first is a manually built electronic thesaurus, and the second is a semantic space built using statistics gathered from a raw text corpus. WordNet based semantic relatedness measures have been introduced in different works since the 90's [19, 97, 98, 99, 100]. There is also extensive research on semantic relatedness measures built from statistics gathered from raw text corpora [40, 39, 38, 8, 6, 7]. In this work, methods from both of these categories are evaluated and compared. The literature lacks such comparisons, as most previous work focuses on a single category. The experiments in this research are valuable because the effectiveness and efficiency of each method is discussed in detail. The tests are

carried out on two different languages, Turkish and English, in contrast to the large body of research on semantic relatedness that focuses only on English. The experiments and the system built in this research are, to the best of our knowledge, pioneering work for the Turkish language. Due to the agglutinative nature of Turkish, it is interesting to investigate whether the ideas effective in English are also effective in Turkish.

3.1.1 WordNet Based Semantic Relatedness Measures

WordNet is a semantic network and thesaurus manually built by linguists. Given this knowledge base, different measures have been proposed for quantifying the semantic relatedness between two terms. For completeness these measures are defined and compared in terms of effectiveness and efficiency.

3.1.1.1 Structure of WordNet

Before introducing the SR methods, it is useful to discuss the knowledge base itself: the structure of WordNet. In the context of WordNet, a word denotes a specific symbolic representation, i.e. the spelling or pronunciation of a word, while a sense denotes a meaning or concept. In the semiotics literature, a word is defined as the signifier and the meaning as the signified: the signifier (word) is the text or sound we use in discourse, and the signified is the concept (sense) intended by the signifier.

WordNet is formed of words, which can point to different senses. The relationship between words and senses is many-to-many: a word can be associated with different senses, and a sense can be associated with multiple words. This is a consequence of two linguistic phenomena: homonymy, in which a word can have multiple meanings, and synonymy, in which the same sense can be represented by different words. Hirst [101] gives the example of "bark" for homonymy, where one of its senses is "the noise a dog makes" and another is "the stuff on the outside of a tree." Another related term is polysemy, which describes words that have different but related meanings. For example, Table 3.1 lists the different senses of the word mouth defined in WordNet. While the first two meanings are polysemous with each other, as they practically refer to the same concept, it is harder to classify the third, as it has some relation with the first two senses but still refers to a completely different concept. Since polysemy involves similarities between different concepts, it is harder to identify or even define than homonymy.

ID | Synonym set | Meaning
1 | mouth, oral cavity, oral fissure, rima oris | the opening through which food is taken in and vocalizations emerge
2 | mouth | the externally visible part of the oral cavity on the face and the system of organs surrounding the opening
3 | mouth | an opening that resembles a mouth (as of a cave or a gorge)
4 | mouth | the point where a stream issues into a larger body of water
5 | mouth | a person conceived as a consumer of food
6 | mouthpiece, mouth | a spokesperson (as a lawyer)
7 | sass, sassing, back talk, lip, mouth | an impudent or insolent rejoinder
8 | mouth | the opening of a jar or bottle

Table 3.1: Different noun senses for the word “mouth” defined in WordNet Relation Examples In Categories Synonymy N: car, automobile, V: run, escape, JJ: N, V, JJ, R black, dim Hypoymy N: car, motor vehicle, V: run, hurry N, V Hpernymy N:car, motor vehicle, V: hurry, run N, V Meronymy N: car, accelerator N Holonymy N: accelerator, car N Antonymy N: dead, living, V: die, be born , JJ: N,V,JJ black, white V: walk, march V

Table 3.2: Relationship types in WordNet and the categories they are defined in to the same concept, it is harder to classify the third as it has some relation with the first two senses but still refers to a completely different concept. As polysemy pertains some similarities between different concepts, it is harder to identify or even define them when compared to homonomy.

As in any dictionary, the senses are categorized by word classes, namely nouns, verbs, adjectives and adverbs. What differentiates WordNet from a dictionary is its sense graph, where vertices are senses and edges are relationships between these senses. The relationships in WordNet, also known as classical relationships, are defined according to the word classes. As depicted in Table 3.2, most of the relations are between nouns and verbs; these two grammatical categories contribute more to the meaning than adjectives and adverbs. However, it should be noted that polysemy is more common in verbs, and distinguishing the intended sense in a text is harder for verbs than for nouns.

Figure 3.1: An excerpt of WordNet related to the concepts of colors

The most prominent relationship types in WordNet are hyponymy and its inverse, hypernymy. These two relationships form a hierarchy, which can be thought of as a taxonomy grouping concepts by their attributes. For two senses gi and gj, if there is a hyponym relationship from gi to gj, then gi is a type of gj. The inverse of the relation, the edge from gj to gi, is classified as hypernymy, indicating that gj is a generalization of gi. For example, consider the subgraph of WordNet shown in Figure 3.1, in which the term red is a hyponym of colour and colour is a hypernym of red. These relations are transitive: if gi is a type of gj and gj is a type of gk, then it immediately follows that gi is a type of gk. Following the colour example, colour is a type of visual property, and thus red is also a visual property.

The hypernym/hyponym hierarchy does not contain any cycles. Given the transitive property, if there were a cycle containing a node gi, then gi would be both its own hyponym and hypernym, which contradicts the definition, as something cannot be both a generalization and a specialization of the same concept at the same time. For this reason, the graph formed by these relations is acyclic. Two senses gi and gj that are direct hyponyms of a common sense gk are usually considered to be highly related to each other.

For example, red and blue are considered highly related to each other, as they are both hyponyms of chromatic color. Such senses are known as hyponym/hypernym siblings.

Another pair of relationship types between nouns is meronymy and holonymy. Meronymy denotes that a concept is a part of another concept, while holonymy denotes a more complete concept formed of different parts. For example, a wheel is a part of a car: wheel is a meronym of car, and car is a holonym of wheel. Meronyms are further classified into subgroups, such as what a concept is made from and what members it is formed of. In a way, meronymy provides a well-defined relationship for specifying the attributes of nouns. For each meronym relation there is a holonym relation, and vice versa. Meronymy is also transitive, since a part can itself be composed of parts: a finger is a part of a hand, which is a part of an arm, thus a finger is a part of an arm.

The meronyms of a concept gi are inherited by any direct or indirect hyponym (through the transitive property). This is more obvious when meronyms are viewed as attributes of a concept and hyponyms as more specific types of the same concept. For example, rib, chest and tail are meronyms of vertebrate; mammal inherits these meronyms and defines its own meronyms, coat and hair. With inheritance it is possible to define shared attributes of concepts without explicitly defining additional relationships at each level of the hyponym hierarchy.

Another relationship type in WordNet is antonymy, which holds between concepts that have opposite meanings; for example, black and white, or increase and decrease, are antonyms of each other. Such concepts are dissimilar, but this dissimilarity is usually considered a sign of a high level of semantic relatedness. Antonymy is not specific to nouns and can exist between verbs, adjectives and adverbs.

The second dominant word class is verbs. Two relationship types are specific to verbs: entailment and troponymy. A verb can entail another verb; for example, walking entails stepping, as walking is done through stepping. This relationship is similar to the meronymy/holonymy relationship between nouns. The troponymy relationship describes the "manner" of an action. For example, marching is a troponym of walking, where in marching the steps taken are regular and the walk is fast. While these relationships can be useful in analysis, the number of such instances is low, and verbs exhibit more polysemy, creating a higher level of ambiguity that is harder to resolve through computational models. Thus, we have chosen to ignore verbs in the algorithms described in this research.

Adjectives are used to describe nouns, and the concepts they express usually have a noun form. The relationships between adjectives are scarce and do not contribute much to the meaning of a text. Adverbs are used to modify word classes other than nouns. In lexical cohesion analysis using WordNet, adjectives, adverbs and verbs are almost always ignored; the lexical chaining algorithms of Barzilay and Elhadad [9] as well as Ercan and Cicekli [4] consider only nouns. Following this paradigm, the discussion is limited to nouns.

Within the scope of this research, a WordNet library is implemented with the following objectives: providing a uniform interface for both the Turkish and English WordNets (and potentially other thesauri), implementing different semantic relatedness functions, and answering queries quickly by storing the sense graph in memory. The English version of WordNet is loaded from the database files of Princeton WordNet [22]. The Turkish version is loaded from the XML files provided by the BalkaNet Turkish Project [102].

Table 3.3 compares the English and Turkish WordNets with some statistics, hinting at their coverage and completeness. The English WordNet has approximately 5 times more words and about 8 times more senses than the Turkish WordNet. This alone is an important indicator that the Turkish WordNet is far from complete and unable to cover many concepts in the language. The relationship density of the sense graph can be estimated from the number of relations relative to the number of senses; the average number of relationships per sense is higher for Turkish. We think that this, too, is a consequence of the incompleteness of the Turkish WordNet: common concepts that are highly related to each other are present, while less common and domain-specific concepts are absent.

Property | Turkish | English
Number of words | 15,556 | 81,426
Number of senses | 14,795 | 117,097
Number of relationships | 35,802 | 211,156
Number of Hyponym/Hypernym relationships | 25,794 | 167,298
Number of Meronym/Holonym relationships | 10,008 | 43,858

Table 3.3: Comparison of nouns in Turkish and English WordNets

3.1.2 WordNet Based Semantic Relatedness Functions

With WordNet it is trivial to identify the semantic relatedness of two words that are directly connected to each other through a classical relationship. For example, for the two synonyms data and information it is easy to report a high level of semantic relatedness. However, a semantic relatedness function should be able to measure the semantic relatedness of all possible sense pairs: it should cover the complete n^2 space, where n is the number of senses, and it should be able to differentiate the level of SR between data and chicken from that between data and mathematics.

In a graph, the distance between two nodes is usually calculated as the length of the shortest path between them. The length of the shortest path between two senses is denoted by the function len(gi, gj). While using shortest paths is an intuitive and straightforward way to calculate SR, it is not effective. The problem arises because the paths in WordNet are shaped by the taxonomy hierarchy, which is not consistent and homogeneous throughout the graph. For example, len(country, Greece) = 3 while len(country, Turkey) = 1, as depicted in Figure 3.2. A human would probably give a similar score to these two relationships, as both concepts are related to country through the same relationship type. Furthermore, while len(Turkey, Greece) = 4, the shortest path to Jamaica is len(Turkey, Jamaica) = 2. A human would probably expect Turkey to be more related to Greece than to Jamaica, given their geographic and historical ties. In this example, the path to Greece is longer because Greece is classified both as a European country and as a Balkan country, which is correct when the aim is to classify and categorize concepts. It could be argued that, since hyponymy and hypernymy are transitive, an edge exists between Greece and country and the path length should be considered as 1. However, in that case both len(country, Greece) and len(entity, Greece) (all senses are hyponyms of entity) would be 1, which would certainly contradict human judgements.

Figure 3.2: An excerpt of WordNet related to the concepts of country, Greece and Turkey
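To make the path computation concrete, the following sketch computes len(gi, gj) with a breadth-first search over a small, hypothetical undirected sense graph; the node names and edges are illustrative only and do not reproduce WordNet's actual structure or distances.

```python
from collections import deque

# A tiny, hypothetical sense graph; edges stand for classical relations.
graph = {
    "entity": ["object"],
    "object": ["entity", "country", "chicken"],
    "country": ["object", "Turkey", "European_country", "Jamaica"],
    "European_country": ["country", "Greece"],
    "Greece": ["European_country"],
    "Turkey": ["country"],
    "Jamaica": ["country"],
    "chicken": ["object"],
}

def length(g_i, g_j):
    """len(gi, gj): number of edges on the shortest path between two senses."""
    if g_i == g_j:
        return 0
    visited, queue = {g_i}, deque([(g_i, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour == g_j:
                return dist + 1
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None  # the two senses are not connected

print(length("country", "Turkey"))  # 1
print(length("country", "Greece"))  # 2 in this toy graph
```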

Budanitsky and Hirst [19] cover most of the WordNet based semantic relatedness functions investigated in our research. For the sake of completeness, the semantic relatedness functions used in our experiments are introduced in this section. The hyponym/hypernym hierarchy of WordNet 1.5 consisted of multiple root nodes, but in version 2.1 there is only one root node, entity. Budanitsky and Hirst use WordNet 1.5 and introduce a new virtual root node connected to the multiple roots in order to simplify the discussion. Since WordNet 2.1 is used in our experiments, the root node is entity.

In order to cope with the problems caused by non-uniform paths in the graph, SR measures are designed to be aware of the level of the senses in the hierarchy. Formally, the depth of a sense gi is calculated using Equation 3.1 as the length of the path from the root node (entity) to gi in the hypernym/hyponym hierarchy. As in the example of len(entity, Greece), if one of the senses is in the higher levels of the hierarchy, the strength of the relationship is usually low, as the concept is too general.

dp(gi) = len(entity, gi) (3.1)

When comparing two concepts located in two different sub-hierarchies, their lowest common ancestor in the hierarchy can be used to account for their relatedness. The function lso(gi, gj) finds the lowest common ancestor of gi and gj.

In a way, dp(lso(gi, gj)) measures how specific the relationship between two concepts is. For example, lso(Turkey, Greece) = country with dp(lso(Turkey, Greece)) = 9, whereas lso(chicken, Greece) = object with dp(lso(chicken, Greece)) = 4, which is an indicator of the low relatedness between chicken and Greece.

In discourse, either in text or speech, the words used are not explicitly mapped to their intended senses; they should be disambiguated from the context in which they appear. One possible strategy is to choose the senses that are most related to their neighbouring senses. In the case of semantic relatedness between two arbitrary words, where the context is defined only by the two words, this is equivalent to finding the two senses with the maximum semantic relatedness score. Equation 3.2 is used in the intrinsic evaluation of semantic relatedness methods, where Senses(wi) is the sense set of word wi.

SR(w_i, w_j) = \max_{g_k \in Senses(w_i),\, g_l \in Senses(w_j)} SR(g_k, g_l)    (3.2)
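A direct sketch of Equation 3.2: a sense-level score is lifted to the word level by maximizing over all sense pairs. Here `senses_of` and `sense_sr` are hypothetical stand-ins for a WordNet sense lookup and any of the sense-level measures defined in this section.

```python
def word_sr(w_i, w_j, senses_of, sense_sr):
    """Word-level SR: maximum sense-level SR over all sense pairs (Eq. 3.2)."""
    return max(sense_sr(g_k, g_l)
               for g_k in senses_of(w_i)
               for g_l in senses_of(w_j))
```

For instance, `sense_sr` could be the Wu and Palmer or Leacock and Chodorow function introduced below.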

3.1.2.1 Sussna’s Depth Relative Scaling

Sussna [97] uses edge weights in order to account for the problems in the taxonomy. The weight of each edge is determined by considering the depth of the vertices and the relation type. Each node contributes to the edge weight depending on the relation type (hypernymy/hyponymy, meronymy/holonymy or antonymy) and the number of outgoing edges of the same relation type. Equation 3.3 gives a node's contribution to the edge weight with respect to a relationship type r, where max_r and min_r are the maximum and minimum scores for r and edges_r(gi) is the number of relations of type r leaving the concept gi. The contribution of the concept decreases as the number of relationships the node participates in increases.

wt_r(g_i) = max_r - \frac{max_r - min_r}{edges_r(g_i)}    (3.3)

The weight of the edge between gi and gj is calculated using Equation 3.4. The notation r' emphasizes that, although hypernymy/hyponymy and meronymy/holonymy are bi-directional, an edge of type r from gi to gj has an inverse edge of type r' from gj to gi. In this notation, if r is antonymy, then r' is again antonymy. The final edge weight is the average of the two vertex contributions, divided by the depth of the deeper of the two concepts.

weight(g_i, g_j) = \frac{wt_r(g_i) + wt_{r'}(g_j)}{2 \times \max\{dp(g_i), dp(g_j)\}}    (3.4)

The semantic distance between gi and gj in Sussna's method is simply the length of the shortest path between the two concepts, len(gi, gj), computed over the edge weights of Equation 3.4. Since the edge weights are non-uniform and non-negative, Dijkstra's algorithm can be used to calculate the shortest path. In order to convert the semantic distance into a semantic relatedness (similarity) function SR_Sussna(gi, gj), the distance is subtracted from the maximum distance possible in WordNet.

3.1.2.2 Wu and Palmer Similarity

Wu and Palmer [98] use lso(gi, gj), the most specific common ancestor, to calculate the similarity between two concepts. The lengths of the paths from each concept to the common ancestor are scaled by the depth of the common ancestor: semantic relatedness should increase as the depth of the common ancestor increases, and decrease as the lengths of the paths from gi and gj increase. Equation 3.5 quantifies these motivations by considering the two paths that pass through lso(gi, gj) and lead to each concept. The measure divides each path into two parts, the first from the root node to the common ancestor and the second from the ancestor to the concept; the similarity is the proportion of the first parts to the whole paths.

SR_{WP}(g_i, g_j) = \frac{2 \times dp(lso(g_i, g_j))}{len(g_i, lso(g_i, g_j)) + len(g_j, lso(g_i, g_j)) + 2 \times dp(lso(g_i, g_j))}    (3.5)
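The sketch below spells out Equation 3.5 over a hypothetical single-parent hypernym map; the `parents` dictionary and the helper functions are illustrative, not the data structures used in the implemented library.

```python
# Hypothetical single-parent hypernym map: child sense -> parent sense.
parents = {
    "object": "entity", "location": "object", "country": "location",
    "European_country": "country", "Greece": "European_country", "Turkey": "country",
}

def ancestors(g):
    """Chain of senses from g up to the root, including g itself."""
    chain = [g]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain

def depth(g):                       # dp(g): number of edges from the root to g
    return len(ancestors(g)) - 1

def lso(g_i, g_j):                  # lowest (deepest) common ancestor
    common = set(ancestors(g_i)) & set(ancestors(g_j))
    return max(common, key=depth)

def wu_palmer(g_i, g_j):
    """SR_WP as in Equation 3.5."""
    a = lso(g_i, g_j)
    len_i = depth(g_i) - depth(a)   # path length from g_i up to the common ancestor
    len_j = depth(g_j) - depth(a)
    return 2 * depth(a) / (len_i + len_j + 2 * depth(a))

print(wu_palmer("Turkey", "Greece"))  # 0.666..., a fairly specific common ancestor
```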

3.1.2.3 Leacock and Chodorow's Normalized Path Length

Leacock and Chodorow [99] propose to scale the shortest path between two concepts by the maximum depth of WordNet, as shown in Equation 3.6. Since MaxDepth is the maximum length of a shortest path in the hierarchy, the fraction inside the logarithm is between 0 and 1. The magnitude of the logarithm grows rapidly as its argument approaches zero, so concept pairs connected through fewer nodes are rewarded by this similarity function.

SR_{LC}(g_i, g_j) = -\log\left(\frac{len(g_i, g_j)}{2 \times MaxDepth}\right)    (3.6)
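Equation 3.6 reduces to a one-liner once the shortest-path length is known; the sketch below assumes a precomputed path length (for example from the BFS sketch above) and an illustrative MaxDepth value, which in practice depends on the WordNet version.

```python
import math

MAX_DEPTH = 20  # illustrative; the real value is the maximum taxonomy depth

def leacock_chodorow(path_len):
    """SR_LC for a given shortest-path length (Eq. 3.6); path_len must be >= 1."""
    return -math.log(path_len / (2 * MAX_DEPTH))

print(leacock_chodorow(1))   # short paths yield large scores
print(leacock_chodorow(10))  # longer paths yield smaller scores
```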

3.1.2.4 Resnik’s Information-based Approach

The SR measures above base their scoring functions only on the internal structure of WordNet, penalizing general concepts located in the higher levels of the hierarchy. However, the problem of non-uniform paths also exists at the lower levels of the hierarchy, as some concepts are relatively richer in terms of number of senses. One example is the concept of animals: since the categorization and taxonomic classification of animals is well studied, the hierarchy below the sense animal consists of many levels. It can be argued that path-based methods will give higher scores to concepts that are less common or less finely categorized.

One natural way to account for this problem is to degrade the influence of concepts that are in the hierarchy for the accuracy of the taxonomy but are rarely used in discourse. Consider the path connecting fish to the root of the hierarchy: it is formed of concepts such as aquatic vertebrate, vertebrate and chordate, which are important for the taxonomy but rarely used in common discourse.

Resnik [100] proposes to integrate the occurrence probability of concepts in order to decrease the effect of such intermediate concepts. The count for a concept gi is the sum of all occurrences of gi itself and the occurrences of its child concepts gj in the hierarchy. This count can be calculated recursively by first counting all occurrences of each concept and then summing these counts from the leaf nodes up to the root. In this way, the final count of the root concept is equal to N, the total number of words appearing in the corpus. The probability of occurrence p(gi) is the final count of gi divided by the total number of words in the corpus. In this setting, the root node has a probability of 1, as it subsumes all the concepts in the hierarchy.

Resnik [100] defines the semantic relatedness between two concepts as the information content of their lowest common ancestor; Equation 3.7 is Resnik's SR function. The similarity of the concepts Turkey and Greece is then the information content of country, which is the lowest common ancestor of these concepts.

SR_{Resnik}(g_i, g_j) = -\log(p(lso(g_i, g_j)))    (3.7)
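The sketch below illustrates Resnik's measure under simplifying assumptions: a hypothetical single-parent hierarchy and made-up concept counts. Counts are propagated from each concept up to the root before the information content of the lowest common ancestor is taken.

```python
import math

# Hypothetical hierarchy (child -> parent) and raw concept counts from a corpus.
parents = {"object": "entity", "country": "object",
           "Turkey": "country", "Greece": "country", "chicken": "object"}
raw_counts = {"entity": 0, "object": 10, "country": 50,
              "Turkey": 200, "Greece": 150, "chicken": 300}

# Every occurrence of a concept also counts for all of its ancestors.
counts = dict(raw_counts)
for concept, count in raw_counts.items():
    node = concept
    while node in parents:
        node = parents[node]
        counts[node] += count

N = counts["entity"]  # the root subsumes every counted occurrence

def ancestors(g):
    chain = [g]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain

def resnik(g_i, g_j):
    """SR_Resnik: information content of the lowest common ancestor (Eq. 3.7)."""
    anc_j = set(ancestors(g_j))
    lca = next(a for a in ancestors(g_i) if a in anc_j)  # lowest common ancestor
    return -math.log(counts[lca] / N)

print(resnik("Turkey", "Greece"))   # IC of country: a positive score
print(resnik("Turkey", "chicken"))  # IC of object: 0 here, i.e. weakly related
```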

One important drawback of this measure is the coarseness of the scores it produces: the same relatedness value is assigned to any two concepts that share the same lowest common ancestor. Following our example, SR_Resnik(Turkey, Greece), SR_Resnik(Turkey, European country) and SR_Resnik(Turkey, Balkan country) are all equal to each other, as their lowest common ancestor is country.

In our experiments, the probability values for this and the following information-theoretic method are calculated from the Wikipedia article corpora for both Turkish and English. Resnik [100] gathers these statistics from the Brown Corpus of American English [103], a corpus of approximately one million words. The Wikipedia corpus is larger in size and, being a comprehensive encyclopedia, covers articles from many different domains.

3.1.2.5 Jiang and Conrath's Combined Approach

Following the work of Resnik [100], Jiang and Conrath [104] criticize it for ignoring the link structure of WordNet, using it only to determine the lowest common ancestor of two concepts. Instead, they propose to use the information-theoretic scores as edge weights between the nodes, and formulate semantic relatedness as the length of the shortest path connecting the concepts.

In Resnik's semantic relatedness score, the information content is used to assign a weight to the common ancestor node. In contrast, Jiang and Conrath assign the weights to the edges. Naturally, this weight should consider the information content of the concepts at both ends of the edge. Instead of using p(gi), the conditional probability p(gi|par(gi)) is used, where par(gi) is the parent of gi. In the same manner as Resnik's method, the occurrence counts are gathered from a general text corpus, and the conditional probability is calculated using Equation 3.8.

p(g_i \mid par(g_i)) = \frac{p(g_i \wedge par(g_i))}{p(par(g_i))} = \frac{p(g_i)}{p(par(g_i))}    (3.8)

Note that, since every occurrence of gi is also counted as an occurrence of its parent, the probability of gi and par(gi) occurring together is equal to the probability of gi occurring. The edge weight is the information content of this conditional probability, calculated by Equation 3.9.

wt(gi, par(gi)) = −log(p(gi|par(gi))) = log(p(par(gi))) − log(p(gi)) (3.9)

Jiang and Conrath [104] integrate edge density and depth scaling into the weight through Equation 3.10, where E(par(gi)) is the number of edges leaving the parent node and Ē is the average number of edges per node. The parameters α and β control the contributions of depth scaling and density scaling.

weight(g_i, par(g_i)) = \left( \beta + (1-\beta)\frac{\bar{E}}{E(par(g_i))} \right) \left( \frac{dp(par(g_i)) + 1}{dp(par(g_i))} \right)^{\alpha} wt(g_i, par(g_i))    (3.10)
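A minimal sketch of the edge weight in Equations 3.9 and 3.10; `p`, `depth` and `out_degree` are hypothetical callables returning the corpus probability, depth and child-edge count of a concept, and `avg_degree`, `alpha` and `beta` are the corresponding parameters.

```python
import math

def jc_edge_weight(g, parent, p, depth, out_degree, avg_degree, alpha=1.0, beta=0.5):
    """Weight of the edge between a concept and its parent (Eq. 3.9 combined with Eq. 3.10)."""
    ic_link = math.log(p(parent)) - math.log(p(g))                   # wt(g, par(g)), Eq. 3.9
    density = beta + (1.0 - beta) * avg_degree / out_degree(parent)
    depth_scale = ((depth(parent) + 1.0) / depth(parent)) ** alpha   # assumes depth(parent) > 0
    return density * depth_scale * ic_link
```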

3.1.2.6 Efficiency of WordNet Based Measures

Different semantic relatedness measures using WordNet are mostly derived from the paths defined in the sense graph. Table 3.4 lists the SR functions and their computational complexity. The complexity of each algorithm is dominated by the graph search used to find the shortest path between nodes. The measures that use modified edge weights instead of uniform-cost edges require Dijkstra's algorithm to find the shortest path, which is computationally more expensive than the linear-time algorithms using BFS.

Method | Graph Algorithm | Complexity
Sussna's Depth Relative Scaling | Dijkstra | O((|E| + |V|) log|V|)
Wu & Palmer Similarity | BFS | O(|E| + |V|)
Leacock & Chodorow's Normalized Path Length | BFS | O(|E| + |V|)
Jiang & Conrath's Similarity | Dijkstra | O((|E| + |V|) log|V|)
Resnik's Semantic Similarity | BFS | O(|E| + |V|)

Table 3.4: Comparison of the complexity of WordNet based SR methods

In practice, especially for English, the cost of Dijkstra's algorithm is prohibitive for applications that require thousands or even millions of similarity calculations; in high-level tasks such as topic segmentation it increases the running time of the algorithm significantly. For this reason, it may be necessary to execute Dijkstra's algorithm only once per word in the text and obtain the shortest paths to all other nodes from that single execution.

3.1.3 Corpora Based Semantic Relatedness Measures

Corpora based semantic relatedness measures depend on the distributional hypothesis, which dates back to the 1950s [24, 25]. The distributional hypothesis states that words occurring in similar contexts are semantically related. While this is intuitive and can be observed as lexical cohesion in everyday discourse, the tools and methods for exploiting this idea are still an active research area.

The main motivation behind corpora based semantic relatedness methods is the lexical cohesion phenomenon: in different contexts, semantically related terms are expected to appear together. If co-occurrence probabilities are calculated from a large corpus, the term pairs consistently appearing together, or with the same set of words, are expected to be semantically related. As with any stochastic method, better approximations of the real data distribution require a sufficiently large number of observations.

Vector space models (VSM) are used in information retrieval, since they are both fast and can accommodate a large number of observations. A VSM represents each context as a point in a |V|-dimensional hyperspace. When the context is defined as documents, document-term vectors are used. An important point is that each document-term vector is sparse, as usually only a small portion of V appears in a document, and sparsity increases significantly with |V|. Data structures for sparse data enable VSMs to handle large data sets efficiently. This is an essential attribute for co-occurrence analysis, since an accurate model of semantic relatedness must be based on a large corpus.

When a VSM is used, the words are viewed as points in a hyperspace, where the locations of the points are determined by the usage statistics gathered from a raw text corpus. If the distributional hypothesis is valid, then in this hyperspace words with similar meanings should be located near each other. The term semantic space is used to refer to such a hyperspace. The following sections present a technique for building a semantic space from a raw text corpus.

3.1.4 Building the Vocabulary

In WordNet based methods, the words are manually defined in the thesaurus by linguists. In automated methods, however, the words that will form the model are not known prior to the analysis. Naturally, the first step is to identify the words that will define the hyperspace used in further analysis.

Words in a dictionary may appear in discourse in their derived forms. Inflectional affixes are grammatical constructs used to mark the singularity or plurality of the concept, or to link the concept to the other components of the sentence. Even though it is possible to consider different derived forms as separate words, this results in a very sparse matrix and a large vocabulary.

Derivational affixes, on the other hand, are used to derive new meanings from a word or to change its part of speech in a context (e.g. transforming a noun into an adjective). In Section 2.2.1, common affixes are defined for both Turkish and English. Determining the correct base form of a word involves ambiguity, which must be resolved by considering the grammatical construction of the sentence. Morphological disambiguation is a task studied in NLP, and it incurs a processing cost for each sentence. Even though morphological disambiguation can improve the effectiveness of semantic relatedness, it is hard to scale its computational cost to very large corpora; in other words, there is a trade-off between morphological accuracy and corpus size. Furthermore, morphological disambiguation tools for Turkish are not publicly available and would require further research or implementation effort. For these reasons, we have chosen to experiment with larger corpora and simpler dictionary construction strategies.

For the English language, two different dictionary construction strategies are evaluated. The first method does not fold in any words, and considers every word form appearing in the corpus as a different word. The second method uses the Porter stemming algorithm [34], which strips common inflectional suffixes from each word occurring in the corpus. With this stemming strategy it is possible to map a word to a wrong base form; however, the strategy can still be beneficial, as the majority of words are mapped to their correct base forms.

For the Turkish language, three different stemming strategies are explored. The first does not fold in any words, and considers every word form appearing in the corpus as a different word. The second uses a suffix stripping method [105], removing common inflectional suffixes from each word. The third uses the lexicon-based morphological analyser Zemberek1 and finds the most commonly used base word, stripping all derivational and inflectional suffixes.

In information retrieval, some words are removed from the dictionary because they are function words (e.g. determiners or pronouns) and do not provide useful information in the analysis. This applies to semantic relatedness methods as well: a stop-word co-occurs with many different words and may add noise to the observed data. Even if stop-words do not have a negative impact on the effectiveness of the system, removing them improves efficiency, as the number of dimensions decreases and, more importantly, the sparsity of the matrix increases. Stop-words are common, mostly grammatical words that can co-occur with almost any word in the vocabulary; in English, roughly one in three words is a stop-word on average. In order to clearly explore the effect of removing stop-words, semantic spaces built both with and without stop-word removal are considered.

1Available at http://code.google.com/p/zemberek

Even though Wikipedia articles are checked and edited by volunteers, they contain many typos, spelling errors, uncommon proper nouns and uncommon tokens such as contributors' nicknames. As a result, the number of unique words in the Wikipedia corpus runs into the millions, and the majority of these occur only once or a few times. Furthermore, it is not possible to observe a pertinent association for such rarely occurring words, so they are not useful in the analysis. Removing the least frequent words from the dictionary eliminates words that are mostly typos and reduces the vocabulary size |V| by a few orders of magnitude. While this removal procedure enables us to experiment with larger corpora, the experiments reveal that it does not degrade the effectiveness of the system.
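A rough sketch of this dictionary construction step: tokenize the corpus, optionally stem, drop stop-words and prune words occurring fewer than ψ times. The tokenization pattern, stop-word list and `stem` callable are illustrative placeholders rather than the exact components used in the experiments.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}  # illustrative subset

def build_vocabulary(documents, stem=None, min_freq=20):
    """Return the words kept after stemming, stop-word removal and frequency pruning."""
    counts = Counter()
    for text in documents:
        for token in re.findall(r"[a-zçğıöşü]+", text.lower()):
            if token in STOP_WORDS:
                continue
            counts[stem(token) if stem else token] += 1
    return {word for word, freq in counts.items() if freq >= min_freq}

# Toy usage: no stemming, prune words seen fewer than 2 times.
docs = ["the lord of the ring", "a ring is a good gift", "the lord is good"]
print(build_vocabulary(docs, min_freq=2))  # {'lord', 'ring', 'good'}
```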

3.1.5 Building the Co-occurrence Matrix

Document-term matrices are used in search engines to retrieve documents: the similarity between the document vectors (rows of the matrix) and the vector of an issued query is calculated to return the most similar documents. For semantic relatedness, the column vectors of the same matrix can be used, where the semantic relatedness between wi and wj is the similarity between their vectors. However, a document can be of varying length, containing hundreds to thousands of words and possibly sections covering different subjects, which introduces noise into the relatedness estimates. Landauer [8] proposes to overcome this problem by truncating each document, retaining only the first 2,000 characters (on average 151 words). While this reduces the noise introduced by the rest of the document, it does not fully utilize the corpus. Smaller contexts such as sentences or paragraphs can be used instead of documents, but then the number of dimensions increases, as do the computational demands of the algorithms.

An alternative method proposed by Burgess and Lund [38] uses term-by-term matrices instead of term-by-context matrices. Semantic relatedness models are built using a co-occurrence matrix C, formed of raw co-occurrence counts for word pairs; the number of co-occurrences of wi and wj is denoted by Cij. The term-by-term matrix C is symmetric and of |V| × |V| dimensions. The number of dimensions can be reduced by removing infrequent words from the vocabulary. The context used by Burgess and Lund [38] is based on a sliding window passed over the corpus, and Cij is incremented whenever wi and wj appear in the same window. While sentences or paragraphs can also be used as the context, the variation in their lengths can create additional noise in the observations.

Figure 3.3: Filtered word stream and sliding window centered on the word ring

Given the vocabulary V, each text document is processed as a filtered word stream, as shown in Figure 3.3, which uses a rectangular window of size 1. The filtered word stream first tokenizes the text into words, removing all punctuation and other symbols. The words are converted to lower-case and stemmed with one of the methods described in Section 3.1.4, and only non-stop-words are included in the stream; the others are filtered out. The window operates on this filtered stream. As in the figure, lord and good occur in the sliding window centred on ring, so the co-occurrence counts for ring-lord and ring-good are incremented. The window then slides to the next word, and the co-occurrences of each word in the stream are accumulated.
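A minimal sketch of the symmetric sliding-window counting described above, assuming an already filtered token stream; the window size and the toy stream are illustrative.

```python
from collections import defaultdict

def cooccurrence_counts(stream, window=1):
    """Count co-occurrences of word pairs within `window` positions of each centre word."""
    C = defaultdict(int)
    for i, centre in enumerate(stream):
        for j in range(max(0, i - window), min(len(stream), i + window + 1)):
            if i != j:
                C[(centre, stream[j])] += 1
    return C

# Filtered stream corresponding to Figure 3.3 (stop-words already removed).
stream = ["lord", "ring", "good"]
C = cooccurrence_counts(stream, window=1)
print(C[("ring", "lord")], C[("ring", "good")])  # 1 1
```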

The sliding window technique is flexible, and several variations can be explored. The window can be symmetric, including words appearing in both the left and right contexts, or asymmetric, including only the left or only the right context. Another variation uses a decaying factor, where the co-occurrence strength decreases as the distance between the centre word and the neighbouring word increases. Since no significant improvement was achieved in our experiments, only the results of the symmetric sliding window are reported.

The size of the sliding window, denoted by W, is a factor in the effectiveness of the system that should be investigated. Different values of W are explored in the experiments; as W increases, the density of the co-occurrence matrix increases, and so do the memory and computation demands of the algorithm.

3.1.5.1 Weighting the Co-occurrence Matrix

Given the raw co-occurrence matrix, it is possible to quantify semantic relatedness between words using vector similarity functions. However, raw co-occurrence counts are misleading: the deviation in their values is high, and they are biased towards more common words, which have a higher probability of occurring with any other word. In order to avoid this problem, the co-occurrence counts should be normalized by the total number of occurrences of both co-occurring words.

Following this motivation, the C matrix is converted into the weighted co-occurrence matrix A, which is symmetric and of the same dimensions as C. Since the weighting function is used to determine the association level between two words, statistical tests evaluating association strength are used extensively in the literature. The probability p(wi, wj) is tested against a null hypothesis that considers p(wi) and p(wj) to be independent. Where N is the total number of words in the corpus, the probabilities are calculated by the following equations.

p(w_i, w_j) = \frac{C_{ij}}{N}    (3.11)

p(w_i) = \frac{\sum_j C_{ij}}{N}    (3.12)

One such measure is Pointwise Mutual Information (PMI). It is essentially a hypothesis test, where the co-occurrence probability is evaluated against the null hypothesis that each word occurs independently. PMI is negative when the co-occurrence of the two terms can be explained by chance, i.e. by choosing the two words independently at random. If PMI is a high positive number, it points to a strong association between the two words that cannot be explained by an independence model. A variation of PMI is Positive PMI (PPMI), which retains only positive PMI scores and replaces negative values with zero. This variation is desirable in terms of computational efficiency, as it decreases the density of A. In the experiments only PMI is reported, as PPMI produces very similar results.

PMI(w_i, w_j) = \log\left(\frac{p(w_i, w_j)}{p(w_i)\,p(w_j)}\right)    (3.13)
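The sketch below applies Equations 3.11-3.13 to a small dense co-occurrence matrix with numpy and clips negative values to obtain PPMI; in the actual experiments the matrix is sparse, but the arithmetic is the same.

```python
import numpy as np

def ppmi_weight(C):
    """PPMI-weighted matrix from raw symmetric co-occurrence counts."""
    N = C.sum()
    p_ij = C / N                          # joint probabilities, Eq. 3.11
    p_i = C.sum(axis=1) / N               # marginal probabilities, Eq. 3.12
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_ij / np.outer(p_i, p_i))   # Eq. 3.13
    pmi[~np.isfinite(pmi)] = 0.0          # zero counts give -inf or nan
    return np.maximum(pmi, 0.0)           # PPMI keeps only positive associations

C = np.array([[0.0, 3.0, 1.0],
              [3.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
print(ppmi_weight(C))
```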

Another measure, first introduced by Landauer [8] for document-by-term matrices and also used by Rapp [7], is an entropy-based weighting function, in which the information content of the co-occurrence count is multiplied by the entropy of the term's co-occurrence distribution, as shown in Equation 3.14.

A_{ij} = \log(1 + C_{ij}) \left( -\sum_{k} p(w_i, w_k) \log(p(w_i, w_k)) \right)    (3.14)

The weighted matrix A can be used directly as a semantic relatedness measure together with the cosine similarity function. These two weighting functions are commonly used in the literature; however, no other work compares their effectiveness in both the dimension-reduced space and the full-dimensional space.

3.1.6 Dimension Reduction

Following the work of Landauer [8], Latent Semantic Analysis (LSA) using Singular Value Decomposition (SVD) became a popular tool in information retrieval research. The most interesting property of LSA-based information retrieval is that it promises to resolve problems caused by the use of synonymous words in different documents. For example, if a document d1 uses the term data while another document d2 uses the word information, the query information cannot return d1. After applying SVD and reducing the dimensions, LSA is able to retrieve d1, as it can form higher-order associations. A second important property of LSA is its ability to remove noise from the document-term matrix, leaving the most significant document-term relationships.

The core of LSA is the SVD, a method from linear algebra that can be applied to any matrix. The SVD decomposes the original matrix A into three matrices, as shown in Equation 3.15, where r is the rank of A. The matrix U contains the row singular vectors and the matrix V contains the column singular vectors; both are orthonormal matrices, whose vectors are orthogonal to each other. The Σ matrix is a diagonal matrix containing the singular values on its diagonal, and most SVD tools return the singular values in decreasing order. Each singular value indicates the degree of variance along its associated singular vectors. Multiplying these matrices yields the original matrix A without any loss of information.

A_{m \times n} = U_{m \times r} \Sigma_{r \times r} V^{T}_{r \times n}    (3.15)

There is a relationship between the SVD and the eigenvalues and eigenvectors of the matrix AA^T: the eigenvalues of AA^T are the squares of the singular values of A. Through this relationship it is possible to use the SVD in different problems [106]. Given the singular values and vectors, the original matrix can be projected into a lower-dimensional space. Retaining only the top k singular values and vectors, and truncating the matrices accordingly, reduces the number of dimensions from |V| to k. This reduction technique is referred to as SVD truncation. The reduction produces the projection that minimizes the Frobenius-norm distance between the full-dimensional matrix A and the k-dimensional matrix A'_k. In other words, the SVD finds the best approximation of A in a k-dimensional space in terms of ||A - A'_k||, where ||...|| denotes the Frobenius norm. This property hints at the SVD's ability to remove noise, as it retains the most pertinent information in the matrix.

For semantic relatedness, the SVD is applied to the weighted co-occurrence matrix, which is then truncated to k dimensions. The value of k is a parameter of the system, and no automated method exists to determine its optimal value. In the experiments, truncation values between 300 and 3200 are explored. Note that, since the resulting matrix is dense, larger dimensions are not feasible because of the large space requirements of the matrices.
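The sketch below performs SVD truncation on a small dense matrix with numpy, purely to illustrate the arithmetic of keeping the top k singular values; the thesis experiments use SVDLIBC on large sparse matrices instead.

```python
import numpy as np

def svd_truncate(A, k):
    """Project the rows of a weighted co-occurrence matrix onto the top-k singular dimensions."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # singular values in decreasing order
    return U[:, :k] * s[:k]                           # row i is the k-dimensional vector of word i

rng = np.random.default_rng(0)
A = rng.random((6, 6))
A = (A + A.T) / 2          # symmetric, like a weighted co-occurrence matrix
A_k = svd_truncate(A, k=2)
print(A_k.shape)           # (6, 2)
```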

While there are no conclusive findings explaining why the SVD is able to establish semantic relationships between words when applied to a term-by-term or term-by-document matrix, it is possible to hypothesize that representing the word vectors in a lower dimension creates a folding-in effect. For example, consider two n-dimensional vectors v1 and v2 that share some non-zero dimensions but are not identical; the uncommon dimensions contribute to their dissimilarity. When n is large, consider the minimal hypercube containing all the term vectors. In the n-dimensional space this hypercube is sparse, containing large regions without any vectors, as there are many word combinations not observed in the corpus. However, when these vectors are projected into a lower-dimensional space while retaining the most significant information in the original vectors, these empty regions shrink and it is possible to observe stronger relationships between the words. Note that with this hypothesis I am arguing that the ability to establish latent meaning is a consequence of noise removal combined with a reduction in the sparsity of the matrix. Related to this hypothesis, in the projected k-dimensional space it is possible to observe higher-order associations between words using cosine similarity: for two words wi and wj that never co-occur with each other, i.e. Aij = 0, co-occurring with common words may lead their vectors to be projected to similar k-dimensional vectors, establishing a direct relationship from higher-order associations. While these hypotheses attempt to explain the behaviour of the SVD when applied to term vectors, it is also possible to simply view it as a powerful noise-removal technique.

Efficient computation of the SVD is a research area in itself. In our implementation SVDLIBC2 is used, as it supports sparse matrices and is an efficient and accurate implementation. It uses the Lanczos method, an iterative algorithm for calculating the eigenvalues and eigenvectors of the matrix AA^T. It should be noted that the numerical stability of the SVD calculations can affect the results, especially for higher |V| and k values. While the Lanczos method is known to be one of the most numerically stable methods for calculating the SVD [107], other SVD algorithms promise better running times and can be computed incrementally and by distributed algorithms [108, 107]. The relationship between different SVD computation methods and semantic relatedness should be investigated; however, within the scope of this dissertation, I have chosen to neglect the effects of numerical inaccuracies in the SVD calculations.

2Available from http://tedlab.mit.edu/~dr/SVDLIBC

3.1.7 Similarity of Term Vectors

Given the full-dimensional matrix A or the truncated SVD matrix A'_k, semantic relatedness is formulated as the similarity between the term vectors of two words wi and wj. Different similarity measures used in IR and NLP have been investigated in prior research [39, 40]. However, both in the experiments performed for this dissertation and in previous research, cosine similarity is the most effective similarity function.

Cosine similarity, which calculates the cosine of the angle between two vectors, is widely used in information retrieval. It exploits the geometric relationship between two vectors and is essentially the dot product of the unit word vectors; in order to convert word vectors to unit vectors, they are normalized by their Euclidean norms. If the angle between two word vectors is 0, its cosine is 1, indicating high similarity; if the two vectors are orthogonal, the cosine is 0, indicating dissimilarity between the words. The cosine similarity for a weighted matrix X is given in Equation 3.16, where m is the number of columns of X. For the full-dimensional matrix, X = A and m = |V|; when the SVD-truncated matrix is used, X = A'_k and m = k.

cos(w_i, w_j) = \frac{\sum_{t=1}^{m} X_{it} X_{jt}}{\sqrt{\sum_{t=1}^{m} X_{it}^2} \sqrt{\sum_{t=1}^{m} X_{jt}^2}}    (3.16)
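A direct sketch of Equation 3.16 over the rows of a weighted (or truncated) matrix; the toy matrix and row indices are illustrative.

```python
import numpy as np

def cosine(X, i, j):
    """Cosine similarity between the term vectors of words i and j (Eq. 3.16)."""
    v_i, v_j = X[i], X[j]
    return float(v_i @ v_j / (np.linalg.norm(v_i) * np.linalg.norm(v_j)))

X = np.array([[1.0, 2.0, 0.0],
              [2.0, 4.0, 0.0],
              [0.0, 0.0, 3.0]])
print(cosine(X, 0, 1))  # 1.0: parallel vectors
print(cosine(X, 0, 2))  # 0.0: orthogonal vectors
```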

3.1.8 Raw Text Corpora

In the experiments, Wikipedia articles are used for building the semantic space, as Wikipedia is a general-domain corpus covering different subjects and is available separately for both Turkish and English. The database dumps of the articles are retrieved from the Wikimedia Foundation3. From each corpus, articles with fewer than 200 words are removed, as these are usually redirection or disambiguation pages without content. Table 3.5 compares the Turkish and English corpora with some statistics.

3Retrieved from http://dumps.wikimedia.org/. For English the dump of 01/06/2012 is used; for Turkish the dump of 24/05/2012 is used.

Property | English | Turkish
Number of Articles | 4,722,440 | 361,468
Number of Words | 2,159,633,497 | 87,248,254
Average Article Length | 457.312 | 241.372
Number of Unique Words | 10,597,320 | 2,159,985

Table 3.5: Comparison of Turkish and English Wikipedia corpora

ψ | 0 | 5 | 10 | 15 | 20
No Stemming | 2,159,985 | 413,703 | 274,198 | 211,088 | 174,396
Lexicon | 1,612,708 | 250,725 | 161,685 | 122,001 | 99,502
Stemmer | 1,418,623 | 231,542 | 151,161 | 114,650 | 93,884

Table 3.6: Number of unique words in Turkish Wikipedia with different morphology analysis and filtering levels

The number of unique words is in the order of millions even when a stemming algorithm is used. In order to reduce the computational cost to a feasible level, words that appear fewer than a given number of times are removed from the co-occurrence matrix.

Table 3.6 lists the number of unique words occurring in Turkish Wikipedia for the three stemming algorithms and different values of the filter's frequency threshold ψ. Table 3.7 lists the percentage of WordNet words that are present in the corpus. Approximately 87 percent of the Turkish WordNet is covered even when words that appear fewer than 20 times are removed. Some examples of missing words that are defined in WordNet are akselerometre, agorafobi, darwinist, delge and dinazorlamak. The coverage of the raw vocabulary (no stemming applied) is lower than that of the others, indicating that some of the words in WordNet never appear in their base forms, only in inflected forms.

ψ | 5 | 10 | 15 | 20
No Stemming | 86.55 | 82.53 | 79.40 | 76.64
Lexicon | 92.66 | 90.62 | 88.78 | 87.08
Stemmer | 92.78 | 90.87 | 89.20 | 87.57

Table 3.7: Percentage of Turkish WordNet nouns covered by the Wikipedia corpus

Table 3.8 lists the number of unique words occurring in English Wikipedia for the two stemming algorithms and different values of the filter's frequency threshold ψ, and Table 3.9 lists the percentage of WordNet words that are present in the corpus. Approximately 64.84 percent of the English WordNet is covered when words that appear fewer than 200 times are removed, which is low compared to the Turkish coverage. The majority of the filtered words are archaic words, proper names (such as the names of people or cities), uncommon derivations, or domain-specific words that are not common in general-domain documents. For example, abampere, abarticulation, abdominocentesis, abstractedness and unneighborliness are absent from the corpus but are defined in WordNet. Although these are valid and defined words, how valuable they would be in co-occurrence analysis is questionable, as they are infrequent.

ψ | 0 | 100 | 200 | 300 | 400 | 500
No Stemming | 10,597,320 | 282,739 | 175,616 | 114,240 | 103,102 | 85,230
Stemmer | 9,861,046 | 238,348 | 144,990 | 109,203 | 89,497 | 76,544

Table 3.8: Number of unique words in English Wikipedia with different morphology analysis and filtering levels

The size of the vocabulary is important, as it determines the memory requirements and running time of the implementation. With a vocabulary of more than a hundred thousand words, the memory requirements of the co-occurrence matrix become unmanageable, especially for a high value of the co-occurrence window size W. Taking WordNet's vocabulary as a reference, the removal threshold can be chosen as 20 for Turkish and 300 for English; with these thresholds, both vocabularies contain around a hundred thousand terms and cover 87 percent of the Turkish WordNet and approximately 62 percent of the English WordNet. Different evaluations show that removing these words from the vocabulary does not degrade the effectiveness, but improves it.

ψ | 100 | 200 | 300 | 400 | 500
No Stemming | 70.30 | 62.53 | 60.92 | 57.06 | 54.32
Stemmer | 72.26 | 64.84 | 62.69 | 59.34 | 56.52

Table 3.9: Percentage of English WordNet nouns covered by Wikipedia corpora

3.2 Intrinsic Evaluation of Semantic Relatedness Measures

Evaluation of semantic relatedness is a subjective task and must be performed by comparing human judgements to the scores of the automated methods. This type of evaluation is called intrinsic, as semantic relatedness itself is compared at its most atomic level, in pairwise word similarities. For intrinsic evaluation, three different tasks are used. For the word association and WordNet synonym recall tasks, experiments are carried out for both Turkish and English; for the TOEFL synonym questions, only English semantic spaces are evaluated.

3.2.1 Word Pair Judgements

The most straightforward method for semantic relatedness evaluation is the task of assigning scores to word pairs. For this task, a set of human subjects is asked to assign each word pair a score between 0 and 4, where 4 represents closely related and 0 represents unrelated. Since this task is subjective, variance between human annotators is expected, especially for word pairs that are only vaguely related to each other (scores closer to 2).

The scores of the automated methods are compared to the average of the scores assigned by the human subjects. The comparison is performed using Pearson's correlation coefficient, calculated as in Equation 3.17, where n is the total number of pairs in the test set, SRGi is the mean of the ground-truth human scores for word pair i, \bar{SRG} is the mean of the SRGi values, SRAi is the score of the semantic relatedness function for pair i, and \bar{SRA} is the mean of the system's scores.

cor = \frac{\sum_{i=1}^{n} (SRG_i - \bar{SRG})(SRA_i - \bar{SRA})}{\sqrt{\sum_{i=1}^{n} (SRG_i - \bar{SRG})^2} \sqrt{\sum_{i=1}^{n} (SRA_i - \bar{SRA})^2}}    (3.17)

The correlation coefficient ranges from -1 to 1, where a positive score indicates a correlation between the two sets of scores. A high negative correlation coefficient indicates an inverse relationship between the two scoring functions, i.e. whenever one increases the other decreases. Naturally, the scores of an effective semantic relatedness method should have a high positive correlation coefficient with the scores assigned by the human subjects.

The word pairs are selected to cover different levels of semantic relatedness, including both highly related and unrelated word pairs. For comparison, the average correlation of the individual human subjects' scores with their average is also reported. Since Pearson correlation coefficients are not additive, the average of the correlation coefficients is calculated by first transforming each correlation value to Fisher's z value, averaging the z values, and converting the mean z value back to a correlation coefficient. In order to decide between models, a t significance test is performed on the correlation values, as described by Steiger [109, 110].
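A brief sketch of the two statistics used here, Pearson's correlation (Equation 3.17) and Fisher's z-based averaging of several correlation coefficients; the sample numbers are made up.

```python
import numpy as np

def pearson(ground_truth, system_scores):
    """Pearson correlation coefficient between two lists of scores (Eq. 3.17)."""
    g = np.asarray(ground_truth, float) - np.mean(ground_truth)
    a = np.asarray(system_scores, float) - np.mean(system_scores)
    return float((g * a).sum() / np.sqrt((g ** 2).sum() * (a ** 2).sum()))

def average_correlation(correlations):
    """Average correlations via Fisher's z transform (arctanh), then transform back."""
    return float(np.tanh(np.mean(np.arctanh(correlations))))

print(pearson([3.9, 0.3, 2.4], [0.92, 0.05, 0.58]))
print(average_correlation([0.80, 0.75, 0.78]))
```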

While the word association task is used in the literature [19, 50], it is criticized for its limited coverage of the semantic space, as the word lists are small compared to the size of the vocabulary. Since the task requires manual work by human subjects, it is difficult to build larger datasets. Another coverage problem is that such datasets contain either highly related or unrelated pairs, missing the pairs that lie in between.

A word pair list with manually assigned human judgement scores was first built by Rubenstein and Goodenough [35]. This set, consisting of 65 word pairs, is used to evaluate different algorithms in the literature [19, 50]. An excerpt of the data, with the average human-assigned scores, is given in Table 3.10. The average inter-human correlation is 0.80 [111], which can be regarded as a score representing average human performance in this task. The number of human subjects who participated in the experiment is 51.

In order to evaluate the semantic relatedness functions for Turkish, a word pair list similar to that of Rubenstein and Goodenough [35] was built. The original 65 pairs were translated into Turkish, and an additional 36 pairs were included to increase the number of pairs to 101. 74 human annotators were asked to assign scores to the word pairs. The average human correlation is 0.762, which is close to but slightly lower than the inter-human correlation for English.

TR Word Pair | EN Word Pair | TR Mean Score | EN Mean Score
araba-yolculuk | car-journey | 2.95 | 1.55
bilge-büyücü | sage-wizard | 0.89 | 2.46
birader-delikanlı | brother-lad | 2.20 | 2.41
öğlen-akşam | midday-noon | 3.00 | 3.68
kıyı-ağaçlık | shore-woodland | 1.11 | 0.90
takı-mücevher | gem-jewel | 3.79 | 3.94
sırıtma-delikanlı | grin-lad | 0.30 | 0.88
yemek-meyve | food-fruit | 2.42 | 2.69

Table 3.10: Example word pairs used in the word association task with their corresponding average human scores for both Turkish and English

In order to give an idea about the differences in semantic relatedness across the two languages, the correlation between the human judgement scores for the Turkish and English word pairs is calculated. The correlation coefficient between the two languages is 0.82, which indicates that, although the language changes, semantic relatedness is largely preserved.

3.2.2 Semantic Space Neighbors to WordNet Mapping

WordNet can be considered a human-annotated dataset, as it is built manually by linguists. Following this idea, the word pairs with high semantic relatedness scores are compared to the direct classical relationships in WordNet. The recall of WordNet synonyms, hypernyms/hyponyms, meronyms and siblings is calculated for the k-nearest neighbours of words in the semantic space. This not only evaluates the accuracy of the semantic relatedness function, but also makes it possible to investigate which relationship types influence the semantic space; for example, whether the synonyms of a word receive higher semantic relatedness scores than its direct hyponyms and meronyms.

Semantic relatedness scores obtained from a semantic space are based on statistics gathered from raw text. A semantic space built from corpus statistics can be biased towards the most common sense of a word, as it observes more examples for that sense. WordNet stores the senses of each word in order of how common they are. Using this information, it is possible to measure the percentage of the k-nearest neighbours in the semantic space that are related to the most common sense of the word in WordNet.

For a word wi ∈ V ∩ WordNet, let knn(wi, k) be the list of top k words in descending order of SR(wi, wj), where wj ∈ V ∩ WordNet and wj ≠ wi. Note that the k-nearest neighbours are found in the intersection of the vocabularies of WordNet and the semantic space, i.e. the words that are in the semantic space but not in WordNet are excluded. The top k semantically related words are calculated by a brute-force method, which simply computes the pairwise similarity scores and maintains the top-scoring k neighbours knn(wi, k) for each word.

Given the set knn(wi, k), recall with respect to the WordNet classical relationships can be calculated. Since WordNet relationships hold between senses while semantic-space relationships hold between words, the words should be mapped to their intended senses. This mapping task is called word sense disambiguation and is difficult. However, WordNet maintains two data structures, mapping words to senses and senses to words. Let Senses(wi) be the list of different senses in WordNet for the word wi, and Synset(sj) the list of words used for the sense sj. Both lists are ordered by how commonly they are used, i.e. the first word in Synset(sj) is the word most commonly used in text to express the sense sj. All synonyms registered for a word wi through its different senses can be collected using Equation 3.18. When only the first sense sj in Senses(wi) is considered, the synonyms of the most common sense of wi are retrieved, denoted by Synonyms1(wi). Given a set R of related words, such as Synonyms(wi) or Synonyms1(wi), recall can be calculated using Equation 3.19.

Synonyms(w_i) = \bigcup_{s_j \in Senses(w_i)} Synset(s_j)    (3.18)

Recall(R, k) = \frac{|knn(w_i, k) \cap R|}{|R|}    (3.19)
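The sketch below ties Equations 3.18 and 3.19 together: it finds the brute-force k-nearest neighbours of a word by cosine similarity and measures recall against a hypothetical set R of WordNet-related words; the matrix, vocabulary and relation set are all illustrative.

```python
import numpy as np

def knn(X, vocab, word, k):
    """Top-k nearest neighbours of `word` by cosine similarity over the rows of X."""
    idx = vocab.index(word)
    norms = np.linalg.norm(X, axis=1)
    sims = X @ X[idx] / (norms * norms[idx])
    order = [j for j in np.argsort(-sims) if j != idx][:k]
    return [vocab[j] for j in order]

def recall(neighbours, related):
    """Recall of WordNet-related words among the k nearest neighbours (Eq. 3.19)."""
    return len(set(neighbours) & related) / len(related)

vocab = ["car", "automobile", "wheel", "chicken"]
X = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.6, 0.5, 0.3],
              [0.0, 0.1, 1.0]])
related = {"automobile", "wheel"}   # e.g. synonyms and meronyms of "car"
print(recall(knn(X, vocab, "car", k=2), related))  # 1.0
```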

In a similar fashion, this idea can be extended to the different classical relationships defined in WordNet, namely synonymy, hyponymy, hypernymy, siblings, meronymy, holonymy and the union of all these sets. In the evaluations of the different semantic spaces, recall values with respect to these relationship types are reported. For the whole dataset, the macro average over all words is reported, calculated as the proportion of all correctly identified words to all ground-truth related words in WordNet.

In order to have a baseline algorithm providing a reasonable lower bound for this evaluation method, a completely random method selecting k words for each word in the dataset is also reported. The problem is formulated as an urn experiment, where k balls are selected without replacement from an urn filled with |V| balls, and the probability of selecting the |R| marked balls is calculated. In this formulation, R stands for the related words in WordNet and k is the number of randomly selected neighbours in the semantic space. This can be modeled by the hypergeometric distribution, whose expected value is given by Equation 3.20. Since the size of R differs for each wi, the expected value is calculated for each word separately. Again, the macro average of these random selections is calculated and reported as the baseline.

E(w_i) = \frac{k|R|}{|V|}    (3.20)
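The random baseline of Equation 3.20 is the mean of a hypergeometric distribution and reduces to a single expression; the numbers below are arbitrary.

```python
def expected_random_hits(k, num_related, vocab_size):
    """Expected number of related words among k randomly drawn neighbours (Eq. 3.20)."""
    return k * num_related / vocab_size

# e.g. 20 neighbours, 8 related words in WordNet, a vocabulary of 100,000 words
print(expected_random_hits(20, 8, 100_000))  # 0.0016
```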

3.2.3 Near Synonymy Questions

A common evaluation task in semantic relatedness research is the use of Test of English as a Foreign Language (TOEFL) questions [8, 112, 39, 41, 7]. These tests are designed to assess the vocabulary and comprehension abilities of students who learn English as a second language. Since the questions are also answered by humans, it is possible to compare the performance of automated methods with that of humans.

The test consists of 80 near-synonymy questions taken from the TOEFL exam, where a word is given along with 4 choices and the student is asked to find the most related word among these choices. Figure 3.4 shows an example question from this test. The effectiveness measure for this test is accuracy, which is simply the proportion of correct answers to the number of questions. The performance of humans who have taken this test is 64% [8].

Query Word: enormously
(a) tremendously  (b) appropriately  (c) uniquely  (d) decidedly

Figure 3.4: An example question from the TOEFL synonymy questions, where the correct answer is "tremendously"

Unfortunately, this test or a similar question set is not available for Turkish. Although there are a few similar questions in the University Entrance Exam, these are mostly analogy questions, where the words are related through similar attributes but are otherwise unrelated to each other.

3.3 Results and Model Selection

The parameters of the semantic space are an important factor in the effectiveness and efficiency of semantic relatedness measurements. There is a large body of research focusing on evaluating different semantic spaces [40, 39, 19, 41, 7]; however, there are still important questions to be addressed in the literature. Since the effectiveness of the algorithms depends on the raw text corpora used, the results obtained from independent experiments are not comparable with each other. For example, Terra and Clarke [40] carry out experiments with different weighting functions and similarity functions in a full-dimensional semantic space, but these experiments are not comparable with the results of Rapp [7], which uses an SVD truncated semantic space, as the raw text corpora used are different. In the experiments carried out in this research, different techniques are investigated and compared with each other in a controlled setting allowing pairwise comparisons between the methods. Also, to the best of our knowledge, no comparison between SVD truncated semantic space methods, full-dimensional methods and WordNet based methods exists. Our evaluation is detailed and uses three different intrinsic evaluation methods and one extrinsic evaluation method. Using this evaluation strategy, different semantic relatedness measures are compared with each other.

Word Removal ψ    Word Assoc.    WordNet Synonym Recall
5                 0.7350         0.4854
10                0.7356         0.4987
15                0.7362         0.5143
20                0.7365         0.5233
25                0.7361         0.5330
30                0.7352         0.5369
35                0.7374         0.5420
40                0.7397         0.5445

Table 3.11: Results of Word Association and WordNet Synonym Mapping for Turkish when the ψ value is varied

3.3.1 Effect of Pruning the Vocabulary

In order to reduce the vocabulary size, words occurring fewer than ψ times are pruned from the co-occurrence matrix. This section investigates the relationship between this removal strategy and the effectiveness of the algorithms in both Turkish and English. All the other parameters are fixed to the same values in order to isolate the effect of the ψ parameter.
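A minimal sketch of this ψ-based pruning step over raw token counts; the function and variable names are illustrative rather than taken from the thesis implementation.

from collections import Counter

def prune_vocabulary(tokens, psi):
    # Keep only words occurring at least `psi` times; the co-occurrence matrix
    # is then built over this reduced vocabulary.
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count >= psi}

# Example: with psi = 2, the hapax word "cohesion" is dropped.
tokens = ["lexical", "cohesion", "lexical", "chain", "chain", "lexical"]
print(prune_vocabulary(tokens, psi=2))   # {'lexical', 'chain'}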

The plot given in Figure 3.5, produced from the results in Table 3.11, shows the change in correlation values when the frequency removal threshold ψ varies from 5 to 40. The results are produced by a semantic space using a lexicon based stemming algorithm, truncated SVD with the top 400 singular values retained, and the entropy based term weighting function. A two-tailed t-test on the correlation values reveals that they are statistically indistinguishable with a confidence factor of 0.99. This shows that removing infrequent words from the vocabulary does not degrade the effectiveness of the system. In fact, the highest word association score is obtained when the ψ value is 40, but the difference is not statistically significant. However, when the vocabulary is excessively pruned, i.e. the ψ value is increased to a value as high as 500, the words in the word pairs list are no longer in the semantic space and the performance of the system naturally drops.

In the English dataset, the same experiment is performed with a system using the Porter stemmer, stop word removal, truncated SVD with the top 400 singular values retained, and the entropy based term weighting function. Table 3.12 reports the results when the ψ value is varied from 150 to 500, and Figure 3.6 plots them. Although the trend is similar to the Turkish experiments, the results obtained with ψ = 500 are significantly better than those obtained with ψ < 450.

Figure 3.5: Plot of the Word Association correlation values and WordNet Synonym Mapping recall values in Turkish when ψ is varied; panel (a) shows the Word Association values, panel (b) the WordNet recall of synonyms.

Word Removal ψ    Word Assoc.    WordNet Synonym Recall
150               0.7321         0.3271
200               0.7344         0.3356
250               0.7366         0.3421
300               0.7311         0.3632
350               0.7324         0.3697
400               0.7348         0.3728
450               0.7393         0.3756
500               0.7441         0.3791

Table 3.12: Results of Word Association and WordNet Synonym Mapping for English when the ψ value is varied

Figure 3.6: Plot of the Word Association correlation values and WordNet Synonym Mapping recall values in English when ψ is varied; panel (a) shows the Word Association values, panel (b) the WordNet recall of synonyms.

The results obtained in both languages show that the removal of low frequency words does not degrade the performance of the system; in fact, it improves the results. After the removal, the dimensions of the co-occurrence matrix decrease. With no word removal, the algorithm would have to operate on a square matrix with millions of dimensions, which is not computationally feasible; since Zipf's law describes the relationship between word frequency and the number of words, removing low frequency words greatly reduces this cost. The improvements in effectiveness observed in both languages can be attributed to two reasons. The first is noise: in the co-occurrence matrix, the neighbours of low frequency terms are relatively few and the observed patterns are based on a small set of instances. It is not possible to determine a strong recurrent pattern between a word pair if one of the words seldom occurs in the corpus; such words are of little use in the analysis and instead create noise, as the observed co-occurrences may well be the consequence of chance. The second reason is the so-called "curse of dimensionality": as the number of dimensions increases, the performance of methods analysing the data degrades. With a high number of dimensions, the sparsity of the co-occurrence matrix increases, and as sparsity increases it becomes harder to obtain reliable estimates through numerical methods. When the number of dimensions is reduced to a more manageable size, the benefits of both SVD and cosine similarity increase.

3.3.2 Effect of Dimension Reduction and Weighting Function

In the literature the term-by-term co-occurrence matrix is used in semantic relatedness measurements both with [8, 7, 41] and without dimension reduction [38, 40, 39]. However, these experiments are not comparable with each other as they use corpora with different characteristics (corpus size, domain). In our experiments, models with and without dimension reduction are compared. Furthermore, the PMI and PPMI term weighting functions used in the experiments of Terra and Clarke [40] and Bullinaria et al. [39] are compared to the entropy based term weighting function used in the experiments of Rapp [7] in models with dimension reduction. While Bullinaria et al. [41] recently compared the performance of an SVD truncated semantic space with the full-dimensional space, they did not compare PMI term weights with entropy term weights.

Semantic Space    Word Assoc.    WordNet Synonym Recall
pmi-full          0.3049         0.3560
entropy-full      0.3342         0.3717
pmi-svd           0.7250         0.4430
entropy-svd       0.7480         0.5190

Table 3.13: Comparison of different term weighting functions and dimension reduction in English

Semantic Space    Word Assoc.    WordNet Synonym Recall
pmi-full          0.5248         0.4420
entropy-full      0.4812         0.4414
pmi-svd           0.6776         0.4686
entropy-svd       0.7627         0.6459

Table 3.14: Comparison of different term weighting functions and dimension reduction in Turkish

Table 3.14 shows the correlation of the semantic space judgements with the human judgement scores and the results of WordNet mapping for the Turkish experiments. The entropy based method produces significantly better scores than PMI when dimension reduction is applied, whereas in the full-dimensional case (without any dimension reduction) the two are much closer. The difference between the entropy based method and PMI in the SVD truncated experiment is significant, as the hypothesis of equal means is rejected by a t-test with a confidence factor of 0.99; in the full-dimensional space the difference is not significant.

Table 3.13 shows the results of the same experiment for English. In the Word Association problem, both weighting functions achieve similar correlation values when the truncated SVD semantic spaces are used; however, in the WordNet mapping problem the entropy based weighting function achieves a better recall value. Again, the models in the reduced dimensional space are statistically different when a t-test is performed with a confidence factor of 0.99, but they are indistinguishable in the full-dimensional space.

When considering the results in both languages, it is observed that although PMI and entropy weights achieve competitive results in the full-dimensional semantic space, in the SVD truncated space the entropy based weighting function is significantly better. Bullinaria et al. [41] use PMI with SVD, while Rapp [7] uses an entropy based weighting function, but their results are not comparable as the corpora used are different. For document-by-term matrices in information retrieval, Dumais [113] reports a similar finding: three different weighting functions are compared and the entropy based weighting function achieves the best results in SVD truncated spaces. The results of the PMI weighting scheme in full-dimensional semantic spaces should not overshadow the benefits of entropy based weighting schemes in SVD truncated spaces.
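To make the two weighting schemes concrete, the sketch below applies PPMI and a log-entropy weight to a raw co-occurrence count matrix. The log-entropy form shown is the common LSA-style variant and the exact formulation used in the thesis experiments may differ in detail.

import numpy as np

def ppmi_weights(F):
    # Positive pointwise mutual information over joint co-occurrence probabilities.
    total = F.sum()
    p_ij = F / total
    p_i = p_ij.sum(axis=1, keepdims=True)
    p_j = p_ij.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_ij / (p_i * p_j))
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)

def log_entropy_weights(F):
    # Local log weight multiplied by a global entropy weight per row (target word).
    n_cols = F.shape[1]
    row_sums = F.sum(axis=1, keepdims=True)
    p = np.divide(F, row_sums, out=np.zeros_like(F), where=row_sums > 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    entropy = 1.0 + plogp.sum(axis=1, keepdims=True) / np.log(n_cols)
    return np.log1p(F) * entropy

F = np.array([[10.0, 0.0, 2.0], [1.0, 5.0, 0.0], [0.0, 1.0, 8.0]])
print(ppmi_weights(F))
print(log_entropy_weights(F))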

3.3.3 Effect of Co-occurrence Window Size

The window size determines the size of the co-occurrence context; as its value increases, the density of the full-dimensional co-occurrence matrix increases. The size of the context window determines the number of observations available to the algorithm. From one point of view, it could be argued that a larger window size makes it possible to observe more distant interactions (in terms of the number of words in between) between the words, providing more knowledge. From another point of view, this increases the amount of noise introduced into the system.
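The effect of the window size can be made concrete with a small co-occurrence counting sketch; a symmetric window of W words is assumed and the names are illustrative.

from collections import defaultdict

def cooccurrence_counts(tokens, window):
    # Count how often each ordered word pair co-occurs within `window` words.
    counts = defaultdict(int)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), i):
            counts[(word, tokens[j])] += 1
            counts[(tokens[j], word)] += 1
    return counts

tokens = "the topic segmentation algorithm detects topic shifts".split()
print(cooccurrence_counts(tokens, window=2)[("topic", "segmentation")])   # 1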

Table 3.15 shows the results of varying the window size in the English experiment. Although increasing the window size decreases the effectiveness of the system in both the Word Association and WordNet synonym mapping problems, there is no significant difference between the results with a confidence factor of 99% when tested with a paired t-test. However, as can be seen from Figure 3.7, there is a decline in effectiveness as the window size increases. This can be attributed to the noise introduced into the model as the observed context gets larger.

Context Window Size    Word Assoc.    WordNet Synonym Recall
1                      0.7311         0.3632
2                      0.7480         0.3479
3                      0.7456         0.3328
4                      0.7297         0.3222
5                      0.7226         0.3150
6                      0.7131         0.3086
7                      0.7040         0.3030
8                      0.7080         0.2983

Table 3.15: Results of Word Association and WordNet Synonym Mapping for English when the co-occurrence window size W is varied

Figure 3.7: Plot of the Word Association correlation values and WordNet Synonym Mapping recall values for English when the window size is varied; panel (a) shows the Word Association values, panel (b) the WordNet recall of synonyms.

Table 3.16 shows the results of varying the window size in the Turkish experiment. When the window size increases, the effectiveness degrades, as can be observed from Figure 3.8. In terms of window size the results are similar in the English and Turkish experiments. The slight difference between the optimal co-occurrence window sizes can be attributed to the stop word rates of the languages: while in English grammatical stop words are used frequently, in Turkish the same constructs are conveyed through affixes. With more stop words removed, the effective window size increases in English; for this reason the optimal co-occurrence window in English (3) is smaller than in Turkish (4).

Context Window Size    Word Assoc.    WordNet Synonym Recall
1                      0.7366         0.5233
2                      0.7424         0.5145
3                      0.7543         0.4920
4                      0.7627         0.4710
5                      0.7609         0.4574
6                      0.7435         0.4447
7                      0.7476         0.4331
8                      0.7511         0.4290

Table 3.16: Results of Word Association and WordNet Synonym Mapping for Turkish when the co-occurrence window size W is varied

Figure 3.8: Plot of the Word Association correlation values and WordNet Synonym Mapping recall values for Turkish when the window size is varied; panel (a) shows the Word Association values, panel (b) the WordNet recall of synonyms.

3.3.4 Effect of the Number of Dimensions Retained

In SVD based dimension reduction, the target dimension size is an important parameter. Since the matrices resulting from SVD are dense, the upper bound for this value is constrained by the space and running time complexity of the algorithm. For example, with a vocabulary of 100K words, a dimension reduction to 10K dimensions results in a matrix of 8 GB, which will not fit into the memory of commodity hardware. While there is no theoretical method for determining the number of dimensions of the latent semantic space, in the literature a number between 200 and 800 usually achieves the best results [8].

SVD Reduction Factor    Word Assoc.    WordNet Synonym Recall
200                     0.7138         0.3018
300                     0.7382         0.3290
400                     0.7480         0.3479
500                     0.7463         0.3604
600                     0.7522         0.3707
700                     0.7657         0.3787
800                     0.7563         0.3859
1600                    0.7494         0.4114
3200                    0.7251         0.4164

Table 3.17: Results of the Word Association experiment as correlation coefficients and WordNet synonym mapping for English under the variation of the SVD reduction factor

Table 3.17 shows the effectiveness of the semantic space in the Word Association and WordNet synonym mapping problems for English. The effectiveness in the Word Association problem is highest when the reduction factor is 700 and drops after this value. In WordNet synonym mapping, however, this is not the case and the recall of synonyms keeps increasing even at 3200 dimensions. In terms of statistical significance, all of the models are statistically indistinguishable when tested with a t-test using a confidence factor of 0.99.
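As an illustration of the truncation step and the cosine-based relatedness computed afterwards, the sketch below uses scipy's sparse SVD routine; the random matrix is only a stand-in for a weighted co-occurrence matrix, and the library choice and names are assumptions.

import numpy as np
from scipy.sparse.linalg import svds

def truncate(weighted_matrix, dims):
    # Project the weighted co-occurrence matrix onto `dims` latent dimensions;
    # word i is represented by row i of U scaled by the singular values.
    U, S, _Vt = svds(weighted_matrix, k=dims)
    return U * S

def relatedness(X, i, j):
    # Cosine similarity between words i and j in the reduced space.
    a, b = X[i], X[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

rng = np.random.default_rng(0)
W = rng.random((50, 40))        # stand-in for a weighted co-occurrence matrix
X = truncate(W, dims=10)
print(relatedness(X, 0, 1))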

Table 3.18 shows the effectiveness of the semantic space in the Word Association and WordNet synonym mapping problems for Turkish. The effectiveness in the Word Association problem is highest when the reduction factor is 400 and drops after this value. In WordNet synonym mapping, however, this is not the case and the recall of synonyms increases up to a reduction factor of 800. In terms of statistical significance, the results for 1600 and 3200 dimensions are significantly lower with a confidence factor of 0.99.

A discrepancy between the Turkish and English experiments is observed: the saturation point of the reduction factor is lower for Turkish than for English. This is due to the difference in corpus size; since the English Wikipedia is larger than the Turkish one, the signal-to-noise ratio of the co-occurrence matrix is higher.

SVD Reduction Factor    Word Assoc.    WordNet Synonym Recall
200                     0.7320         0.4215
300                     0.7585         0.4497
400                     0.7627         0.4710
500                     0.7559         0.4822
600                     0.7492         0.4876
700                     0.7482         0.4914
800                     0.7440         0.4967
1600                    0.6924         0.4747
3200                    0.6274         0.4159

Table 3.18: Results of the Word Association experiment as correlation coefficients and WordNet synonym mapping for Turkish under the variation of the SVD reduction factor

3.3.5 Comparison of Semantic Space and WordNet based Measures

Since WordNet based measures are extracted from WordNet itself, it is not reasonable to test them using the WordNet synonym mapping test, as they would achieve almost perfect scores. However, it is possible to make a comparison through the Word Association and TOEFL near synonymy problems. A further comparison will be presented through extrinsic evaluation with the topic segmentation problem.

Table 3.19 compares the performance of the WordNet based methods in the English Word Association problem. The results of the semantic space are calculated using a word removal threshold of 500, stop-word removal, the Porter stemmer, SVD truncated to 1600 dimensions, a co-occurrence window size of 1 and the entropy weighting scheme.

In the Word Association problem, the WordNet based methods achieve significantly higher results with a confidence factor of 0.99. However, in the TOEFL synonymy questions, the semantic space achieves significantly better results. With these results it is not possible to confidently decide which method is more effective. While the TOEFL synonym questions are becoming a standard test for semantic relatedness evaluation, I criticize their use, as the set contains only 80 questions formed mostly of adjectives and verbs. The classical relationships in WordNet are few in number for adjectives and verbs, thus it is natural for WordNet based methods to achieve worse scores on a dataset dominated by adjectives and verbs. Solely from these experiments it is not possible to conclude whether WordNet based or semantic space methods are better in semantic relatedness. In the next chapter, another comparison in the topic segmentation task will shed more light on this question.

Figure 3.9: The effect of the SVD reduction factor in Turkish; panel (a) shows the Word Association values, panel (b) the WordNet recall of synonyms.

Figure 3.10: The effect of the SVD reduction factor in English; panel (a) shows the Word Association values, panel (b) the WordNet recall of synonyms.

Method                      Word Assoc.    TOEFL Synonymy
Semantic Space              0.7791         0.9367
Wu&Palmer                   0.8266         0.3871
Resnik Semantic Distance    0.8416         0.3226
Leacock&Chodorow            0.8578         0.3871

Table 3.19: Comparison of WordNet based methods with the Semantic Space model in English

In Turkish, no questions similar to the TOEFL synonym questions exist. In the Word Association task the results are as low as 0.40. This is mainly because the Turkish WordNet is not complete and its sense graph contains disconnected components. For a given word pair a path may not exist between the two words, as they may lie in different components, and thus semantic relatedness cannot be calculated. For such cases we have modified the algorithms to return a low semantic relatedness score by default. These results show that with an incomplete WordNet such as the Turkish one, it is not possible to use the commonly used semantic relatedness functions.

3.3.6 Comparison to the State-of-the-art Methods

TOEFL synonym questions are used in the literature for comparing different semantic relatedness measures. Table 3.20 shows the performance of state-of-the-art algorithms. The best performing previous algorithm, Rapp [7], actually uses the same method but a different corpus: Rapp uses the BNC [114] corpus of 100 million words, which is smaller than the Wikipedia corpus. One might expect significantly better results than Rapp's method since a larger corpus is used; however, it should be noted that the quality of Wikipedia is lower than that of the BNC, as there are artificial repetitions and patterns common to Wikipedia. For example, "external links" appears in most of the articles, and InfoBoxes summarizing different kinds of information are repeated in many articles. One word in the TOEFL questions is "figure", which is repeated in many articles.

3.3.7 Semantic Spaces and Classical Relationship Types

Using the classical relationships in WordNet it is possible to further investigate the kinds of relationships captured in the semantic spaces. Figure 3.11 shows the average recall values for each classical relationship for different k values, where k is the number of neighbours considered in the semantic space. The plot resembles the shape of the logarithm function: as k becomes larger, the increase in the recall value diminishes. An elbow forms in the plot where the recall value becomes saturated, indicating that words classically related to each other are usually the nearest neighbours in the semantic space. As expected, the stronger relationship, Synonymy, is more concentrated on the most similar neighbours when compared to weaker relationships like Hyponymy, Meronymy and Siblings. The Siblings are more scattered in the semantic space, as the growth of the plot is closer to linear.

Method                       TOEFL Accuracy %    Based on
WikipediaSVD                 93.67               Corpus (SVD)
Rapp                         92.50               Corpus (SVD)
Matveeva et al.              86.25               Corpus (Full-Dimensional)
Terra and Clarke             81.25               Corpus (Full-Dimensional)
Jarmasz and Szpakowicz       78.75               Roget's Thesaurus
Bullinaria et al.            85.00               Corpus (Full-Dimensional)
Average Human Performance    64.50               Average non-English US college applicant

Table 3.20: Comparison of the Wikipedia SVD truncated semantic space to state-of-the-art algorithms

Figure 3.12 compares the synonym coverage of the nearest neighbours with that of random selection. The coverage of random selection grows linearly, and when 500 words are selected at random only up to 6 percent of the synonyms are covered; the 500 nearest semantic space neighbours, on the other hand, contain about 60 percent of the synonyms.

On average, when the nearest neighbour of a word wi in the semantic space is selected, it is directly connected to wi in WordNet about 47 percent of the time. The complete distribution is shown in Figure 3.13. Along with synonyms, hypernym/hyponym siblings are the most dominant classical relationship identified by the semantic space; this is natural, however, as the number of such siblings is relatively large compared to the other classical relationships. The relations of about 53 percent of the nearest neighbours are unknown to WordNet. These nearest neighbours can be collocations (common phrases), words related by multi-node paths in WordNet, or words related through non-classical relationships, as in the example of "cop" and "doughnut".

Figure 3.11: Recall of WordNet relationships versus the k-nearest neighbours in the semantic space

WordNet stores the words and their associated senses in decreasing order of commonality; the most common sense used in discourse for a word is stored as its first sense. Using this information, it is possible to investigate whether the semantic space is biased towards the most common sense. As expected, since the most common sense is observed more often in the data, the semantic space is biased towards this sense. For the nearest neighbours, about 86.29 percent of the relationships found in WordNet involve the most common sense of the word in the Turkish experiment; in the English experiment this is slightly lower, at 78.08 percent. The experiments in both languages confirm that there is a bias towards the most common sense in the semantic space.

3.4 Discussion of the Results

This chapter investigates different aspects of the semantic relatedness functions built for Turkish and English using intrinsic evaluation methods. First, it is evident that for a language other than English, which has only an incomplete thesaurus, the effectiveness of WordNet based semantic relatedness methods is low when compared to the performance of semantic spaces. In the English experiments, conflicting results are obtained: in Word Association the WordNet based methods perform better, but in the TOEFL synonymy questions the semantic spaces are significantly more effective. This will be investigated further in the next chapter through the topic segmentation task.

Figure 3.12: Comparison of random selection with semantic space neighbours

In the literature on SVD based semantic spaces there is a shift towards the use of PMI as an edge weighting scheme. However, while PMI achieves competitive results in the full-dimensional space, its performance is below that of the entropy based weighting scheme in the SVD truncated space. Moreover, the SVD truncated semantic space significantly outperforms the full-dimensional semantic space. Since the reduced semantic space is more efficient in terms of memory and computational complexity, SVD based dimension reduction is more desirable in practical applications of semantic relatedness.

Common evaluation tasks for semantic spaces such as Word Association and TOEFL synonymy questions are not robust, as they only consider a limited portion of the vocabulary. The results of the WordNet based evaluation tend to be more robust, as it is possible to observe a stable trend when a parameter's value is varied. Similar evaluation techniques that involve a larger number of semantic relatedness calculations should be preferred when comparing different methods.

Figure 3.13: Distribution of WordNet relationships for the nearest neighbours in the semantic space

The experiments carried out in both Turkish and English show that there is no major discrepancy between the results attributable to the structure of the languages. Overall, using stemming, stop word removal, a small co-occurrence window size, a large SVD reduction factor and the entropy weighting scheme achieves the best results.

Chapter 4

Topic Segmentation

In the topic segmentation problem, the flow of a document is modelled and thematic changes in the document are detected. For example, when run on the text of this chapter, an ideal topic segmentation algorithm is expected to return the sections of the chapter. Topic segmentation is performed in order to evaluate the semantic relatedness measures and to provide an input for the summarization algorithm presented in the next chapter.

While there is a large body of research on topic segmentation, to the best of our knowledge no other algorithm uses semantic relatedness for this task. The semantic relatedness function is used to measure the density of lexical cohesion in consecutive sentence blocks. In our algorithms, the semantic relatedness of a context to its left and right contexts is measured; if the lexical cohesion is focused heavily in a single direction, a topic boundary is assumed to lie in the opposite direction. Using this intuition, two different algorithms are formulated and described. The semantic relatedness function can be any semantic similarity function, as described in Chapter 3.

The next section formally introduces the topic segmentation algorithms developed, followed by a description of the datasets used in the experiments and the results. Different semantic relatedness functions described in Chapter 3 are evaluated, and the segmentation performance of the system built in this research is compared with the state-of-the-art algorithms in the literature.

4.1 Semantic Relatedness Based Topic Segmentation Algorithm

A topic segmentation algorithm takes a document of m text blocks (sentences or paragraphs) and outputs t intervals that span the whole text. The number of topics t is either determined from the document or given as a parameter. In order to avoid confusion, the text blocks will hereinafter be referred to as sentences; however, when paragraph boundaries are available, paragraphs can be used instead of sentences.

In algorithms that process the document with a sliding window, such as TextTiling [54], Choi [56] and Brants et al. [62], the level of lexical cohesion between consecutive text blocks, namely the left and right contexts, is measured. A disruption in the level of lexical cohesion indicates a topical change. Instead, we propose to process the text as a sequence of three different contexts, adding a middle context to the commonly used left and right contexts. Each context is defined as a set of words, and there are semantic relationships between the members of neighbouring sets; these relationships are defined by a semantic relatedness function SR(wi, wj) between the words wi and wj. Both the WordNet and semantic space based methods defined in Chapter 3 can be utilized. Figure 4.1 depicts this idea of using three contexts. The three sets form a complete tri-partite graph, where each wci is connected to all the words in L and R. Given this framework, two algorithms are presented: the first takes the number of topic segments to be identified as a parameter, while the second automatically determines the number of topics.

4.1.1 Number of Topics is Known

Given a document formed of sentences {s1, s2, ..., sm}, the algorithm passes a sliding window through the sentences. Let Li, Ci and Ri be the contexts defined with respect to si. Ci is the set of words that occur in sentence si. Li is the left context, formed of the words occurring in any of the sentences {si−Ω, ..., si−1}, where Ω is the window size. Similarly, the right context of the ith sentence, denoted by Ri, is the set of words that occur in any of the sentences {si+1, ..., si+Ω}. Since these are defined as sets, a word may occur only once in any of Li, Ri and Ci, even though it may appear more than once in the corresponding sentences.

Figure 4.1: Model of three consecutive contexts

A topic boundary, by definition, lies between the last sentence of one topic and the first sentence of the subsequent topic. Intuitively, the lexical cohesion of these two sentences should concentrate on opposite sides. For each sentence, a score reflecting the change in lexical cohesion is calculated by comparing Ci with Li and Ri using a vote casting scheme, where each word wci ∈ Ci casts its vote for the left or the right context. The votes are weighted by the difference of the maximum semantic relatedness values in the Li and Ri contexts. This difference denotes both the strength and the direction of the lexical cohesion of Ci. Since this difference characterizes the algorithm, it will be referred to as Differential Lexical Cohesion Analysis (DLCA) in the remaining discussion. Figure 4.2 presents the context score calculation for a sentence si with the DLCA algorithm. If the lexical cohesion is concentrated towards the left context, then scorei is negative, and it is positive if the cohesion is concentrated towards the right context.

1:  scorei, scoreLi, scoreRi ← 0
2:  for each wcj ∈ Ci do
3:      lMax ← max_{wlk ∈ Li} SR(wcj, wlk)
4:      rMax ← max_{wrk ∈ Ri} SR(wcj, wrk)
5:      if lMax > rMax then
6:          scoreLi ← scoreLi + (lMax − rMax)
7:      else if lMax < rMax then
8:          scoreRi ← scoreRi + (rMax − lMax)
9:      end if
10: end for
11: scorei ← scoreRi − scoreLi

Figure 4.2: Algorithm calculating the sentence context scores based on a voting scheme

Figure 4.3 shows a visualization of the scores for a document, for both hypothetical and real data. In Figure 4.3(a), the expected perfect result of the DLCA method is depicted. If all the lemmas in Ci are associated with words in Ri (i.e. attain their maximum semantic relatedness values there), the color is purely white. On the other hand, when they are all associated with lemmas in Li, the sentence is shown in black. In such an ideal setting, the maximum level of contrast is expected to be at the topic boundary. The intra-topic sentences should slowly become darker as the right context leaves the topic while the left context begins to span it. For a long topic that is larger than the window size Ω, both the right and left contexts will span the same topic at some point; in this case the semantic relatedness will not have a significant bias towards the right or left context. The window size Ω can be selected as a number that is higher than the average topic size.

Figure 4.3(b) shows the visualization of four concatenated news articles. While there are many fluctuations in the direction of semantic relatedness within a topic, the topic boundaries exhibit a stable pattern. For example, the 7th sentence of the third topic is more related to its left context, which may in fact point to a subtopic within the news article. The contrast, i.e. the semantic relatedness difference surrounding a topic boundary, is almost always significant. In order to further exploit this pattern, numerical differentiation of scorei is used. Since scorei = scoreRi − scoreLi, the two-point numerical differentiation of scorei results in Equation 4.1. Sentences with a high positive score′i value are considered to be the last sentence of a topic. Thus, the sentences with the top score′i scores are selected as topic boundaries by the DLCA algorithm.

score′_i = scoreR_{i+1} − scoreL_{i+1} − scoreR_i + scoreL_i    (4.1)
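A compact sketch of the DLCA scores of Figure 4.2 together with the differentiation of Equation 4.1; here `sr` is any semantic relatedness function SR(wi, wj) and `sentences` is a list of word sets, both assumed to be given.

def dlca_scores(sentences, sr, omega):
    # Returns score'_i for i = 0 .. m-2; a high value suggests that sentence i
    # is the last sentence of a topic segment.
    m = len(sentences)
    score_l, score_r = [], []
    for i in range(m):
        left = set().union(*sentences[max(0, i - omega):i]) if i > 0 else set()
        right = set().union(*sentences[i + 1:i + 1 + omega]) if i + 1 < m else set()
        votes_l, votes_r = 0.0, 0.0
        for w in sentences[i]:
            l_max = max((sr(w, x) for x in left), default=0.0)
            r_max = max((sr(w, x) for x in right), default=0.0)
            if l_max > r_max:
                votes_l += l_max - r_max      # the word votes for the left context
            elif r_max > l_max:
                votes_r += r_max - l_max      # the word votes for the right context
        score_l.append(votes_l)
        score_r.append(votes_r)
    # Two-point differentiation of score_i (Equation 4.1).
    return [score_r[i + 1] - score_l[i + 1] - score_r[i] + score_l[i]
            for i in range(m - 1)]

# Toy usage with exact word identity as a stand-in relatedness function.
toy = [set("a b c".split()), set("b c d".split()), set("x y z".split()), set("y z w".split())]
print(dlca_scores(toy, lambda u, v: 1.0 if u == v else 0.0, omega=2))   # peaks after sentence 1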

Figure 4.3: Visualization of semantic relatedness; panel (a) shows hypothetical data and panel (b) shows real data. The set of sentences is represented by rectangles, and the boundaries are denoted by a line. Sentences with high semantic relatedness to Li are denoted by dark colors, whereas high relatedness to Ri is colored with bright colors.

The score′i scores of twenty-five concatenated Reuters news articles are shown in Figure 4.4. The article boundaries usually exhibit high values of score′i. In the evaluations, when a cut-off point of 5.22 is used (the dashed horizontal line in the figure), a recall of 79% and a precision of 73% are achieved. As can be seen, most of the real topic boundaries have a high score′i, and only a small number of them are just below the cut-off point. There is also a small number of false positives whose score′i values are above the cut-off point.

Calculation of the score′i values depends only on the local contexts, limiting the space and running time of the algorithm to the number of words in the local context. This is an advantage compared to the algorithms in the literature that require similarity calculations between all sentences and use dynamic programming [56, 58, 61, 59]. In other words, this algorithm is more suitable for segmenting long documents containing thousands or even millions of sentences. A similar local analysis is also performed in TextTiling [54], which only uses word repetition for lexical cohesion analysis.

An important problem concerns short sentences that contain no or few content words. At the extreme, when a sentence without any content words occurs, its score′i value will be high, as scorei+1 will be relatively higher than scorei, which is 0. This can result in small segments of only one short sentence. In order to avoid this problem, a smoothing factor is integrated into the algorithm by including the immediate left and right neighbours of the sentence to calculate a left skewed score, score^(L)_i, and a right skewed score, score^(R)_i. The score of si is then the difference score^(L)_{i+1} − score^(R)_{i}, which not only resolves the short sentence problem but also potentially increases the difference at a true topic boundary.

The window size Ω can ideally be determined relative to the average segment size. However, using a large Ω value does not degrade the performance of the algorithm, as the maximum operation is robust to the noise created by including sentences that span a topic further away. Selecting Ω as a value larger than the average segment size achieves good results.

Figure 4.4: Plot of score′i for a document set from the Reuters corpus, where the y-axis denotes the value of score′i and the x-axis is the sentence index in the text. The dashed horizontal line drawn at y = 5.22 is the cut-off point used for this corpus. Star marked data points denote true article boundaries; circles mark false positive classifications.

This version of DLCA performs a linear-time local analysis of the document, with a low space complexity bounded by the window size Ω. Since its running time is linear in the number of sentences, it can be used to decompose long documents. As will be presented in the results, the top scoring sentences are usually true topic boundaries. These two properties can be exploited to create smaller subproblems by extracting a few segments with DLCA, which can then be further decomposed using more demanding algorithms that perform global analysis.

4.1.2 Number of Topics is Unknown

While the DLCA scores obtained above are good indicators of topic boundaries, a topic segment can itself be composed of subtopics. Through local analysis it is not possible to identify whether a segment is a subtopic or not; only a difference is modelled. Thus, it is not an easy task to identify the number of topics using only the DLCA scores. Simple methods that determine a stopping criterion based on the slope of the DLCA scores were not able to infer the number of topics accurately.

Instead of using fixed sliding windows, the segments are grown by repeatedly merging adjacent clusters that exhibit lexical cohesion towards either their left or their right neighbour. This both removes the window size parameter and makes it possible to devise a stopping condition, as the complete segment can be observed.

The algorithm initializes with m clusters, where each sentence is assumed to be a segment on its own. The whole document is represented as a chain of clusters, where each cluster keeps track of its left and right clusters. If a sentence contains fewer than two content words, it is immediately merged with its left cluster, under the assumption that it follows a discussion introduced earlier in the text.

The decision to merge clusters is based on the semantic relatedness scores between the words. A graph is formed for each cluster Ci from the sets Ci−1, Ci and Ci+1, where the edges are defined between the members of Ci and the other words. Let wi1 be the first word of Ci and w(i−1)1 be the first word of Ci−1. In this setting, with a random walk model, the probability of moving from Ci to Ci−1 is defined through the transition probabilities starting in Ci and ending in Ci−1, as given in Equation 4.2. The transition probability p(wi ⇝ wj) is the probability of a transition from wi to wj, which will be defined with two different random walk models. Similarly, the probability of moving to the right context is defined as in Equation 4.3.

P(C_i \leadsto C_{i-1}) = \frac{1}{|C_i|} \sum_{w_i \in C_i \,\wedge\, w_j \in C_{i-1}} p(w_i \leadsto w_j)    (4.2)

P(C_i \leadsto C_{i+1}) = \frac{1}{|C_i|} \sum_{w_i \in C_i \,\wedge\, w_j \in C_{i+1}} p(w_i \leadsto w_j)    (4.3)

The transition probability p(wi ⇝ wj) can be defined in terms of the semantic relatedness values as in Equation 4.4, where w′ is a neighbour of wi in Adj(wi) and can belong to either Ci−1 or Ci+1. This biased probability encourages transitions to semantically related words more than to unrelated words. An intuitive but misleading algorithm might decide to merge a cluster with its left neighbour if P(Ci ⇝ Ci−1) > P(Ci ⇝ Ci+1), and with its right neighbour otherwise. However, this strategy is bound to create a single large text segment, as it is more probable to end in a cluster with more words, and larger clusters will always win against smaller clusters. Instead, a test of randomness gives better results and is able to leave a cluster as it is if there is no bias towards either the left or the right context.

p(w_i \leadsto w_j) = \frac{SR(w_i, w_j)}{\sum_{w' \in Adj(w_i)} SR(w_i, w')}    (4.4)

In order to test whether P(Ci ⇝ Ci−1) is larger than it would be under a completely random walk, an alternative probability model is defined using p′(wi ⇝ wj) as in Equation 4.5. If P(Ci ⇝ Ci−1) > P′(Ci ⇝ Ci−1) and P(Ci ⇝ Ci+1) < P′(Ci ⇝ Ci+1), then the cluster is merged with its left neighbour, as there is a bias towards the left that exceeds the random model while there is no such bias towards the right context. Since the probability P(Ci ⇝ Ci) is also defined, a cluster spanning a complete segment will tend to have a higher probability of staying within itself. The algorithm requires no stopping condition, as a cluster stops merging once it is self-contained. When no more merges are possible, the clusters are reported as the final segments.

p'(w_i \leadsto w_j) = \frac{1}{|C_i| \, (|C_{i-1}| + |C_{i+1}|)}    (4.5)

Naturally, some clusters will have small divergences from the random model, while others will diverge more. Furthermore, a cluster with no bias towards an adjacent cluster can develop a bias when its neighbour grows. In fact, this algorithm is expected to start merging from the segment boundaries towards the other end of each segment; for a cluster in the middle of a segment, the probability bias will become more evident as its neighbouring clusters grow. In order to exploit this observation, clusters with the largest probability differences are merged earlier, as they are expected to lie on a segment boundary. Using a binary heap keyed by the log likelihood ratios of P(Ci ⇝ Ci+1) or P(Ci ⇝ Ci−1) to their truly random counterparts, the cluster with the largest probability difference is processed first. When a cluster is merged, only three clusters' probabilities are updated: the newly merged cluster, the remaining neighbour of the merged cluster and the new neighbour of the merged cluster.
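A simplified sketch of the merge test for a single cluster, following Equations 4.2-4.5 as literally as possible; the self-transition probability P(Ci ⇝ Ci) and the heap-ordered processing described above are omitted, and all names are illustrative.

def cluster_prob(sr, c_i, target, other):
    # P(C_i ~> target) under the relatedness-biased walk (Equations 4.2/4.3);
    # p(w ~> x) is normalised over all words adjacent to w (Equation 4.4).
    adjacent = target | other
    total = 0.0
    for w in c_i:
        denom = sum(sr(w, x) for x in adjacent)
        if denom > 0.0:
            total += sum(sr(w, x) for x in target) / denom
    return total / len(c_i)

def random_cluster_prob(c_i, target, c_prev, c_next):
    # The same quantity under the uniform transition model of Equation 4.5.
    p_prime = 1.0 / (len(c_i) * (len(c_prev) + len(c_next)))
    return len(target) * p_prime

def merge_direction(sr, c_prev, c_i, c_next):
    # Merge only when the observed bias exceeds the random model in one
    # direction while staying below it in the other.
    p_l = cluster_prob(sr, c_i, c_prev, c_next)
    p_r = cluster_prob(sr, c_i, c_next, c_prev)
    r_l = random_cluster_prob(c_i, c_prev, c_prev, c_next)
    r_r = random_cluster_prob(c_i, c_next, c_prev, c_next)
    if p_l > r_l and p_r < r_r:
        return "left"
    if p_r > r_r and p_l < r_l:
        return "right"
    return None

# Toy usage with exact word identity as the relatedness function.
overlap = lambda u, v: 1.0 if u == v else 0.0
print(merge_direction(overlap, {"goal", "match"}, {"match", "score"}, {"court", "trial"}))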

98 4.2 Dataset and Evaluation

Topic segmentation is a subjective task, and even for a single document it may not be possible to reach a consensus between human judges. For instance, Hearst [54] reports the disagreement on the topic boundaries of an article called Stargazers annotated by seven human subjects. Due to this subjective nature, artificial data sets are built by concatenating news articles to form a single text. This task is called news story segmentation where the goal is to find the concatenation points, neglecting any sub-topic that may reside in one of the articles.

In order to compare our algorithms with the state-of-the-art algorithms in English, they are evaluated on the corpus used in Stokes [10]. The corpus consists of 40 files, each containing 25 different news articles gathered from Reuters. For the Turkish language, a new corpus is formed by concatenating Turkish news articles gathered from BilCol [115]; 25 different files are formed, each containing 50 different articles.

As a baseline, a truly random algorithm that assigns random topic segments at regular intervals is reported. For the English experiment, Choi [56], TextTiling [54] and SeLeCT (Stokes et al.) [10] are reported. For the Turkish experiment, Choi's topic segmentation algorithm is modified to use Turkish stemming and stop word removal.

In order to measure the effectiveness of the algorithms, recall and precision are reported. Recall is the proportion of correctly identified topic boundaries to the number of true boundaries. Precision is the proportion of the assigned boundaries that are true boundaries. When an algorithm has access to the number of topics, its precision and recall values are exactly the same. In the literature, precision and recall are criticized because they over-penalize methods that assign topic boundaries close to the true boundaries, which can be considered near-misses. Instead of recall and precision, the Word Error Rate (Pk) [67] and WindowDiff [68] evaluation metrics are used in topic segmentation. The Pk value quantifies the inconsistency between the reference segmentation and the system segmentation in terms of words, whereas WindowDiff quantifies the error rate in terms of the number of segment boundaries between the reference and the system segmentation.
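A textbook-style sketch of the two metrics over segmentations given as lists of segment sizes; this follows the usual definitions of Pk and WindowDiff and is not taken from the thesis implementation.

def _labels(sizes):
    # e.g. [3, 2] -> [0, 0, 0, 1, 1]: maps each unit index to its segment id.
    return [seg for seg, size in enumerate(sizes) for _ in range(size)]

def pk(reference, hypothesis):
    ref, hyp = _labels(reference), _labels(hypothesis)
    k = max(1, round(len(ref) / (2 * len(reference))))   # half the mean true segment size
    errors = sum((ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
                 for i in range(len(ref) - k))
    return errors / (len(ref) - k)

def window_diff(reference, hypothesis):
    ref, hyp = _labels(reference), _labels(hypothesis)
    k = max(1, round(len(ref) / (2 * len(reference))))
    errors = sum((ref[i + k] - ref[i]) != (hyp[i + k] - hyp[i])   # boundary counts differ
                 for i in range(len(ref) - k))
    return errors / (len(ref) - k)

# Three reference segments of five sentences each versus a near-miss hypothesis.
print(pk([5, 5, 5], [5, 4, 6]), window_diff([5, 5, 5], [5, 4, 6]))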

4.3 Results

The topic segmentation problem is valuable from two aspects. First, it sheds light on the value of semantic relatedness algorithms in a high level task, providing a common test-bed for comparing different methods. Second, it can be used as a pre-processing step for other high-level tasks such as summarization. In order to investigate these aspects, a comparison between the semantic space and WordNet based measures is given first. Following this, the performance of the algorithm is compared with other topic segmentation algorithms previously defined in the literature.

4.3.1 Effect of Semantic Space Parameters

The parameters of the semantic space play an important role in the effectiveness of the algorithm. The vocabulary size (word removal factor), number of SVD dimensions retained, stop word removal, stemming algorithm and co-occurrence window size, all determine the effectiveness of the algorithm.

In Chapter 3 the semantic spaces are evaluated with the Word Association task and WordNet synonymy recall values. Although there is a level of consensus between the results of the two tasks, for some parameters there are deviations. While in the WordNet synonymy task the semantic space achieving the best results uses a co-occurrence window of 1 and a reduction factor higher than 800, in Word Association better results are achieved with a co-occurrence window size of 3 and 700 SVD dimensions retained. In order to investigate these slightly contradicting results, the extrinsic evaluation method, topic segmentation, is used.

Figure 4.5 plots the word error rate Pk of the DLCA algorithm when the number of topics is given as a parameter, where the x-axis is the co-occurrence window size in the Turkish experiment. As can be observed, the error rate declines to 0.1188 when the co-occurrence window size is 4 and continues to decline slowly. In the WordNet synonymy recall task the best recall value is achieved when the co-occurrence window is 1, while in the Word Association task a peak is observed at a window size of 4, with fluctuations for higher window sizes. It is possible to state that in the WordNet synonymy task the best results are obtained with the smallest window size of 1, while in the other tasks a window size of 3-4 achieves the best results. The performance of topic segmentation thus exhibits a pattern similar to Word Association rather than to the WordNet synonymy task. Similar results are also observed in the English experiment for a co-occurrence window of size 3, which achieves the best Word Association and good topic segmentation results. In the light of these results, it is possible to conclude that when only synonyms are targeted by the semantic relatedness function a small co-occurrence window size should be preferred, whereas for a general semantic relatedness function it should be as high as 3-4 words. Note that these values are similar in both languages, as this parameter does not depend on the corpus size.

Figure 4.5: Plot of the word error rate Pk against the size of the co-occurrence window

The number of SVD dimensions retained is another parameter for which a difference between the WordNet synonymy task and topic segmentation is observed.

Figure 4.6 shows the plot of Pk values against the number of retained SVD dimensions. In the WordNet synonymy task a higher number of dimensions (i.e. 800) achieves the best results, but in Word Association there is a peak at 400 dimensions, which is in agreement with the low error rate in topic segmentation. In the English experiments, using 700 SVD dimensions achieves the best results in Word Association, while for the WordNet synonymy task a value as high as 3200 achieves a good recall value. While Turkish achieves the best results with a dimension reduction of 400, English reaches its minimum error with 700. This difference between the languages can be explained by the difference in corpus size: as the English corpus is larger, the signal-to-noise ratio of the co-occurrence matrix is greater, and more dimensions can be beneficial for representing the semantic space.

Figure 4.6: Plot of the word error rate Pk against the number of SVD dimensions

The importance of stemming is more evident in topic segmentation. The word error rate increases from 0.1262 up to 0.1616 when stemming with the Porter stemmer is disabled. When only the stop word removal is disabled, the error rate increases to 0.1472. Pruning the co-occurrence matrix by removing infrequent words does not have a notable effect on topic segmentation. Using PMI instead of entropy does not change the effectiveness of the system significantly: while the word error rate of the semantic space using PMI as the weighting scheme is 0.0677, the same configuration with entropy weighting achieves a slightly worse word error rate of 0.06944. While the word error rates differ slightly, the recall is exactly the same, 0.865, for the two weighting schemes.

The effectiveness changes significantly depending on the parameters of the semantic space; the difference between the worst and best performing parameter sets is more than 6 percent in both the English and Turkish experiments. The parameter values that yield the best results depend on the characteristics of the background corpus from which the semantic space is built. However, the experiments indicate that a semantic space built using a co-occurrence window of 3-5, a dimension reduction of 500-700, stop word removal, pruning the least frequent words so that the vocabulary is reduced to about 50,000 words, and a stemming algorithm achieves the best results.

In Turkish, the full-dimensional semantic space of |V| dimensions (more than 60K for both languages) achieves a word error rate of 0.1772 and a recall of 0.693; this improves significantly when its dimensions are reduced to 400 using SVD truncation. The reduced semantic space is able to find 0.865 of the correct topic boundaries and has a word error rate of 0.0677. Not only is the effectiveness of the SVD truncated space superior to the full-dimensional semantic space, but its running time is also at least a few orders of magnitude lower. Since in topic segmentation thousands or even millions of semantic relatedness calculations are performed, the efficiency of the semantic relatedness function directly determines the running time of the whole system. The cost of a cosine similarity calculation in SVD reduced spaces is constant, usually a low value between 400 and 800 dimensions; in the full-dimensional space it depends on the number of non-zero dimensions of the compared words and can be equal to |V| in the worst case.

4.3.2 Comparison with WordNet based Semantic Relatedness Functions

WordNet based topic segmentation and summarization algorithms exist in the literature [10, 9, 4]; nevertheless, none of them uses semantic relatedness functions. The experiments in Chapter 3 gave contradictory results for WordNet based methods: while in the Word Association task the WordNet based semantic relatedness methods achieve the best results, in the TOEFL synonymy task they achieve very poor results, close to random selection.

Table 4.1 shows the word error rate Pk and precision/recall values for the DLCA algorithm when the number of topics is known. Note that the precision and recall values are the same, as the algorithm has access to the number of topics. Although WordNet based semantic relatedness methods are effective in the Word Association task, they are not effective in the topic segmentation task. Among the three WordNet semantic relatedness functions, Resnik semantic distance [100], which uses occurrence counts from a corpus, is the most effective one. DLCA using WordNet based semantic relatedness functions still achieves better results than random selection of topic segments. The SVD truncated semantic space built from Wikipedia is able to find 87% of the topic boundaries correctly.

Semantic Relatedness Function      Recall & Precision    Pk
SVD Reduced Semantic Space         0.871                 0.064
Full Dimensional Semantic Space    0.693                 0.177
Wu&Palmer                          0.232                 0.612
Resnik Semantic Distance           0.673                 0.197
Leacock&Chodorow                   0.490                 0.306
Random                             0.093                 0.490

Table 4.1: Comparison of different semantic relatedness functions used in DLCA for Reuters news articles

When considering the performance of WordNet based semantic relatedness functions, it should be kept in mind that most of the relationships in WordNet are between nouns; the number of relationships between other word classes is low. For example, in WordNet it is not possible to relate the noun murder and the verb kill. This shortcoming plays an instrumental role in the effectiveness of WordNet based semantic relatedness functions when they are used in a high level natural language task. It should also be noted that the Resnik method, which outperforms the other WordNet SR functions, uses occurrence statistics along with WordNet classical relationships; the performance gain can be attributed to these co-occurrence statistics.

While WordNet based methods perform better in the Word Association task, both the full-dimensional and reduced semantic spaces achieve better results in topic segmentation. Reduced semantic spaces have an additional advantage over both full-dimensional semantic spaces and WordNet based methods, as their running time is notably lower. Performing breadth-first search or Dijkstra's algorithm on the WordNet graph, which contains more than 150 thousand words, takes more time than a cosine similarity calculation in a reduced semantic space. In terms of running time, the full-dimensional semantic space is the worst, being a few orders of magnitude slower than the reduced semantic space.

Figure 4.7: Recall, precision and F-measure of DLCA, along with the recall of a hypothetical perfect segmenter, are shown on the y-axis. The x-axis shows the number of sentences selected as a proportion of the total topic boundaries, and the corresponding DLCA scores are given below.

4.3.3 Effectiveness of Semantic Relatedness based Topic Segmentation Algorithms

Figure 4.7 demonstrates the relationship between the number of topic boundaries and the DLCA score, in which the x-axis denotes the proportion of the selected sentences to the number of actual boundaries. The value x=60% represents the sentences with top DLCA scores of at least 7.3. Selecting these sentences as topics yields a recall of approximately 55%. Perfect recall represents the scores of a hypothetical perfect segmentation system that is able to find all topics. This ideal system will have a linear increase until x=100% and will stay constant after that point. It is important to note that DLCA achieves a similar curve up to 70% with a precision higher than 95%.

Table 4.2 presents the recall, precision, and Word Error Rate Pk scores of the systems tested on the English Reuters news articles dataset. In the table, DLCA denotes the method with knowledge of the number of topic boundaries and DLCAb is the version of DLCA that determines the number of topics itself. These two versions of the algorithm are compared with the TextTiling [54], C99 [56], SeLeCT [10], BayesSeg [61], MinCut [59] and UI [58] algorithms. The parameters of BayesSeg, MinCut and UI are set to the same configurations as in Eisenstein and Barzilay [61]. The results of the experiment on the same corpus in Stokes [10] are reproduced.

System         Recall    Precision    Pk
DLCA           0.871     0.871        0.064
DLCAb          0.796     0.761        0.110
C99b           0.749     0.724        0.128
SeLeCTb        0.606     0.791        0.191
TextTilingb    0.321     0.410        0.221
BayesSeg       0.886     0.886        0.043
UI             0.884     0.884        0.045
BayesSegb      0.862     0.870        0.059
UIb            0.659     0.912        0.133
MinCut         0.358     0.358        0.236

Table 4.2: Topic segmentation scores for Reuters news articles

DLCA performs a local analysis of the text by processing it in a linear fashion. These two attributes are common to both TextTiling and SeLeCT. DLCA performs its analysis on text blocks, while TextTiling uses gaps, i.e. the similarity of two adjacent text blocks. We believe that using gaps instead of text blocks is limited, since gaps do not fully capture the change of lexical cohesion in both directions. Another advantage is the integration of a semantic relatedness function instead of relying solely on word reiteration. These two main differences significantly increase the accuracy of topic segmentation. The SeLeCT algorithm uses lexical chains and WordNet classical relationships; in SeLeCT, a concentration of lexical chain initiations points to a topic shift. We believe that this model is too strict and prone to errors in word sense disambiguation and to the shortcomings of WordNet. One such error may result in a discontinuity in the lexical chain, and thus in the modelled topic. The DLCA model is more flexible and does not try to model the topics; rather, it concentrates on finding the topic shifts.

When compared to the more recent algorithms BayesSeg and UI, which perform a global analysis, the performance of DLCA is competitive but below these algorithms. It should be noted that while the recall and precision values are close, the word error rate of DLCA is about two percentage points worse. Since DLCA performs a local analysis and concentrates on topic shifts, a false boundary is usually further away from the true boundary. In MinCut, BayesSeg and UI, dynamic programming is used to model the segments, which forces false boundary assignments to be near the true boundaries. Thus, the Pk values of DLCA are worse relative to its recall values. When the number of boundaries is not given as a parameter, DLCA outperforms UI. While using dynamic programming with semantic relatedness promises better results, it is not an easy task: the main challenge is the running time, since in a dynamic programming method semantic relatedness for all pairs has to be calculated, which increases the running time significantly. Devising a dynamic programming based method able to consider the inter and intra segment similarities is left as important future work.

Table 4.3 shows the results of the Turkish experiment for the DLCA, DLCAb (DLCA inferring the number of topic boundaries), Choi's C99 [56] and UI [58] algorithms. DLCA achieves better results than the other algorithms; when the number of topic boundaries is known, it is able to find 82% of the true topic boundaries. Surprisingly, even though UI is effective in English, its performance diminishes in Turkish. Although its language dependent components, stop words and stemming, are adapted for Turkish, its performance is relatively low.

System  Recall  Precision  Pk
DLCA    0.821   0.821      0.116
DLCAb   0.681   0.745      0.163
C99b    0.524   0.553      0.264
UIb     0.395   0.887      0.264
UI      0.631   0.631      0.167

Table 4.3: Topic segmentation scores for Turkish news articles

The results of both experiments show that, when compared to other segmentation algorithms that use only word repetition or WordNet classical relationships, semantic relatedness functions are able to improve topic segmentation performance. However, the results of the dynamic programming based algorithms indicate that there is still room for improvement. It should be noted that the performance of DLCA in Turkish and English is similar, and in both languages it is able to find more than 80% of the true topic boundaries. Furthermore, the correct boundaries found usually have the highest DLCA scores; in cases where only the most significant topic boundaries are required (ones where the topic of the text changes sharply), stopping the algorithm early can be useful for yielding higher precision.

Chapter 5

Automated Text Summarization

Automated text summarization aims to create concise, representative texts shorter than their original documents. One of the approaches in summarization is to select important sentences from the original document to form an extract. However, how to define what is important is rather vague and difficult. The topical structure of the text can be used as an important clue. When a topic is introduced, usually its first few sentences describe the new idea more generally before describing its details. Based on this observation, a summarization algorithm can select the first sentences of the most salient topic segments.

This same observation is also exploited in our previous work, which uses lexical chains [3, 4] built from WordNet classical relationships to determine the sentences where a new idea is presented. It is assumed that the start and end positions of lexical chains correspond to topic segments. Building on the same idea, instead of using only WordNet classical relationships, the topic segmentation methods described and evaluated in Chapter 4 are deployed for summarization.

This chapter first describes the developed topic segment aware summarization algorithm, which orders topic segments with respect to their importance. Following this, the Turkish and English language corpora and the experiment settings are presented. The chapter concludes with a presentation of the results and a discussion based on them.

5.1 Segment Salience and Sentence Extraction

In summarization, a document formed of sentences denoted by S = {s0, s1, ..., sn} is processed and a subset of S is returned as an extract. Through topic segmentation algorithms it is possible to further decompose the document into topic regions. A topic segment Ti is a linear sequence of sentences, where bi denotes the starting sentence's index of Ti and ei denotes the ending sentence's index of Ti. All the sentences between bi and ei are in this topic segment, inclusively. Let m be the number of segments found for the document; then it follows that b0 = 0 and em = n.

In order to select sentences from S, the importance of the segments is determined. In a similar fashion to Radev et al. [116], the saliency of a segment is formulated as its centrality. Centrality measures the similarity of a text block's vector to the centroid of the whole document. A text block with a high similarity to the centroid of the document is considered important, as it contains the most dominant information entailed in the document.

Let C(k, l) be a function which forms a vector, where the words in the document are the dimensions, and the weight of the jth dimension C(k, l)j is the average tf*idf value of the jth word. The tf*idf weights are composed of two components, where the term frequency (tf) is the number of times the word occurs in sentences with an index between k and l. The inverse document frequency is calculated using the frequencies in a background corpus, which is Wikipedia in our case. A topic segment Ti is transformed to a vector using the function C(bi, ei). Similarly, the centroid of the document is formulated as C(0, n). The centrality of Ti is the cosine similarity of C(0, n) and C(bi, ei), as in Equation 5.1.

Centrality(T_i) = \frac{\sum_{j=0}^{V} C(b_i, e_i)_j \, C(0, n)_j}{\sqrt{\sum_{j=0}^{V} C(b_i, e_i)_j^2} \; \sqrt{\sum_{j=0}^{V} C(0, n)_j^2}}    (5.1)

In certain genres, such as news articles and journal articles, sentences in the first portions of the document are usually indicative and important. In order to incorporate this clue into topic saliency, a second component representing the position of the segment is calculated. Equation 5.2 is the position component for measuring segment importance. The position score of Ti is 1 when i = 0 and gets lower for segments closer to the end of the document.

Position(T_i) = \frac{n - b_i}{n}    (5.2)

The final segment score is a linear combination of the position and centrality scores. The first sentences of the top scoring segments are included in the summary until the target summary size is reached. Since the saliency score represents the importance of the segment, the most important topics in the document are expected to be covered by following this sentence selection procedure.
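A minimal sketch of this scoring and selection procedure follows; the tokenized-sentence input, the idf table and the mixing weight alpha are illustrative assumptions rather than the exact settings of the thesis experiments.

    import math
    from collections import Counter

    def centroid(sentences, idf):
        """Average tf*idf vector of a list of tokenized sentences (C(k, l))."""
        tf = Counter(w for s in sentences for w in s)
        n = max(len(sentences), 1)
        return {w: (c / n) * idf.get(w, 1.0) for w, c in tf.items()}

    def cosine(u, v):
        dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def summarize(sentences, segments, idf, size=3, alpha=0.5):
        """sentences: tokenized sentences; segments: (b_i, e_i) index pairs."""
        doc_centroid = centroid(sentences, idf)
        n = len(sentences) - 1
        scored = []
        for b, e in segments:
            cent = cosine(centroid(sentences[b:e + 1], idf), doc_centroid)
            pos = (n - b) / n if n else 1.0
            scored.append((alpha * cent + (1 - alpha) * pos, b))
        # first sentence of each top scoring segment, until the target size
        return [sentences[b] for _, b in sorted(scored, reverse=True)[:size]]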

5.2 Corpus and Evaluation

Evaluating summarization algorithms is a difficult task and an active research area. A summary's quality can be considered from different aspects, such as the importance of the selected content and presentation quality. Presentation quality itself is composed of two aspects: grammatical correctness and coherence. Since the algorithm used builds extracts and does not involve any natural language generation, its output should be grammatically correct, given that the sentences selected from the original documents are. On the other hand, there can be coherence issues in the formed extracts, as the coherence links tying the selected sentences may be excluded. However, since the first sentences of topic segments are extracted, such defects are less likely.

Since deciding what is more important in a document is a subjective task, the judgements of multiple humans are desirable. Ideally, multiple human judges would evaluate and assign scores for each system generated summary. However, this requires substantial manual work, which cannot be automated. Instead, the human judges are asked to write abstracts for each document in the corpora, and the system generated summaries are compared with these. Although less accurate and more superficial, this technique can be repeated in order to measure improvements in summarization techniques.

One of the most widely used measures is Recall Oriented Understudy for Gisting Evaluation (ROUGE) [88]. ROUGE calculates the recall of text units using N-grams, the Longest Common Subsequence (LCS) and a weighted version of LCS (WLCS). The N-gram variant counts the number of matching word N-grams for a given N. LCS finds the longest common subsequences, which may be interrupted by other words but must appear in the same order. The weighted version of LCS rewards common sequences of uninterrupted text units. A large overlap with the model summary is evidence that content thought to be important by at least one person is included. However, a single model may not cover all possibly important content deserving to be included in a summary. For this reason, recall rather than precision values are usually reported for ROUGE scores, as having a sentence in the generated summary which is not included in the model summary does not necessarily mean that it is unimportant.
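To make the measure concrete, the sketch below computes ROUGE-N recall against a single model summary under simplifying assumptions (no stemming or stopword handling, one reference only).

    from collections import Counter

    def rouge_n_recall(candidate, model, n=1):
        """candidate/model: token lists; returns n-gram recall."""
        def ngrams(tokens):
            return Counter(tuple(tokens[i:i + n])
                           for i in range(len(tokens) - n + 1))
        cand, ref = ngrams(candidate), ngrams(model)
        # clipped overlap of candidate n-grams with the model n-grams
        overlap = sum(min(c, ref[g]) for g, c in cand.items() if g in ref)
        total = sum(ref.values())
        return overlap / total if total else 0.0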

The summarization algorithm is evaluated for both the Turkish and English languages. For the Turkish language experiments the scientific article corpus built by Ozsoy et al. [117] is used. The corpus consists of 2 datasets, each containing 50 articles chosen from medicine, sociology and psychology journals. Each article contains an accompanying abstract written by the author of the article. These manually built ground truth abstracts are compared to the system built extracts using ROUGE scores.

The English language experiments are based on the DUC 2002 corpus of news articles. For news articles selected from Reuters, LA Times and Associated Press, multiple human judges are asked to create abstracts. For evaluation, these manually built abstracts are compared with the automatically built summaries. The corpus contains 500 documents, each accompanied by 2 manually built abstracts.

The summarization algorithm which uses the DLCA topic segmentation method is denoted as the DLCA Summarizer. For comparison, the results of the lexical chaining based summarization algorithm of Ercan and Cicekli [4] and the results of 13 participants of the DUC 2002 single document summarization competition are reproduced for the English experiments. In order to avoid clutter, only the best, average and worst scores of the DUC participants are reported.

In the Turkish language experiments the results of the DLCA summarizer that uses the Turkish semantic space are reported. Ozsoy et al. [117] introduce two novel algorithms and compare different summarization techniques that use SVD. The re-implemented SVD summarization algorithms are those of Gong and Liu [118], Steinberger and Jezek [119] and Murray et al. [120]. The novel methods introduced in their research are the Cross and Topic methods. For comparison purposes, the best and worst among the results of these algorithms are reported.

System                    DS1 ROUGE-L  DS2 ROUGE-L
Best Ozsoy et al. [117]   0.320        0.274
Worst Ozsoy et al. [117]  0.195        0.189
DLCA Summarizer           0.244        0.219

Table 5.1: ROUGE scores of the summarization experiments for the Turkish language. Results on two different datasets are presented.

Algorithms       ROUGE-1  ROUGE-2  ROUGE-L  ROUGE-W
DLCA Summarizer  0.43     0.184    0.348    0.127
Ercan & Cicekli  0.394    0.16     0.322    0.117
Average          0.401    0.179    0.329    0.12
Best System      0.485    0.231    0.4      0.147
Worst System     0.065    0.033    0.061    0.028

Table 5.2: DUC 2002 English summarization results.

5.3 Results

The ROUGE-L F-measure scores achieved on the two Turkish datasets are shown in Table 5.1. The Cross method first introduced by Ozsoy et al. [117] achieves the best results. While the results of the DLCA Summarizer are below the LSA based methods in general, it is competitive with Steinberger and Jezek [119] and Gong and Liu [118].

All ROUGE recall values for English are reported in Table 5.2. The DLCA Summarizer is below 8 other systems which participated in DUC 2002; however, it is above the average in all ROUGE measures. When compared to the summarization algorithm which uses lexical chains proposed in our earlier work [4], there is an improvement in the results.

In both languages, the DLCA summarizer achieves above-average scores. To be fair, summarization by topic segmentation alone does not seem to be an adequate method. When comparing the results of DUC 2002, it should be noted that some of the other algorithms perform sentence reduction and anaphora resolution. Among the competing algorithms, some systems focus only on the news article domain, tracking events. Reducing sentences could improve ROUGE scores, as the extracted summaries are limited in size and only a portion of a selected sentence may be important. Resolving anaphora improves performance, as anaphoric expressions are not usually used in the model summaries.

The experiments in Chapter 4 show that the performance of DLCA in topic segmentation is at a promising level. When compared to our previous work [4] and Barzilay's algorithm [9], which use lexical chains and WordNet classical relations, the DLCA summarizer achieves better results. This shows that there is an improvement in topic segmentation over lexical chain based methods when semantic relatedness functions based on semantic spaces are used.

When compared to the LSA based summarization algorithms evaluated in Ozsoy et al. [117], the DLCA summarizer performs better than some of the methods but is below the performance of their best method. While both approaches use Latent Semantic Analysis, they use it for completely different purposes. In Ozsoy et al. [117] the SVD is applied to the sentence-term matrix of the analysed document. In the DLCA summarizer, it is applied to the term-by-term matrix built from Turkish Wikipedia articles, in order to train a semantic relatedness function. The gain SVD obtains from local information resident in the analysed document is apparently more useful than the DLCA model, which tries to model the relationships between words.

Chapter 6

Keyphrase Extraction

Keyphrases are short descriptive phrases defining the underlying document. Automatically assigning keyphrases is a difficult task, as they should reflect the meaning conveyed in the document in a concise manner. A keyphrase system able to produce a list formed of phrases appearing in the document is called a keyphrase extraction system. On the other hand, if a system is able to generate keyphrases that do not appear in the document, it is called a keyphrase generation system.

A document is formed of many words and phrases; determining which of these are salient and able to distinguish the contents of the document is a task requiring semantic analysis. Thus, this task is highly related to lexical semantics and lexical cohesion. The methods described in Chapter 3 are primarily focused on modelling the semantic relationships between words. However, when individual words are combined to build a phrase, the meaning of the phrase drifts away from the contributing words. Natural language is productive in terms of building phrases; phrases can be built by combining different open class words.

The problem of determining the semantics of a phrase by considering the semantics of its words is a difficult and relatively new topic attracting interest from the research community in recent years [121, 122]. One approach for determining the semantics of phrases is to include them in the semantic spaces built [123, 124]. However, these models are not suitable for practical applications, as the semantic spaces grow exponentially. Another approach is to define operations on the semantic vector representations of individual words to form the semantic representation of phrases [121, 122]. These methods are not yet at a level applicable to practical problems. The most significant problem appears to be the amount of corpora required to model the virtually infinite number of phrases that can be built.

Fortunately, keyphrases are usually phrases commonly used in a specific domain, and it is possible to gather adequate statistics to model their meanings. Instead of using the semantic relatedness methods defined in Chapter 3, a simpler methodology is implemented specifically for keyphrases. The different contexts in which a keyphrase appears are retrieved from a large background corpus. The similarities between these contexts are used to determine its saliency and relevancy to the analysed document.

The saliency of a keyphrase is formulated as a function of its ambiguity. I assume that a phrase used in many different domains to convey different meanings is not suitable as a keyphrase. A keyphrase should be commonly used to represent a single meaning. Thus, the different contexts in which the keyphrase appears should be related to each other. This same idea also applies to predicting the quality of queries in search engines. Query performance prediction (QPP) methods try to estimate how a query will perform and how ambiguous it is. Even if a phrase is unambiguous and common, it may not be appropriate for the document analysed. In order to ensure the relevancy of the keyphrase to the document, its similarity to the retrieved documents is also measured.

This chapter first describes the algorithm developed for keyphrase extraction. The dataset and background corpus used in the experiments are then introduced. Finally, the results are presented with a discussion of the contribution of the introduced features.

6.1 Keyphrase Extraction using QPP Features

The general components of the keyphrase extraction system are depicted in Figure 6.1. The gist of the system is common to some previous works [125, 94, 91, 93], as all of them perform feature extraction and supervised classification. The first step in keyphrase extraction is the tokenization of the text into words and punctuation. Using the token stream, a candidate keyphrase list is formed from the phrases appearing in the text of the original document. For each candidate phrase w, the feature extraction component extracts a feature set using the background corpus and the original document d0. The background corpus is a sufficiently large document collection, excluding the original documents from which the keyphrases are extracted. A Naive Bayes classifier trained with documents of the same genre and their associated keyphrase lists calculates the probability of keyphraseness for each candidate phrase w. Keyphrases are selected using these scores, and the output of the system is a set of keyphrases. The keyphrase selection component simply sorts the phrases according to the keyphraseness score and returns the top keyphrases.

Figure 6.1: Components of the keyphrase extraction system.

The feature extraction component is what distinguishes our keyphrase extraction system from the others. The feature extractor uses a background corpus to determine some intrinsic properties of each phrase w, as depicted in the flowchart in Figure 6.1. The feature extractor operates on a phrase w, finding the document set D0 formed of all the documents that w appears in. Both for efficiency and for effectiveness, a subset of D0 is selected by a sampling procedure. Features are extracted from this subset D and the original document d0.

6.1.1 Candidate Phrase List Creation

Keyphrases usually appear as noun phrases in documents. Hulth [91] reports that nouns preceded by nouns or adjectives are the most common part of speech (PoS) tag patterns observed for keyphrases. In fact, the majority of the keyphrases in the corpora used in the experiments can be extracted with a simple grammar rule for finding noun phrases. In our system, in order to create a candidate keyphrase list, the text of the given document d0 is tokenized, and the PoS tags are assigned using the Stanford PoS tagger [126]. Each sequence of PoS tags satisfying the regular expression (JJ|NN)*NN is included in the candidate phrase list, where NN represents nouns and JJ represents adjectives. For example, in the sentence "This is a good/JJ machine/NN learning/NN algorithm/NN", the phrases "good machine learning algorithm", "good machine learning", "machine learning algorithm", "machine learning", "learning algorithm", "machine", "learning" and "algorithm" match the regular expression and are extracted as candidate phrases. Only the candidate keyphrases matching the regular expression are retained and evaluated by the classification algorithm.
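The sketch below illustrates the candidate creation step as a naive enumeration of all token spans matching the (JJ|NN)*NN pattern; the tagged-token input format is an assumption made for the example, and in the actual system the tags come from the Stanford PoS tagger.

    def candidate_phrases(tagged):
        """tagged: list of (word, tag) pairs; returns candidate noun phrases."""
        words = [w for w, _ in tagged]
        tags = [t for _, t in tagged]
        candidates = set()
        for i in range(len(tagged)):
            for j in range(i, len(tagged)):
                span = tags[i:j + 1]
                # every token is JJ or NN, and the last one is NN
                if all(t in ("JJ", "NN") for t in span) and span[-1] == "NN":
                    candidates.add(" ".join(words[i:j + 1]))
        return candidates

    sent = [("This", "DT"), ("is", "VBZ"), ("a", "DT"), ("good", "JJ"),
            ("machine", "NN"), ("learning", "NN"), ("algorithm", "NN")]
    print(candidate_phrases(sent))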

The PoS pattern method can detect more than 80% of all keyphrases in the research article corpus used in our experiments as candidate keyphrases. We compared this method with an exhaustive candidate extraction method, which returns all consecutive terms that do not contain punctuation as candidate keyphrases. The exhaustive method extracts all the keyphrases appearing in the text as candidate keyphrases. It detects 82% of the keyphrases in the research article corpus, which is only 2% more than the PoS pattern method. However, this method produces many more candidates than the PoS pattern method, and most of these candidates are unlikely to be keyphrases or even phrases. For example, in the corpus the exhaustive method produces approximately 35,000 candidate phrases, whereas the PoS method produces approximately 2,000 candidates per document. When the former method is used, the system must process a larger number of candidate keyphrases, which degrades the efficiency of the system. In addition, the effectiveness of the system is not improved, as having more candidate keyphrases creates noise for the classification algorithm. Since the PoS pattern method is as effective as the exhaustive method, it is preferred.

Figure 6.2: Flowchart of feature extractor.

6.1.2 Information Retrieval from Wikipedia

The information retrieval component (IRC) performs full-text search in the background corpus, which is composed of Wikipedia articles. Since the keyphrase extraction algorithm processes the document vectors and language models of retrieved documents by accessing their full texts, an offline indexing system is preferred. Document retrieval is an important factor for both the efficiency and the effectiveness of the keyphrase extraction system. This section describes how full-text search over the background corpus is performed for a candidate keyphrase given as a query, and how the returned documents are subsampled.

Given a term as a query, the IRC returns the set of articles that contain the searched term. Given a phrase (i.e. multiple terms) as a query, the IRC returns the set of articles in which the terms of the phrase appear in exactly the same configuration. In order for an IRC to support such phrase queries, either a phrase index or a positional index must be maintained [127]. For example, for the phrase machine learning, documents containing machine learning are returned, whereas those containing learning machine are not. The implementation of the IRC uses the indexing mechanisms of the Lucene search engine.1 A Lucene inverted index supports positional indices: for each term, a sorted list of the articles containing the term and its positions within their texts is stored.
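The sketch below illustrates, in plain Python rather than the Lucene API, how a positional index can answer such a phrase query by requiring consecutive positions; the index layout is an assumption made for the example.

    def phrase_postings(index, phrase):
        """index: {term: {doc_id: [positions]}}; returns the doc_ids in which
        the phrase terms appear at consecutive positions."""
        terms = phrase.split()
        if any(t not in index for t in terms):
            return set()
        # documents containing every term of the phrase
        docs = set.intersection(*(set(index[t]) for t in terms))
        result = set()
        for d in docs:
            for p in index[terms[0]][d]:
                if all(p + k in index[t][d] for k, t in enumerate(terms)):
                    result.add(d)
                    break
        return result

    index = {"machine": {1: [3], 2: [7]},
             "learning": {1: [4], 2: [2]}}
    print(phrase_postings(index, "machine learning"))   # {1}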

Given a phrase w formed of one or more terms, let D0 = {d1, d2, ..., dm} be the set of documents that contain w. If w occurs at least once in all indexed articles of the background corpus, the size of D0 may be as large as the size of the whole collection queried. Since in-depth analysis of the m returned documents is not efficient, and may result in noise due to variations in |D0| for different phrases, a subset D is sampled from D0. The sample size µ is a parameter of the system, and the size of the sample set, denoted as |D|, is min(|D0|, µ).

1 Available at http://lucene.apache.org

Higher values of µ increase the variance of |D| for different phrases, and thus create a bias towards phrases that appear in a smaller number of documents in the background corpus. A lower value of µ is not able to represent the set D0. Our empirical evaluations show that using a sample size µ = 20 avoids problems caused both by the variations in |D| and by the under-representation of the set D0.

Two sampling strategies were evaluated: random and rank-based. Random sampling simply selects µ documents from D0 at random. Rank-based sampling reduces the noise caused by documents with low frequencies of w. Accordingly, documents are ranked using a function of the frequency of the phrase w, and only the µ top-scoring documents are retained. Since the rank-based sampling strategy achieves better results than random sampling, only the results of the rank-based strategy are reported in this article. In our experiments, the Okapi BM25 ranking function is used:

BM25(w, d_i) = IDF(w) \cdot \frac{c(w, d_i) \, (k_1 + 1)}{c(w, d_i) + k_1 \left(1 - b + b \, \frac{|d_i|}{avdl}\right)}    (6.1)

IDF(w) = \log\left(\frac{N - |D0| + 0.5}{|D0| + 0.5}\right)    (6.2)

where c(w, di) is the frequency of the phrase w in the document di, and the parameters b and k1 control the scoring function's behaviour: b controls the weight of the document length, and k1 controls the weight of term frequency in the ranking. The values k1 = 2 and b = 0.75 are used in our experiments, as Jones [13] reports that these values correlate with human relevance judgements. The average document length in the background corpus is denoted by avdl,2 and |di| denotes the length of the document di.
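A direct transcription of Equations 6.1 and 6.2 into a small scoring and sampling routine is sketched below; the corpus statistics (N, document frequency, avdl) are assumed to come from the index, and the data layout is illustrative.

    import math

    # BM25 score of a phrase w for one retrieved document (Equations 6.1-6.2).
    def bm25(c_w, doc_len, df, N, avdl, k1=2.0, b=0.75):
        # df is |D0|, the number of background documents containing w
        idf = math.log((N - df + 0.5) / (df + 0.5))
        return idf * (c_w * (k1 + 1)) / (c_w + k1 * (1 - b + b * doc_len / avdl))

    # Rank the retrieved documents and keep only the top mu = 20 of them.
    def top_sample(docs, df, N, avdl, mu=20):
        # docs: list of (doc_id, phrase_frequency, document_length) tuples
        ranked = sorted(docs, key=lambda d: bm25(d[1], d[2], df, N, avdl),
                        reverse=True)
        return ranked[:mu]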

The background corpus used in our experiments is composed of English Wikipedia3 articles that are longer than 200 characters. The number of Wikipedia articles indexed is 3,326,028, containing a total of approximately 1 billion terms. The average document length of the corpus is 237.89 non-stop words.4 Wikipedia is a generic background corpus, since it is a comprehensive encyclopaedia relating to different topics.

2 The IDF component is used to weigh the effect of the terms when the query is formed of multiple terms. Equation 6.1 differs from Jones [13], as it does not use the IDF component.
3 The dump file of 30 July 2010 retrieved from http://download.wikimedia.org is used.
4 Common English prepositions and articles are excluded, and a stopword list of 452 words is employed.

In practice, the background corpus can be domain specific. For instance, in a digital library, an index of all the stored articles can be preferred instead of the generic Wikipedia corpus. We chose to use Wikipedia to have a domain-independent system.

6.1.3 QPP Measures

This section defines the QPP measures utilized in this article on the basis of the assumption that keyphrases are unambiguous query phrases that retrieve documents tied to each other by a specific domain or topic. Note that this assumption is also important for a retrieval system. For example, for the query “learning” a diverse set of documents is returned, whereas for the query “machine learning” a more refined set of documents is returned. The QPP problem [16] tries to predict the effectiveness of a query, and should encourage the query “machine learning” over “learning”. The experiments discussed in this paper evaluate the effects of different QPP techniques [128, 16, 17, 129, 130] in keyphrase extraction.

Table 6.1 lists the features used in keyphrase extraction. The first two of these features, firstPosition and tf*idf, are in-document features, and the last seven are QPP features, which can be grouped into two categories depending on the methods used in their extraction. The first category uses the geometrical properties of the retrieved documents when represented by vector space models (VSM). The second class of methods uses ideas from language models (LM) and information theory. The extraction of each feature is explained for a single phrase w, using two inputs: the document set D and the original document d0. Features 3 to 7 are calculated using vector space models, and the last two are based on language models.

Both VSM-based and LM-based methods use a bag-of-words assumption to simplify the analysis. In order to create a bag or set of words, documents are tokenized into words (terms). Words are processed with a Porter stemmer [34] to conflate inflected forms of words with their base forms. In our presentation, the set V denotes the vocabulary formed of conflated words occurring in the background corpus and d0, the function c(tk, di) returns the number of occurrences of the kth word of V in the ith document, and fk is the number of documents containing the word tk. The retrieved documents in the set D and the original document d0 are tokenized using this method.

Id  Feature Name   Value Range    Description
1   firstPosition  [0, 1]         Distance of the first occurrence of the phrase from the top of d0
2   tf*idf         [0, ∞]         Term frequency times inverse document frequency
3   CosCentrTod0   [0, 1]         Cosine similarity of the centroid of D to d0
4   avgCosTod0     [0, 1]         Average cosine similarity of the documents in D to d0
5   interSim       [0, 1]         Mean of pairwise cosine similarities of documents in D
6   CoxLewisTest   [0, 1]         Cox-Lewis clustering tendency test
7   docPerturb     [-|D|, |D|]    Average rank change in D under document perturbation
8   Clarity        [0, ∞]         KL-divergence of the D language model from the background corpus language model
9   KLDocsfromd0   [0, ∞]         KL-divergence of the D language model from the d0 language model

Table 6.1: Features used in keyphrase extraction.


6.1.3.1 Vector Space Model Based Features

A vector space model defines each document as a vector in a |V|-dimensional space, where each dimension corresponds to a word in the vocabulary V. Let dik represent the kth term's weight in the ith document, where the original document is indexed by 0 and the documents in the set D are indexed from 1 to |D|. The weight dik is calculated as shown in Equation 6.3, where N is the number of documents in the background corpus, and fk is the number of documents in which the kth term occurs. The weighting function dik is the term frequency (tf = c(tk, di)) times the inverse document frequency (idf = log(N/fk)), and is termed the tf*idf weighting.

d_{ik} = c(t_k, d_i) \, \log\left(\frac{N}{f_k}\right)    (6.3)

A document vector di consists of the weights of all terms of the vocabulary V in the ith document. Two document vectors can be compared with each other through different similarity metrics. The cosine of the angle between two vectors is one such similarity function, called the cosine similarity. The cosine similarity between documents di and dj is calculated using the equation

cosSim(d_i, d_j) = \frac{\sum_{k=0}^{|V|} d_{ik} \, d_{jk}}{\sqrt{\sum_{k=0}^{|V|} d_{ik}^2} \; \sqrt{\sum_{k=0}^{|V|} d_{jk}^2}}    (6.4)

In this model, our assumption is that a keyphrase w has to retrieve a document set that is geometrically close to d0, cohesive, less scattered and more concentrated. When defined in terms of similarity, all documents in D and the original document d0 should be similar to each other. The average similarity of the retrieved documents to d0 (avgCosTod0) is calculated using

avgCosTod0 = \frac{1}{|D|} \sum_{i=1}^{|D|} cosSim(d_i, d_0)    (6.5)

The inter-similarity feature (interSim) calculates the average pairwise similarity of the retrieved documents using

interSim = \frac{2}{|D| \, (|D| - 1)} \sum_{i=1}^{|D|} \sum_{j=i+1}^{|D|} cosSim(d_i, d_j)    (6.6)

Kwok et al. [131] use a similar metric to predict the performance of queries. Since the cosine similarity function is symmetric, the average can be calculated using only |D|(|D| − 1)/2 similarity computations.

Calculation of avgCosTod0 requires a pairwise similarity calculation of each document with the document d0. Another method, used in text categorization and summarization [116, 132], calculates the centroid of the documents and uses the similarity to the centroid instead. The centroid of D, denoted by \bar{D}, is the arithmetic mean of the document vectors, and is calculated using

\bar{D} = \frac{1}{|D|} \sum_{d \in D} d    (6.7)

The CosCentrTod0 measure is simply the cosine similarity of the document d0 and \bar{D}. CosCentrTod0 and avgCosTod0 are two similar measures. They are equal if all the input document vectors are unit vectors, that is, if the norms of the vectors are 1. In our case they are not unit vectors, and thus these two values are not equal but only similar. The CosCentrTod0 feature can be interpreted as a comparison of d0 with a virtual document formed by concatenating all the retrieved documents in D. The avgCosTod0 feature takes into account the local relationships between each retrieved document and d0, whereas CosCentrTod0 is based on a more global view of the term usage in D. Furthermore, our experiments have shown that using both CosCentrTod0 and avgCosTod0 together achieves the best results.
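The sketch below computes the three similarity-based features under the assumption that the retrieved documents and d0 are already available as tf*idf vectors (rows of a numpy array); the variable names mirror Table 6.1.

    import numpy as np

    def cos_sim(u, v):
        nu, nv = np.linalg.norm(u), np.linalg.norm(v)
        return float(u @ v / (nu * nv)) if nu and nv else 0.0

    def vsm_features(D, d0):
        """D: |D| x |V| matrix of retrieved document vectors; d0: |V| vector."""
        avgCosTod0 = float(np.mean([cos_sim(d, d0) for d in D]))
        centroid = D.mean(axis=0)
        CosCentrTod0 = cos_sim(centroid, d0)
        sims = [cos_sim(D[i], D[j])
                for i in range(len(D)) for j in range(i + 1, len(D))]
        interSim = float(np.mean(sims)) if sims else 0.0
        return {"avgCosTod0": avgCosTod0,
                "CosCentrTod0": CosCentrTod0,
                "interSim": interSim}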

Although measuring the similarity between documents is a good indicator for QPP, a coherent set may not always have a high average similarity, because of outliers and noise in D. Vinay et al. [128] define three measures: the Cox-Lewis clustering tendency, query perturbation, and document perturbation. Through a modified version of the Cox-Lewis clustering tendency test, the first measure evaluates D for the existence of either natural groupings or randomness. Vinay et al. introduce query and document perturbation; the former modifies the issued query by adding random noise, and observes the rank change in the retrieval results. In our work we apply this measure by using d0 as the issued query. Vinay et al. report that this measure is not able to predict query performance effectively. Similarly, we observe that this feature is not effective in keyphrase extraction. For this reason, we do not report the results of the query perturbation feature.

Different tests of clustering tendency exist in the literature. The Hopkins test [133] and the Cox-Lewis statistic [134] are two such tests, in which the points in the original set and randomly generated points are compared to determine the randomness of the set. If a higher similarity to the random points is observed, then the original set is randomly distributed in the space.

These tests are suitable when there are only a few dimensions, but they are not directly applicable to high-dimensional hyperspaces. In a high-dimensional space such as |V| dimensions, where V is typically in the order of thousands, a randomly generated point will most probably be distant from D, as the probability space is large. In order to limit the probability space, Vinay et al. [128] propose using a document in D as a skeleton for the random generation, and thus avoid creating a random document composed of unlikely term combinations. The Cox-Lewis test selects a document randomly from D, and assigns random term weights to its non-zero dimensions to form a random document vector rdi. The random weights are between 0 and the maximum term weight appearing in the set D. This generation strategy keeps the randomly generated points in a minimal hyper-rectangle containing all the documents in D.

Let ndi1 denote the nearest neighbour of the random vector rdi in D, and let ndi2 be the nearest neighbour of ndi1 in D. The ratio of the similarity cosSim(ndi1, ndi2) to cosSim(ndi1, rdi) tests whether the injected random vector can be more similar to a document in D than any other document in D. This test is repeated with |D|/2 random vectors, and the average of these tests is used as the Cox-Lewis score:

CoxLewisTest = \frac{1}{|D|/2} \sum_{i=1}^{|D|/2} \frac{cosSim(nd_{i1}, nd_{i2})}{cosSim(nd_{i1}, rd_i)}    (6.8)
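A sketch of this modified Cox-Lewis test is given below; the random-weight range and the use of cosine similarity follow the description above, while the seeding and vector layout are illustrative choices rather than the exact implementation.

    import numpy as np

    def cox_lewis(D, rng=np.random.default_rng(0)):
        """D: |D| x |V| tf*idf matrix of the retrieved documents."""
        def cos(u, v):
            nu, nv = np.linalg.norm(u), np.linalg.norm(v)
            return float(u @ v / (nu * nv)) if nu and nv else 0.0
        max_w = D.max()
        scores = []
        for _ in range(len(D) // 2):
            # random point built on the skeleton of a randomly chosen document
            skeleton = D[rng.integers(len(D))]
            rd = np.where(skeleton > 0,
                          rng.uniform(0, max_w, size=len(skeleton)), 0.0)
            # nearest real neighbour of the random point, then its own
            # nearest neighbour within D
            sims_to_rd = [cos(d, rd) for d in D]
            i1 = int(np.argmax(sims_to_rd))
            sims_to_i1 = [cos(D[i1], D[j]) if j != i1 else -1.0
                          for j in range(len(D))]
            i2 = int(np.argmax(sims_to_i1))
            denom = sims_to_rd[i1]
            scores.append(cos(D[i1], D[i2]) / denom if denom else 0.0)
        return float(np.mean(scores)) if scores else 0.0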

Document perturbation was first described by Vinay et al. [128] using VSM, and has recently been adapted to language models as rank robustness [130]. Similar to the Cox-Lewis test, the effect of adding random noise to the documents in D is tested. Given the document set D, when the documents are put in descending order according to their cosSim(di, dj) values (that is, numbered from the most similar to the least), the function rank(di, D, dj) is the rank of document dj with respect to document di. The value of rank(di, D, di) is equal to 1, with similarity 1.0, when documents are unique in D. The test modifies di by adding noise, and checks whether the set D contains documents more similar to the perturbed di (noise(di, α)) than di itself. In a document set formed of unrelated documents, the rank of di does not change, whereas in a coherent set the rank change is expected to be high.

Let α be a parameter controlling noise(di, α), and let the function noise(di, α) return a vector perturbed by adding noise to each dimension of the vector di. The noise is generated using a Gaussian distribution with mean equal to 0 and deviation equal to α. The overall rank change for a noise level is calculated by repeating the test 10 times for all documents in D:

docPerturb(\alpha) = \sum_{i=1}^{|D|} \sum_{j=1}^{10} rank(noise(d_i, \alpha), D, d_i)    (6.9)

The overall docPerturb feature is the slope of the line that best fits the docPerturb(α) values. The document perturbation test uses the noise levels α = {0.1, 1, 10, 100}. If the slope is positive, then the rank changes as the noise level increases, and the set is assumed to be coherent.

6.1.3.2 Language Model Based Features

Language models are used in different applications of information retrieval research [135, 136]. Unigram language models are formulated under a bag-of-words assumption, and the ordering of words is not taken into account. A simple estimate of the probability of generating a term ti from a document dj is the maximum likelihood estimate (MLE), that is, the relative frequency of ti in dj.

MLE usually results in a probability distribution with sharp changes, which assigns zero probability to terms not appearing in the document. Smoothing is a technique applied to resolve these problems. The probabilities of low-frequency or non-occurring terms are increased, and the probabilities of frequent terms are degraded. Jelinek-Mercer smoothing combines the MLE of the whole document collection with a document's MLE, providing a smoother probability distribution. The two language models are combined by a weighted average controlled by the parameter λ. We used the same value as utilized in Townsend et al. [16], which is 0.6.

The linear combination of the MLE of a document dj with the whole background collection (all the Wikipedia articles) is given by

P(w|d) = \lambda P_{ml}(w|d) + (1 - \lambda) P_{wiki}(w)    (6.10)

With the above probability estimate for each term, we derive a simplified clarity score motivated by the score defined by Townsend et al. [16]. The relevance of a term t to the query phrase w and the original document d0, P(t|w, D), is modelled by

P(t|w, D) = \sum_{j=1}^{|D|} P(t|d_j) \, P(d_j)    (6.11)

where P(t|dj) reflects the probability of observing the term t in the document dj in the set D. The probability P(dj) is uniform for all documents in D, and is equal to 1/|D|. The clarity measure is the Kullback-Leibler (KL) divergence [137] of the retrieved set D from the whole background corpus, defined as

Clarity = \sum_{t \in D} P(t|w, D) \, \log\left(\frac{P(t|w, D)}{P_{ML}(t|Wiki)}\right)    (6.12)

KL divergence compares two different probability models. It is used as a document similarity function in various information retrieval tasks, such as text clustering and categorization [138, 139]. In a similar fashion to the CosCentrTod0 and avgCosTod0 features, the relationship of the retrieved set D to the original document d0 is investigated using the KLDocsfromd0 feature. This feature is simply the KL divergence of the language model P(t|w, D) from d0, calculated according to

KLDocsfromd0 = \sum_{t \in D} P(t|w, D) \, \log\left(\frac{P(t|w, D)}{P_{ML}(t|d_0)}\right)    (6.13)
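The sketch below combines Equations 6.10-6.13 into a single routine under simplifying assumptions: plain maximum likelihood term models, λ = 0.6 as in the text, and token-list inputs; it is illustrative rather than the implementation used in the experiments.

    import math
    from collections import Counter

    def mle(tokens):
        counts = Counter(tokens)
        total = sum(counts.values())
        return {t: c / total for t, c in counts.items()}

    def lm_features(D, d0, background, lam=0.6):
        # D: retrieved documents as token lists; d0, background: token lists
        p_bg, p_d0 = mle(background), mle(d0)
        eps = 1e-9
        # P(t | w, D): uniform mixture of the smoothed per-document models
        p_twd = Counter()
        for d in D:
            p_d = mle(d)
            for t, p in p_d.items():
                p_twd[t] += (lam * p + (1 - lam) * p_bg.get(t, eps)) / len(D)
        clarity = sum(p * math.log(p / p_bg.get(t, eps))
                      for t, p in p_twd.items())
        kl_from_d0 = sum(p * math.log(p / p_d0.get(t, eps))
                         for t, p in p_twd.items())
        return {"Clarity": clarity, "KLDocsfromd0": kl_from_d0}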

6.1.4 Learning to Classify Keyphrases

Keyphrase extraction can be considered a classification task with two classes: keyphrase or non-keyphrase. For a specific domain or genre, a supervised machine learning algorithm analyses, learns and classifies keyphrases. Previous work on keyphrase extraction suggests that different types of corpora behave differently, and thus a model should be trained for each applied domain [90, 125].

The Naive Bayes learning algorithm uses the Bayesian rule to infer the probability of class membership given the features. Using the independence assumption, the probability of keyphraseness P(keyphrase|Fw) is calculated, where Fw is the feature set of the phrase w. This probability is estimated from the training corpus using the Bayesian rule as given by

P(keyphrase|F_w) \propto P(keyphrase) \prod_{f \in F_w} P(f|keyphrase)    (6.14)

The class prior P(keyphrase) is low, since the proportion of keyphrases to non-keyphrases in a document is very low. As a result of this imbalance between the classes, the probability P(keyphrase|Fw) is low, and it is not possible to use strict thresholds for classification. For this reason, in contrast to other classification settings, thresholds are not applied to the keyphraseness scores. When the target keyphrase count is 5, the top five ranking keyphrases are returned as the output, no matter how low the probability values are. Using this method, the prior probability can be neglected in the calculations, as it is the same for each candidate w.
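The ranking step can be sketched as follows; the discretized feature bins and likelihood tables are assumed to come from training, and the prior is dropped since it is identical for every candidate.

    import math

    def keyphraseness(features, likelihoods):
        """features: {name: bin}; likelihoods: {name: {bin: P(bin|keyphrase)}}."""
        score = 0.0
        for name, value in features.items():
            # log of the unnormalized Naive Bayes score; unseen bins get a floor
            score += math.log(likelihoods[name].get(value, 1e-6))
        return score

    def select_keyphrases(candidates, likelihoods, k=5):
        """candidates: {phrase: features}; returns the k top-scoring phrases."""
        ranked = sorted(candidates,
                        key=lambda p: keyphraseness(candidates[p], likelihoods),
                        reverse=True)
        return ranked[:k]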

Kea [125] reports a higher precision when the feature values are discretized using the minimum description length (MDL) criterion [140]. The features we have introduced behave similarly, and their precision decreases when supervised discretization is not applied. Discretization is done by splitting the value ranges so as to minimize the entropy of the training population with respect to the probability of keyphraseness. For this reason, we apply MDL discretization to all the features.

6.2 Corpus and Evaluation Metrics

Keyphrases of a document should be assigned in accordance with the intention and emphasis of the text. Naturally, the author of the document is well aware of the intentions of the text. Thus, keyphrases assigned by the author(s) of the text can be considered as the ground truth in keyphrase generation. However, in keyphrase extraction this poses a problem, since not all author-assigned keyphrases appear in the documents. Some recent works use human judges to annotate the documents, using only keyphrases that appear in the documents [96, 94, 141]. In our opinion, this method is less reliable, as keyphrase assignment is a subjective task depending on the background of the human judge, and when author-assigned keyphrases are available for evaluation they should be preferred.

In the experiments, a corpus composed of 75 journal articles is used. The same corpus is used in several works on keyphrase extraction [5, 90, 125]. As reported in Turney [90], the corpus is composed of journal articles from different domains, as shown in Table 6.2. About 82% of the keyphrases appear in the articles, so 18% of the keyphrases cannot be detected by keyphrase extraction systems.

Journal Name                                                   #Articles  KeyPh./Doc.  Words/Doc.
Journal of the International Academy of Hospitality Research      6          6.2         6,299
Psycoloquy                                                        20          8.4         4,350
The Neuroscientist                                                 2          6.0         7,476
Journal of Computer-aided Molecular Design                        14          4.7         6,474
Behavioral & Brain Sciences Preprint Archive                      33          8.4        17,522
All                                                               75          7.5        10,781

Table 6.2: Corpus of journal articles and its attributes.

In order to highlight the disadvantages of systems that solely depend on in-document features, an experiment using a corpus of shorter documents is conducted. To this end, the abstracts of the articles are used. The average document length in the abstracts is 156 words. 44.8% of the keyphrases appear in the text of the abstracts, which means that 55.2% of the keyphrases cannot be extracted.

Not all keyphrases occur in the background corpus. 14.17% of the keyphrases never appear in Wikipedia, while 16.6% appear in fewer than five different Wikipedia articles. This is simply caused by the productivity of language in building phrases, especially domain specific technical terms. It should be noted that the articles in the corpus date back to 1993. An observation that hints at the rate of phrase production is that in the more recent corpus built for SemEval 2010 Task 5 [141], a higher percentage of keyphrases never appear in Wikipedia. Yet, it is possible to solve this problem by using a larger knowledge base, such as a search engine or a domain specific corpus stored in a digital library. In most practical applications of extracted keyphrases, the importance of detecting such uncommon keyphrases is low.

Unfortunately, for the Turkish language building a corpus is even more challenging, as the keyphrases used in journal articles are often not available in a background corpus such as Wikipedia. Even in our best attempt at building a corpus, only 38% of the keyphrases appear in the Turkish Wikipedia. For this reason it was not possible to observe a positive impact of the method on the keyphrase extraction task for Turkish.

Keyphrase extraction systems are usually evaluated using precision and recall, which are defined in terms of sets of phrases. For a processed source document, let the set A be the author-assigned phrases, and let A0 be the subset of A formed of phrases appearing in the document. Let E be the set of phrases extracted automatically by the system. The recall is calculated according to

Recall(A, E) = \frac{|A \cap E|}{|A0|}    (6.15)

To avoid penalizing keyphrase extraction systems for keyphrases that cannot be extracted, the recall value is calculated with respect to the set A0. The precision is calculated from

Precision(A, E) = \frac{|A \cap E|}{|E|}    (6.16)
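For completeness, Equations 6.15 and 6.16 translate directly into the following sketch, where A, A0 and E are Python sets of phrases.

    def recall(A, A0, E):
        # recall is computed with respect to A0, the extractable keyphrases
        return len(A & E) / len(A0) if A0 else 0.0

    def precision(A, E):
        return len(A & E) / len(E) if E else 0.0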

6.3 Results

The test and training data are chosen so as to be compatible with the experiments performed by Turney [90] and Frank et al. [125]. Of the journal articles, 20 are reserved for testing and the remaining 55 documents are used for training. In the corpus of abstracts, 53 documents are used for training and 19 for testing; three have been omitted, as they lack an abstract.

Tables 6.3 and 6.4 present the results of the full-text and abstract experiments, respectively. For both experiments, the precision, recall and average number of correct keyphrases per document are given when 5, 10 and 15 keyphrases are extracted for each article. The results of Kea [125]5 are also provided for comparison. In Tables 6.3 and 6.4, inDoc + QPPFeats denotes the experiments using all of the features defined in Table 6.1. The feature set inDoc denotes tf*idf and firstPosition. QPPFeats denotes all features excluding those of inDoc.

In full-text articles, the Kea algorithm is able to extract 38% of the keyphrases appearing in the articles when 15 keyphrases are extracted for each document. Journal articles concisely define the contribution of the document in early sections, and keyphrases are used more frequently in abstracts and introductory portions of the document. This is why the firstPosition feature achieves high accuracy in scientific articles.

5 Kea 3.0 version, downloaded from http://www.nzdl.org/Kea/download.html


As seen in Table 6.3, the effectiveness of QPPFeats alone is lower than that of Kea and inDoc. We have observed that QPP features tend to have similar values for domain-specific phrases, keyphrases, and sub- or superphrases of keyphrases. In fact, for a superphrase of a keyphrase, a similar set of documents is usually retrieved from the background corpus. For example, for the keyphrase "obsessive compulsive disorder", the subphrases "obsessive compulsive" and "compulsive disorder" as well as the superphrase "obsessive compulsive personality disorder" retrieve similar sets of documents from the background corpus. Since all QPP features are calculated using the retrieved documents, the feature values are almost identical to each other. In order to tackle this problem, and to improve the effectiveness of the system, we integrated QPP features with in-document features in full-text articles. As a result, the inDoc + QPPFeats system achieves the best recall values when compared to the Kea and inDoc algorithms.

The results of the experiment using abstracts, shown in Table 6.4, reveal a defect of the inDoc features in shorter documents. Whereas tf*idf and firstPosition are able to achieve high precision and recall values in the full-text experiments, their performance is poor in the abstract corpus. As indicated previously, the firstPosition feature, which depends on the structure of the document, is effective in full-text articles. However, in shorter documents the firstPosition value, which is the normalized distance from the start of the document to the word, is subject to more noise. Whereas a change in distance by a few words does not change the value of firstPosition in long documents, it changes the value significantly in short documents. tf*idf values are formed of two components, term frequency and inverse document frequency. Inverse document frequency gets larger values when the phrase occurs in fewer documents. In short texts, most phrases occur once or a couple of times. Since the frequencies of terms are similar, a high tf*idf value is assigned to a phrase that appears infrequently in the corpus. Thus, for short documents it is even possible to observe the highest tf*idf values for spelling errors and typos.

The QPP features are not extracted directly from the document, and can be calculated for any phrase, regardless of how many times it occurs in the text, if it ever does. This makes them more robust to changes in the length of the documents. For abstracts, the recall values of QPPFeats + inDoc and QPPFeats are better than those of the Kea algorithm.

On the one hand, QPPFeats + inDoc correctly identifies 53% of the author-assigned keyphrases appearing in the abstracts when 15 keyphrases per article are extracted. On the other hand, it can be observed that the in-document features degrade the effectiveness of QPPFeats + inDoc in short documents, since 55% of the keyphrases are identified when only QPPFeats are used. Both the inDoc and Kea algorithms are able to extract only a maximum of 40% of the keyphrases, which is 15% less than QPPFeats when 15 keyphrases are extracted. Their recall is about half that of the QPP features when only five keyphrases are extracted.

One important advantage of QPPFeats is that they can be calculated for phrases not appearing in the original document. In the keyphrase generation problem, in contrast to extraction, the algorithm should be able to generate phrases not appearing in the text, and should add keyphrase candidates from prior knowledge, such as a background corpus or taxonomy. Expanding the extracted candidate keyphrases is a research topic by itself, and is left as future work. However, in order to demonstrate that QPPFeats can be used in keyphrase generation, we have performed an additional experiment. In the abstracts corpus we manually added the 55% of the keyphrases that do not occur in their respective abstract to the extracted candidate phrase lists, and repeated the experiment. In this setting, when the 15 top-scoring keyphrase candidates are selected, the number of correct keyphrases generated is improved by 42.5%, and the recall value is increased to 78%, with a precision of 20%. This precision value is even higher than the result of inDoc + QPPFeats in the full-text article experiments: that is, QPPFeats can extract more keyphrases by observing only the abstracts.

When the QPP features are studied individually, it is observed that the two features CosCentrTod0 and avgCosTod0 have the greatest impact on keyphrase extraction. Our experiments suggest that using these two features provides the greatest improvement. The two features document perturbation (i.e., rank robustness) and clarity can also be used successfully to improve the results in both QPP and keyphrase extraction.

Most of the earlier work on keyphrase extraction focuses on research articles. However, there is increasing interest in applying keyphrase extraction to shorter documents, such as Twitter messages [142] and news articles [96].

Algorithms        Recall             Precision          KeyPerDoc
                  5     10    15     5     10    15     5     10    15
Kea               0.19  0.32  0.38   0.25  0.20  0.17   1.25  2.05  2.50
inDoc + QPPFeats  0.24  0.32  0.40   0.34  0.22  0.19   1.70  2.20  2.80
QPPFeats          0.11  0.21  0.28   0.16  0.15  0.13   0.80  1.45  1.95
inDoc             0.11  0.27  0.36   0.15  0.19  0.17   0.75  1.85  2.50

Table 6.3: Keyphrase Extraction full-text experiment results.

Algorithms        Recall             Precision          KeyPerDoc
                  5     10    15     5     10    15     5     10    15
Kea               0.16  0.22  0.29   0.13  0.17  0.07   0.63  0.84  1.10
inDoc + QPPFeats  0.27  0.44  0.53   0.21  0.17  0.14   1.05  1.68  2.05
QPPFeats          0.29  0.47  0.55   0.22  0.18  0.14   1.11  1.79  2.11
inDoc             0.18  0.29  0.40   0.14  0.11  0.10   0.68  1.11  1.53

Table 6.4: Keyphrase Extraction Abstracts experiment results.

In this research, the potential problems of features commonly used in keyphrase extraction were shown through experiments. Although these features are useful in full-text articles, their effectiveness drops in short documents. Features extracted from a background corpus are able to solve this problem. We have shown that while the introduced QPP features improve keyphrase extraction in full-text articles, the improvement is much more significant for shorter documents like abstracts.

Furthermore, our features are not dependent on the occurrences of the phrases in the original document, and can be calculated for phrases that never appear in the document. All in all, this work aimed to establish a link between the problems of QPP and keyphrase extraction. We believe that this work contributes to the research on finding keyphrases by removing the constraints imposed by features directly extracted from occurrences in the original document. A careful investigation of techniques for creating candidate keyphrase lists by mining related articles or semantically related phrases would enable our algorithm to generate keyphrases. The techniques used in this research may lead to more general methods that are able to operate on different genres and perform generation instead of extraction.

Chapter 7

Conclusion

Given the two word pairs “chicken-egg” and “chicken-golf ball”, perhaps anyone can immediately argue that the first pair is more related, using his or her prior knowledge of the concepts. However, in a context where the reader knows that golf balls are used as fake eggs to encourage chickens to lay in their nests, a contextual semantic relatedness emerges. In this research I have investigated whether it is possible to model globally known semantic relationships such as “chicken-egg”, and whether, with this basic knowledge about concepts, it is possible to improve the performance of algorithms in tasks such as topic segmentation, summarization and keyphrase extraction.

The semantic relatedness methods evaluated and compared in Chapter 3 utilize two different knowledge sources, namely thesauri and corpus statistics. The question of which performs better is both an important and a complicated question to answer. The experiments showed that while WordNet based semantic relatedness scores perform slightly better in the Word Association task when compared to semantic spaces, they achieve notably worse scores on the TOEFL synonymy questions. Such discrepancies in the results can be partly attributed to the fact that relationships between different categories (verbs, adjectives and adverbs) are not modelled by the classical relationships in WordNet. In the Word Association task most of the word pairs are nouns, while in the TOEFL synonymy questions there are nouns, adjectives and verbs. These contradicting results make it difficult to decide which semantic relatedness measure to use in a high level task. In such a high level task as topic segmentation, the performance of WordNet based methods is much lower than that of statistics based methods. Only the recall of Resnik's SR method, which is actually a hybrid method integrating both corpus statistics and the structure of WordNet, is higher than 50% in topic segmentation.

Although semantic relatedness is an active research area, gaining popularity as an interesting topic from both the cognitive science and linguistics perspectives, natural language processing applications utilizing it are scarce. In order to investigate how knowing common lexical semantics helps in a high level task such as topic segmentation, novel algorithms were implemented. These algorithms enabled us to compare different semantic relatedness methods defined in the literature using real discourse in the form of text documents. It became evident from our experiments that the use of corpus statistics achieves better results than WordNet based methods. It is also possible to observe the importance of Singular Value Decomposition as a tool for building semantic spaces, which reduces the running time and increases the effectiveness.

Furthermore, in the automated text summarization task a similar result is obtained, supporting the use of corpus statistics in semantic relatedness. Again, in this setting better results are observed when semantic spaces are utilized. These two applications are evidence that using lexical semantics at the global level in natural language tasks can be useful. However, on the downside, the results are usually lower when compared to methods modelling lexical semantics in the local context, such as the Bayesian topic model method of Eisenstein and Barzilay [61] in topic segmentation or the summarization methods of Ozsoy et al. [117]. Even though the results of our experiments do not rule out the existence of effective algorithms using only global semantic relatedness measures, contextual semantic relatedness, as in “chicken-golf ball”, seems crucial for achieving the best results possible.

With a slightly different approach, in the keyphrase extraction problem the keyphraseness of phrases is measured based on a simpler semantic relatedness measure extracted from a background corpus. This not only improves keyphrase extraction performance, but is also a step towards more general solutions for keyphrase generation. Moreover, this research established a link between keyphrase extraction and query performance prediction. However, the method has a problem perhaps common to many information retrieval and natural language processing tasks: it requires a very large corpus. Since all keyphrases must be present in the background corpus, the algorithm can only detect keyphrases that are present in it. While for English this is not a problem, as Wikipedia is adequate and large enough for the language, for Turkish most of the keyphrases in the corpus are missing.

7.1 Future Work

Given the encouraging results achieved in these tasks, it is tempting to use semantic relatedness in other applications where this additional lexical semantic knowledge may prove useful. Two potential tasks are machine translation and speech-to-text, both of which involve ambiguities that can benefit from lexical cohesion and semantics. In machine translation, by establishing links between the semantic spaces of two languages, words can be selected so as to preserve the semantic relatedness level of the source text in the translation. The links between semantic spaces could simply be established through language-to-language dictionaries (perhaps the inter-language links in Wiktionary). In speech-to-text, words with similar phonetic properties are determined for an acoustic signal, so a given sound may yield a long list of candidate words. From this list it may be possible to select the candidates that maximize the semantic relatedness level of the resulting text.
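
A minimal sketch of the speech-to-text idea, with a hypothetical per-position candidate list and an arbitrary word-pair relatedness function, is the following exhaustive rescorer; in practice a beam search would replace the exhaustive enumeration.

```python
from itertools import product

def rescore_candidates(candidate_lists, relatedness):
    """Pick one word per position so that the summed pairwise relatedness of
    the chosen sequence is maximal.  `candidate_lists` holds the acoustically
    confusable candidate words for each position; `relatedness` is any
    word-pair scoring function.
    """
    best, best_score = None, float("-inf")
    for sequence in product(*candidate_lists):
        score = sum(relatedness(a, b)
                    for i, a in enumerate(sequence)
                    for b in sequence[i + 1:])
        if score > best_score:
            best, best_score = sequence, score
    return list(best) if best else []
```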

Alternatives to latent semantic analysis by SVD are probabilistic LSA (PLSA) models and, more recently, Latent Dirichlet Allocation (LDA). While we have performed some initial experiments with PLSA and achieved results similar to SVD based LSA in document retrieval, probabilistic models are more intuitive and easier to modify. However, to the best of my knowledge, PLSA and LDA are primarily applied to context-by-term matrices, which may not be as effective as term-by-term matrices for semantic relatedness. It would be interesting to add comparisons with semantic relatedness methods based on PLSA and LDA.
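
As one possible starting point for such a comparison, the sketch below derives topic-based word representations from a context-by-term matrix using scikit-learn's LatentDirichletAllocation; the vectorizer and the number of topics are placeholder choices, not tuned settings.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def lda_word_vectors(raw_documents, n_topics=100):
    """Topic-based word representations: each word is described by its weight
    in every LDA topic (the columns of the topic-word matrix).
    """
    vectorizer = CountVectorizer()
    doc_term = vectorizer.fit_transform(raw_documents)       # context-by-term
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(doc_term)
    topic_word = lda.components_                              # (topics x terms)
    word_vectors = topic_word.T                               # one row per term
    vocab = vectorizer.vocabulary_                            # word -> row index
    return vocab, word_vectors
```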

We have specifically chosen Wikipedia articles as the background corpus in order to capture more comprehensive, domain-independent semantic relationships. However, some relationships are observable only when the focus is on a specific domain, and are otherwise cluttered in a corpus such as Wikipedia. Thus, the effect of domain dependency can be investigated. Furthermore, it may be possible to build a solution that stores a hierarchy of semantic spaces built according to the categories available in Wikipedia and, for a given document, chooses the most relevant semantic space.
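
A very simple selection criterion, assuming a hypothetical mapping from Wikipedia categories to the vocabularies of their semantic spaces, could be vocabulary coverage; a centroid-similarity criterion would be a natural refinement.

```python
def choose_semantic_space(document_terms, spaces):
    """Pick the category-specific semantic space whose vocabulary best covers
    the document.  `spaces` is a hypothetical dict mapping a category name to
    the set of terms its space was trained on.
    """
    doc = set(document_terms)

    def coverage(category):
        vocab = spaces[category]
        return len(doc & vocab) / max(len(doc), 1)

    return max(spaces, key=coverage)
```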

The topic segmentation algorithm in this research simply observes the local context and derives a score that highlights topic shifts. However, in this algorithm the decisions are made locally; the segmentation as a whole is never considered. This may be a factor degrading the effectiveness of the algorithm. A common alternative among the algorithms achieving the best results is to evaluate all possible topic segmentations using dynamic programming. While defining a dynamic programming solution is relatively straightforward, it is difficult to find one that limits the number of semantic relatedness calculations to an acceptable level (probably below an asymptotic complexity of O(|V|^2)); otherwise the running time of the algorithm will restrain its use in practical applications.
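
For reference, a generic dynamic programming segmenter of this kind looks as follows; segment_cost is a placeholder for any cohesion-based cost, and it is precisely the number of relatedness computations hidden inside it that must be controlled.

```python
def dp_segmentation(n_sentences, segment_cost, max_segments):
    """Exact segmentation by dynamic programming: best[k][j] is the minimum
    total cost of splitting the first j sentences into k segments, where
    segment_cost(i, j) scores the segment covering sentences i..j-1 (for
    instance, its lack of internal lexical cohesion).  The triple loop makes
    O(k * n^2) calls to segment_cost.
    """
    INF = float("inf")
    best = [[INF] * (n_sentences + 1) for _ in range(max_segments + 1)]
    back = [[0] * (n_sentences + 1) for _ in range(max_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, max_segments + 1):
        for j in range(1, n_sentences + 1):
            for i in range(k - 1, j):
                cost = best[k - 1][i] + segment_cost(i, j)
                if cost < best[k][j]:
                    best[k][j], back[k][j] = cost, i
    # recover the boundaries of the best full segmentation
    boundaries, j = [], n_sentences
    for k in range(max_segments, 0, -1):
        i = back[k][j]
        boundaries.append(i)
        j = i
    return sorted(b for b in boundaries if 0 < b < n_sentences)
```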

Our keyphrase extraction algorithm can be converted into a keyphrase generation algorithm if a candidate keyphrase list can be produced that includes related phrases. A combination of searching for candidates in related documents and evaluating these phrases with the introduced method would be interesting. Utilizing an algorithm similar to spreading activation could prove useful for growing a graph of documents and their phrases to attack the more general problem of keyphrase generation.
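
A minimal sketch of such a spreading activation step over a hypothetical document-phrase graph is given below; the decay factor and the number of iterations are placeholder choices.

```python
def spreading_activation(graph, seeds, decay=0.5, iterations=3):
    """Grow a set of candidate phrases by spreading activation over a graph
    whose nodes are documents and phrases, and whose edges connect a document
    to the phrases it contains and to related documents.  `graph` is a
    hypothetical adjacency dict: node -> list of neighbouring nodes.
    """
    activation = {node: 1.0 for node in seeds}
    for _ in range(iterations):
        updates = {}
        for node, level in activation.items():
            for neighbour in graph.get(node, []):
                updates[neighbour] = updates.get(neighbour, 0.0) + decay * level
        for node, extra in updates.items():
            activation[node] = activation.get(node, 0.0) + extra
    return activation   # higher activation = stronger candidate phrase or document
```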

It might also be interesting to extend the SVD based semantic relatedness methods, which work only for words and not phrases, with an algorithm similar to that of Baroni and Zamparelli [121]. In their work, the semantic relatedness of noun phrases is defined by operations performed on the semantic space vectors of their components (words). This approach is especially interesting because a successful algorithm does not have to collect or observe co-occurrences of the phrase itself, but can model its semantic relatedness by observing its words individually. While such an extension is difficult, it promises to cover all possible phrases that can be constructed in the language.
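
The simple additive and multiplicative composition operators of Mitchell and Lapata [122] can stand in for the matrix-based composition of Baroni and Zamparelli [121] in a first sketch; only the word vectors of the phrase's components are required.

```python
import numpy as np

def phrase_vector(words, vectors, mode="add"):
    """Compose a phrase vector from the semantic-space vectors of its words
    (additive or element-wise multiplicative composition); no co-occurrence
    counts for the phrase itself are needed.
    """
    parts = [vectors[w] for w in words if w in vectors]
    if not parts:
        return None
    if mode == "add":
        return np.sum(parts, axis=0)
    if mode == "multiply":
        out = np.ones_like(parts[0])
        for p in parts:
            out = out * p
        return out
    raise ValueError("unknown composition mode: %s" % mode)
```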

Bibliography

[1] A. M. Turing, “Computing Machinery and Intelligence,” Mind, vol. 59, no. 236, pp. 433–460, 1950.

[2] M. A. K. Halliday and R. Hasan, Cohesion in English. Longman, 1976.

[3] G. Ercan, “Automated Text Summarization and Keyphrase Extraction,” Master’s thesis, 2006.

[4] G. Ercan and I. Cicekli, “Lexical Cohesion Based Topic Modeling for Summarization,” in Proceedings of Computational Linguistics and Intelligent Text Processing, pp. 582–592, 2008.

[5] G. Ercan and I. Cicekli, “Using lexical chains for keyword extraction,” Inf. Process. Manage., vol. 43, no. 6, pp. 1705–1714, 2007.

[6] P. D. Turney and P. Pantel, “From frequency to meaning: Vector space models of semantics,” Journal of Artificial Intelligence Research (JAIR), vol. 37, pp. 141–188, 2010.

[7] R. Rapp, “Discovering the Senses of an Ambiguous Word by Clustering its Local Contexts,” in Proceedings of the 28th Annual Conference of the Gesellschaft für Klassifikation, pp. 521–528, 2004.

[8] T. K. Landauer and S. T. Dumais, “A solution to Plato’s problem: The latent semantic analysis theory of the acquisition, induction, and representation of knowledge,” Psychological Review, vol. 104, no. 2, pp. 211–240, 1997.

[9] R. Barzilay and M. Elhadad, “Using Lexical Chains for Text Summarization,” in Proceedings of the ACL Workshop on Intelligent Scalable Text Summarization, pp. 10–17, 1997.

[10] N. Stokes, J. Carthy, and A. F. Smeaton, “SeLeCT: a lexical cohesion based news story segmentation system,” Artificial Intelligence Communications, vol. 17, no. 1, pp. 3–12, 2004.

[11] M. Galley, K. McKeown, E. Fosler-Lussier, and H. Jing, “Discourse Segmentation of Multi-Party Conversation,” in ACL (E. W. Hinrichs and D. Roth, eds.), pp. 562–569, ACL, 2003.

[12] G. Salton, J. Allan, and A. Singhal, “Automatic Text Decomposition and Structuring.,” Inf. Process. Manage., vol. 32, no. 2, pp. 127–138, 1996.

[13] S. Jones and M. S. Staveley, “Phrasier: A System for Interactive Document Retrieval Using Keyphrases,” in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), (New York), pp. 160–167, ACM, 1999.

[14] E. Hovy, “Text Summarization,” in The Oxford Handbook of Computational Linguistics (R. Mitkov, ed.), Oxford Handbooks in Linguistics, pp. 583–598, Oxford: Oxford University Press, 2003.

[15] C. Gutwin, G. Paynter, I. Witten, C. Nevill-Manning, and E. Frank, “Improving browsing in digital libraries with keyphrase indexes,” Journal of Decision Support Systems, vol. 27, pp. 81–104, 1999.

[16] S. Cronen-Townsend, Y. Zhou, and W. B. Croft, “Predicting query performance,” in Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), (New York), pp. 299–306, ACM, 2002.

[17] G. Amati, C. Carpineto, and G. Romano, “Query Difficulty, Robustness, and Selective Application of Query Expansion,” in Advances in Information Retrieval, vol. 2997, pp. 127–137, Berlin: Springer, 2004.

[18] E. Yom-Tov, D. Carmel, A. Darlow, D. Pelleg, S. Errera-Yaakov, and S. Fine, “Juru at TREC 2005: Query Prediction in the Terabyte and the Robust Tracks,” in Proceedings of the 2005 Text Retrieval Conference (TREC) (E. M. Voorhees and L. P. Buckland, eds.), vol. Special Publication 500-266, NIST, 2005.

[19] A. Budanitsky and G. Hirst, “Evaluating WordNet-based Measures of Lexical Semantic Relatedness,” Computational Linguistics, vol. 32, no. 1, pp. 13–47, 2006.

[20] F. D. Saussure, Cours de linguistique générale. Paris: Payot, 1916.

[21] D. Cruse, Lexical semantics. Cambridge University Press, 1986.

[22] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller, “Introduction to WordNet: an on-line lexical database,” International Journal of Lexicography, vol. 3, no. 4, pp. 235–244, 1990.

[23] L. Wittgenstein, Philosophical Investigations. Blackwell, 1953. Translated by G.E.M. Anscombe.

[24] Z. Harris, Methods in Structural Linguistics. Chicago: University of Chicago Press, 1951.

[25] J. R. Firth, “A synopsis of linguistic theory 1930-55,” in Studies in Linguistic Analysis, vol. 1952-59, pp. 1–32, 1957.

[26] C. Osgood, The measurement of meaning. University of Illinois Press, 1975.

[27] N. Chomsky, Syntactic Structures. The Hague: Mouton, 1957.

[28] M. Marcus, G. Kim, A. Marcinkiewicz, R. Macintyre, A. Bies, M. Ferguson, K. Katz, and B. Schasberger, “The Penn Treebank: Annotating predicate argument structure,” in ARPA Human Language Technology Workshop, pp. 114–119, 1994.

[29] A. Goksel and C. Kerslake, Turkish: A Comprehensive Grammar. Taylor & Francis Group, 2005.

[30] G. Lewis, Turkish Grammar. 2 ed., 2001.

[31] O. Istek, “A Link Grammar For Turkish,” Master’s thesis, 2006.

[32] M. Carstairs, An Introduction to English Morphology. Edinburgh University Press, 2002.

[33] D. Jurafsky and J. Martin, Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition. Prentice Hall series in artificial intelligence, Prentice Hall, 2000.

[34] M. F. Porter, “An Algorithm for Suffix Stripping,” Program, vol. 14, no. 3, pp. 130–137, 1980.

[35] H. Rubenstein and J. B. Goodenough, “Contextual correlates of synonymy,” Commun. ACM, vol. 8, no. 10, pp. 627–633, 1965.

[36] G. Salton, A. Wong, and C. T. Yu, “Automatic indexing using term discrimination and term precision measurements,” Inf. Process. Manage., vol. 12, no. 1, pp. 43–51, 1976.

[37] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman, “Indexing by Latent Semantic Analysis,” Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391–407, 1990.

[38] K. Lund and C. Burgess, “Producing high-dimensional semantic spaces from lexical co-occurrence,” Behavior Research Methods, Instrumentation, and Computers, vol. 28, pp. 203–208, 1996.

[39] J. Bullinaria and J. Levy, “Extracting semantic representations from word co-occurrence statistics: A computational study,” Behavior Research Methods, vol. 39, no. 3, pp. 510–526, 2007.

[40] E. Terra and C. L. A. Clarke, “Frequency estimates for statistical word similarity measures,” in NAACL ’03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, (Morristown, NJ, USA), pp. 165–172, Association for Computational Linguistics, 2003.

[41] J. Bullinaria and J. Levy, “Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and SVD,” Behavior Research Methods, pp. 1–18, 2012.

[42] R. L. Cilibrasi and P. M. Vitanyi, “The Google Similarity Distance,” IEEE Transactions on Knowledge and Data Engineering, vol. 19, pp. 370–383, 2007.

[43] S. Padó and M. Lapata, “Dependency-Based Construction of Semantic Space Models,” Computational Linguistics, vol. 33, no. 2, pp. 161–199, 2007.

[44] H. Kozima and T. Furugori, “Similarity between words computed by spreading activation on an English dictionary,” in Proceedings of the sixth conference on European chapter of the Association for Computational Linguistics, (Morristown, NJ, USA), pp. 232–239, Association for Computational Linguistics, 1993.

[45] T. Pedersen, S. Patwardhan, and J. Michelizzi, “WordNet::Similarity: Measuring the Relatedness of Concepts,” in Proceedings of the Demonstration Papers at HLT-NAACL, (Boston), 2004.

[46] J. Morris and G. Hirst, “Lexical cohesion computed by thesaural relations as an indicator of the structure of text,” Comput. Linguist., vol. 17, no. 1, pp. 21–48, 1991.

[47] M. Jarmasz and S. Szpakowicz, “Roget’s thesaurus and semantic similarity,” in Conference on Recent Advances in Natural Language Processing, pp. 212–219, 2003.

[48] L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank Citation Ranking: Bringing Order to the Web.,” Technical Report 1999-66, Stanford InfoLab, November 1999. Previous number = SIDL-WP-1999-0120.

[49] M. Strube and S. Ponzetto, “WikiRelate! Computing semantic relatedness using Wikipedia,” in Proceedings of the National Conference on Artificial Intelligence (AAAI), vol. 21, p. 1419, AAAI Press, 2006.

[50] E. Gabrilovich and S. Markovitch, “Computing semantic relatedness using Wikipedia-based explicit semantic analysis,” in Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI’07), (San Francisco, CA, USA), pp. 1606–1611, Morgan Kaufmann Publishers Inc., 2007.

[51] D. Milne and I. H. Witten, “An effective, low-cost measure of semantic relatedness obtained from Wikipedia links,” in AAAI 2008, 2008.

[52] R. Mihalcea and A. Csomai, “Wikify!: linking documents to encyclopedic knowledge,” in CIKM ’07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, (New York, NY, USA), pp. 233–242, ACM, 2007.

[53] G. Youmans, “A new tool for discourse analysis: The vocabulary management profile,” Language, vol. 67, no. 4, pp. 763–789, 1991.

[54] M. A. Hearst, “TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages,” Computational Linguistics, vol. 23, pp. 33–64, March 1997.

[55] G. Salton, J. Allan, and A. Singhal, “Automatic Text Decomposition and Structuring,” Information Processing and Management Journal, vol. 32, no. 2, pp. 127–138, 1996.

[56] F. Choi, “Advances in domain independent linear text segmentation,” in Proceedings of the First Conference on North American Chapter of the Association for Computational Linguistics, pp. 26–33, 2000.

[57] X. Ji and H. Zha, “Domain-independent text segmentation using anisotropic diffusion and dynamic programming,” in SIGIR ’03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 322–329, 2003.

[58] M. Utiyama and H. Isahara, “A statistical model for domain-independent text segmentation,” in ACL ’01: Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, pp. 499–506, 2001.

[59] I. Malioutov and R. Barzilay, “Minimum Cut Model for Spoken Lecture Segmentation.,” in ACL (N. Calzolari, C. Cardie, and P. Isabelle, eds.), The Association for Computer Linguistics, 2006.

[60] A. C. Jobbins and L. J. Evett, “Text segmentation using reiteration and collocation,” in Proceedings of the 17th international conference on Computational linguistics, pp. 614–618, 1998.

[61] J. Eisenstein and R. Barzilay, “Bayesian unsupervised topic segmentation,” in EMNLP ’08: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 334–343, 2008.

[62] T. Brants, F. Chen, and I. Tsochantaridis, “Topic-based document segmentation with probabilistic latent semantic analysis,” in CIKM ’02: Proceedings of the eleventh international conference on Information and knowledge management, (New York, NY, USA), pp. 211–218, ACM, 2002.

[63] T. Hofmann, “Probabilistic Latent Semantic Indexing,” in SIGIR ’99: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 50–57, 1999.

[64] H. Misra, F. Yvon, O. Cappé, and J. M. Jose, “Text segmentation: A topic modeling perspective,” Inf. Process. Manage., vol. 47, no. 4, pp. 528–544, 2011.

[65] F. Choi, P. Wiemer-Hastings, and J. Moore, “Latent Semantic Analysis for Text Segmentation,” in Proceedings of 6th EMNLP, pp. 109–117, 2001.

[66] Y. Bestgen, “Improving Text Segmentation Using Latent Semantic Analysis: A Reanalysis of Choi, Wiemer-Hastings, and Moore (2001),” Computational Linguistics, vol. 32, no. 1, pp. 5–12, 2006.

[67] D. Beeferman, A. Berger, and J. Lafferty, “Statistical Models for Text Segmentation,” Machine Learning, vol. 34, no. 1, pp. 177–210, 1999.

[68] L. Pevzner and M. A. Hearst, “A critique and improvement of an evaluation metric for text segmentation,” Computational Linguistics, vol. 28, no. 1, pp. 19–36, 2002.

[69] R. Brandow, K. Mitze, and L. F. Rau, “Automatic condensation of electronic publications by sentence selection,” Inf. Process. Manage., vol. 31, no. 5, pp. 675–685, 1995.

[70] H. P. Edmundson, “New Methods in Automatic Abstracting,” Journal of the Association for Computing Machinery, vol. 16, no. 2, pp. 264–285, 1969.

[71] J. Kupiec, J. Pedersen, and F. Chen, “A trainable document summarizer,” in Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, (New York, NY, USA), pp. 68–73, ACM Press, 1995.

[72] S. Teufel and M. Moens, “Summarizing Scientific Articles: Experiments with Relevance and Rhetorical Status,” Computational Linguistics, vol. 28, no. 4, pp. 409–445, 2002.

[73] C.-Y. Lin and E. H. Hovy, “Identifying Topics by Position,” in Proceedings of 5th Conference on Applied Natural Language Processing, (Washington D.C.), March 1997.

[74] H. P. Luhn, “The automatic creation of literature abstracts,” IBM Journal of Research and Development, vol. 2, pp. 159–165, 1958.

[75] D. Marcu, “Improving summarization through rhetorical parsing tuning,” in Proceedings of The Sixth Workshop on Very Large Corpora, (Montreal, Canada), pp. 206–215, August 1998.

[76] D. R. Radev, H. Jing, and M. Budzikowska, “Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies,” in NAACL/ANLP Workshop on Automatic Summarization, (Seattle, WA), 2000.

[77] G. Erkan and D. R. Radev, “LexRank: Graph-based Lexical Centrality as Salience in Text Summarization,” Journal of Artificial Intelligence Research, vol. 22, pp. 457–479, 2004.

[78] D. R. Radev, T. Allison, S. Blair-Goldensohn, J. Blitzer, A. Çelebi, S. Dimitrov, E. Drábek, A. Hakim, W. Lam, D. Liu, J. Otterbacher, H. Qi, H. Saggion, S. Teufel, M. Topper, A. Winkel, and Z. Zhang, “MEAD - A Platform for Multidocument Multilingual Text Summarization,” in LREC, European Language Resources Association, 2004.

[79] G. Hirst and D. St-Onge, “Lexical chains as representations of context for the detection and correction of malapropisms,” in WordNet: An Electronic Lexical Database (C. Fellbaum, ed.), pp. 305–332, MIT Press, 1998.

[80] H. G. Silber and K. F. McCoy, “An Efficient Text Summarizer using Lexical Chains.,” in INLG (M. Elhadad, ed.), pp. 268–271, The Association for Computer Linguistics, 2000.

[81] M. Brunn, Y. Chali, and B. Dufour, “The University of Lethbridge Text Summarizer at DUC 2002,” in Proceedings of the Document Understanding Conference (DUC), 2002.

[82] Y. Chali and S. Dubien, “University of Lethbridge’s Participation in TREC 2004 QA Track,” in TREC (E. M. Voorhees and L. P. Buckland, eds.), vol. Special Publication 500-261, National Institute of Standards and Technology (NIST), 2004.

[83] L. F. Rau, P. S. Jacobs, and U. Zernik, “Information extraction and text summarization using linguistic knowledge acquisition,” Inf. Process. Manage., vol. 25, no. 4, pp. 419–428, 1989.

[84] K. Knight and D. Marcu, “Statistics-Based Summarization—Step One: Sentence Compression,” in Proceedings of the 17th National Conference on Artificial Intelligence, Cambridge, MA: MIT Press, 2000.

[85] I. Mani, B. Gates, and E. Bloedorn, “Improving Summaries by Revising Them.,” in ACL (R. Dale and K. W. Church, eds.), ACL, 1999.

[86] J. G. Carbonell and J. Goldstein, “The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries,” in SIGIR, pp. 335–336, ACM, 1998.

[87] R. Barzilay and K. R. McKeown, “Sentence Fusion for Multidocument News Summarization,” Computational Linguistics, vol. 31, pp. 297–328, September 2005.

[88] C.-Y. Lin, “ROUGE: A Package for Automatic Evaluation of summaries,” in Proc. ACL workshop on Text Summarization Branches Out, p. 10, 2004.

[89] I. Mani, Automatic Summarization. John Benjamins Publishing Co., 2001.

[90] P. D. Turney, “Learning Algorithms for Keyphrase Extraction,” Information Retrieval, vol. 2, no. 4, pp. 303–336, 2000.

[91] A. Hulth, “Improved automatic keyword extraction given more linguistic knowledge,” in Proceedings of the 2003 conference on Empirical methods in Natural Language Processing (EMNLP), vol. 10, (Morristown), pp. 216– 223, ACL, 2003.

[92] K. Barker and N. Cornacchia, “Using Noun Phrase Heads to Extract Document Keyphrases,” in Proceedings of the 13th Biennial Conference of the Canadian Society on Computational Studies of Intelligence: Advances in Artificial Intelligence (H. J. Hamilton, ed.), vol. 1822 of Lecture Notes in Computer Science, (Berlin), pp. 40–52, Springer, 2000.

[93] P. D. Turney, “Coherent keyphrase extraction via web mining,” in International Joint Conferences on Artificial Intelligence (IJCAI), vol. 18, pp. 434–442, Lawrence Erlbaum Associates Ltd, 2003.

[94] T. D. Nguyen and M.-Y. Kan, “Keyphrase Extraction for Scientific Publications,” in Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers, vol. 4822 of Lecture Notes in Computer Science, (Berlin), pp. 317–326, Springer, 2007.

[95] R. Mihalcea and P. Tarau, “TextRank: Bringing Order into Texts,” in Proceedings of 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP), (Morristown), pp. 404–411, ACL, July 2004.

[96] X. Wan and J. Xiao, “Exploiting neighborhood knowledge for single document summarization and keyphrase extraction,” ACM Transactions on Information Systems, vol. 28, no. 2, 2010.

[97] M. Sussna, “Word sense disambiguation for free-text indexing using a massive semantic network,” in Proc. of 2nd International Conference on Information and Knowledge Management, (Arlington, Virginia), 1993.

[98] Z. Wu and M. Palmer, “Verb semantic and lexical selection,” in Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL-1994), (Las Cruces (Mexico)), pp. 133–138, 1994.

[99] C. Leacock and M. Chodorow, “Combining local context and WordNet similarity for word sense identification,” in WordNet: An Electronic Lexical Database (C. Fellbaum, ed.), pp. 265–283, Cambridge, Massachusetts: MIT Press, 1998.

[100] P. Resnik, “Using Information Content to Evaluate Semantic Similarity in a Taxonomy.,” in IJCAI, pp. 448–453, Morgan Kaufmann, 1995.

[101] G. Hirst, Semantic Interpretation and the Resolution of Ambiguity. Studies in natural language processing, Cambridge University Press, 1992.

[102] O. Bilgin, Ö. Çetinoğlu, and K. Oflazer, “Building a Wordnet for Turkish,” Romanian Journal of Information Science and Technology, vol. 7, no. 1-2, pp. 163–172, 2004.

[103] W. N. Francis and H. Kucera, “Brown Corpus Manual,” tech. rep., Department of Linguistics, Brown University, Providence, Rhode Island, US, 1979.

[104] J. J. Jiang and D. W. Conrath, “Semantic similarity based on corpus statistics and lexical taxonomy,” CoRR, vol. cmp-lg/9709008, 1997.

[105] G. Eryiğit and E. Adalı, “An affix stripping morphological analyzer for Turkish,” in Proceedings of the IASTED International Conference Artificial Intelligence And Applications, pp. 299–304, 2004.

[106] B. N. Datta, Numerical Linear Algebra and Applications (2nd ed.). SIAM, 2010.

[107] M. Brand, “Fast Low-Rank Modifications of the Thin Singular Value Decomposition,” Linear Algebra and Its Applications, vol. 415, no. 1, pp. 20–30, 2006.

[108] G. Gorrell, “Generalized Hebbian Algorithm for Incremental Singular Value Decomposition in Natural Language Processing.,” in EACL (D. McCarthy and S. Wintner, eds.), The Association for Computer Linguistics, 2006.

[109] J. Steiger, “Tests for comparing elements of a correlation matrix,” Psychological Bulletin, vol. 87, no. 2, p. 245, 1980.

[110] P. Chen and P. Popovich, Correlation: Parametric and nonparametric measures. No. 137-139, Sage Publications, Incorporated, 2002.

[111] G. Pirrò and N. Seco, “Design, Implementation and Evaluation of a New Semantic Similarity Metric Combining Features and Intrinsic Information Content,” in OTM Conferences (2) (R. Meersman and Z. Tari, eds.), vol. 5332 of Lecture Notes in Computer Science, pp. 1271–1288, Springer, 2008.

[112] P. D. Turney, “Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL,” in Proceedings of the Twelfth European Conference on Machine Learning, (Freiburg, Germany), pp. 491–502, 2001.

[113] S. T. Dumais, “Improving the retrieval of information from external sources,” Behavior Research Methods, Instruments, & Computers, vol. 23, no. 2, pp. 229–236, 1991.

[114] G. Aston and L. Burnard, The BNC handbook. Exploring the British National Corpus with SARA. Edinburgh University Press, 1998.

[115] F. Can, S. Kocberber, O. Baglioglu, S. Kardas, H. Ocalan, and E. Uyar, “New event detection and topic tracking in Turkish,” Journal of the American Society for Information Science and Technology, vol. 61, no. 4, pp. 802–819, 2010.

[116] D. R. Radev, H. Jing, M. Sty, and D. Tam, “Centroid-based summarization of multiple documents.,” Information Processing and Management, vol. 40, no. 6, pp. 919–938, 2004.

[117] M. G. Ozsoy, F. N. Alpaslan, and I. Cicekli, “Text summarization using Latent Semantic Analysis,” J. Information Science, vol. 37, no. 4, pp. 405–417, 2011.

[118] Y. Gong and X. Liu, “Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis,” in SIGIR (W. B. Croft, D. J. Harper, D. H. Kraft, and J. Zobel, eds.), pp. 19–25, ACM, 2001.

[119] J. Steinberger and K. Jezek, “Text Summarization and Singular Value Decomposition,” in ADVIS (T. M. Yakhno, ed.), vol. 3261 of Lecture Notes in Computer Science, pp. 245–254, Springer, 2004.

[120] G. Murray, S. Renals, and J. Carletta, “Extractive summarization of meeting recordings,” in INTERSPEECH, pp. 593–596, ISCA, 2005.

[121] M. Baroni and R. Zamparelli, “Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space,” in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 1183–1193, Association for Computational Linguistics, 2010.

[122] J. Mitchell and M. Lapata, “Vector-based models of semantic composition,” in Proceedings of ACL-08: HLT, pp. 236–244, 2008.

[123] S. Clark and S. Pulman, “Combining symbolic and distributional models of meaning,” in Proceedings of the AAAI Spring Symposium on Quantum Interaction, pp. 52–55, 2007.

[124] D. Widdows, “Semantic vector products: Some initial investigations,” in Second AAAI Symposium on Quantum Interaction, 2008.

[125] E. Frank, G. W. Paynter, I. H. Witten, C. Gutwin, and C. G. Nevill-Manning, “Domain-Specific Keyphrase Extraction,” in Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI), (San Francisco), pp. 668–673, Morgan Kaufmann Publishers Inc., 1999.

[126] K. Toutanova, D. Klein, C. Manning, and Y. Singer, “Feature-rich part-of-speech tagging with a cyclic dependency network,” in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pp. 173–180, Association for Computational Linguistics, 2003.

[127] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. New York: Cambridge University Press, 2008.

[128] V. Vinay, I. J. Cox, N. Milic-Frayling, and K. R. Wood, “On ranking the effectiveness of searches,” in Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) (E. N. Efthimiadis, S. T. Dumais, D. Hawking, and K. Järvelin, eds.), (New York), pp. 398–404, ACM, 2006.

[129] B. He and I. Ounis, “Query performance prediction,” Information Systems, vol. 31, no. 7, pp. 585–594, 2006.

[130] Y. Zhou and W. B. Croft, “Ranking robustness: a novel framework to predict query performance.,” in Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM) (P. S. Yu, V. J. Tsotras, E. A. Fox, and B. Liu, eds.), (New York), pp. 567–574, ACM, 2006.

[131] K.-L. Kwok, L. Grunfeld, N. Dinstl, and P. Deng, “TREC 2005 Robust Track Experiments Using PIRCS,” in Proceedings of the 2005 Text Retrieval Conference (TREC) (E. M. Voorhees and L. P. Buckland, eds.), vol. Special Publication 500-266, NIST, 2005.

[132] E.-H. Han and G. Karypis, “Centroid-Based Document Classification: Analysis and Experimental Results,” in Principles and Practice of Knowledge Discovery in Databases (PKDD) (D. A. Zighed, H. J. Komorowski, and J. M. Zytkow, eds.), vol. 1910 of Lecture Notes in Computer Science, (Berlin), pp. 424–431, Springer, 2000.

[133] B. Hopkins and J. Skellam, “A new method for determining the type of distribution of plant individuals,” Annals of Botany, vol. 18, no. 2, p. 213, 1954.

[134] T. Cox and T. Lewis, “A conditioned distance ratio method for analyzing spatial patterns,” Biometrika, vol. 63, no. 3, p. 483, 1976.

[135] J. M. Ponte and W. B. Croft, “A language modeling approach to information retrieval,” in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), (New York), pp. 275–281, ACM, 1998.

[136] V. Lavrenko, J. Allan, E. DeGuzman, D. LaFlamme, V. Pollard, and S. Thomas, “Relevance models for topic detection and tracking,” in Proceedings of the Second International Conference on Human Language Technology Research (HLT), (San Francisco), pp. 115–121, Morgan Kaufmann Publishers Inc., 2002.

[137] S. Kullback and R. A. Leibler, “On information and sufficiency,” The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951.

[138] B. Bigi, “Using Kullback-Leibler Distance for Text Categorization,” in Proceedings of European Conference on Information Retrieval (ECIR) (F. Sebastiani, ed.), vol. 2633 of Lecture Notes in Computer Science, (Berlin), pp. 305–319, Springer, 2003.

[139] D. Pinto, J.-M. Benedi, and P. Rosso, “Clustering Narrow-Domain Short Texts by Using the Kullback-Leibler Distance,” in Proceedings of the Conference on Computational Linguistics and Intelligent Text Processing (CICLing) (A. F. Gelbukh, ed.), vol. 4394 of Lecture Notes in Computer Science, (Berlin), pp. 611–622, Springer, 2007.

[140] U. M. Fayyad and K. B. Irani, “Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning,” in International Joint Conferences on Artificial Intelligence (IJCAI), (San Francisco), pp. 1022–1029, Morgan Kaufmann Publishers Inc., 1993.

[141] S. Kim, O. Medelyan, M. Kan, and T. Baldwin, “Semeval-2010 task 5: Automatic keyphrase extraction from scientific articles,” in Proceedings of the 5th International Workshop on Semantic Evaluation, (Morristown), pp. 21–26, Association for Computational Linguistics, 2010.

[142] X. Zhao, J. Jiang, J. He, Y. Song, P. Achananuparp, E. Lim, and X. Li, “Topical keyphrase extraction from Twitter,” Research Collection School of Information Systems, 2011.
