Lexical Cohesion Analysis for Topic Segmentation, Summarization and Keyphrase Extraction
Total Page:16
File Type:pdf, Size:1020Kb
LEXICAL COHESION ANALYSIS FOR TOPIC SEGMENTATION, SUMMARIZATION AND KEYPHRASE EXTRACTION a dissertation submitted to the department of computer engineering and the graduate school of engineering and science of bilkent university in partial fulfillment of the requirements for the degree of doctor of philosophy By G¨onen¸cErcan December, 2012 I certify that I have read this thesis and that in my opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy. Prof. Dr. Fazlı Can(Supervisor) I certify that I have read this thesis and that in my opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy. Prof. Dr. Ilyas_ C¸i¸cekli(Co-supervisor) I certify that I have read this thesis and that in my opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy. Prof. Dr. Ozg¨urUlusoy¨ I certify that I have read this thesis and that in my opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy. Prof. Dr. Ferda Nur Alpaslan ii I certify that I have read this thesis and that in my opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy. Assoc. Prof. Dr. Hakan Ferhatosmano~glu I certify that I have read this thesis and that in my opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy. Assist. Prof. Dr. Seyit Ko¸cberber Approved for the Graduate School of Engineering and Science: Prof. Dr. Levent Onural Director of the Graduate School iii ABSTRACT LEXICAL COHESION ANALYSIS FOR TOPIC SEGMENTATION, SUMMARIZATION AND KEYPHRASE EXTRACTION G¨onen¸cErcan PhD. in Computer Engineering Supervisor: Prof. Dr. Fazlı Can December, 2012 When we express some idea or story, it is inevitable to use words that are seman- tically related to each other. When this phenomena is exploited from the aspect of words in the language, it is possible to infer the level of semantic relationship between words by observing their distribution and use in discourse. From the aspect of discourse it is possible to model the structure of the document by ob- serving the changes in the lexical cohesion in order to attack high level natural language processing tasks. In this research lexical cohesion is investigated from both of these aspects by first building methods for measuring semantic relatedness of word pairs and then using these methods in the tasks of topic segmentation, summarization and keyphrase extraction. Measuring semantic relatedness of words requires prior knowledge about the words. Two different knowledge-bases are investigated in this research. The first knowledge base is a manually built network of semantic relationships, while the second relies on the distributional patterns in raw text corpora. In order to discover which method is effective in lexical cohesion analysis, a comprehensive comparison of state-of-the art methods in semantic relatedness is made. For topic segmentation different methods using some form of lexical cohesion are present in the literature. While some of these confine the relationships only to word repetition or strong semantic relationships like synonymy, no other work uses the semantic relatedness measures that can be calculated for any two word pairs in the vocabulary. Our experiments suggest that topic segmentation perfor- mance improves methods using both classical relationships and word repetition. Furthermore, the experiments compare the performance of different semantic re- latedness methods in a high level task. The detected topic segments are used in iv v summarization, and achieves better results compared to a lexical chains based method that uses WordNet. Finally, the use of lexical cohesion analysis in keyphrase extraction is inves- tigated. Previous research shows that keyphrases are useful tools in document retrieval and navigation. While these point to a relation between keyphrases and document retrieval performance, no other work uses this relationship to identify keyphrases of a given document. We aim to establish a link between the problems of query performance prediction (QPP) and keyphrase extraction. To this end, features used in QPP are evaluated in keyphrase extraction using a Naive Bayes classifier. Our experiments indicate that these features improve the effective- ness of keyphrase extraction in documents of different length. More importantly, commonly used features of frequency and first position in text perform poorly on shorter documents, whereas QPP features are more robust and achieve better results. Keywords: Lexical Cohesion, Semantic Relatedness, Topic Segmentation, Sum- marization, Keyphrase Extraction. OZET¨ KONU BOL¨ UMLEME,¨ OZETLEME¨ VE ANAHTAR KELIME_ C¸IKARMA IC¸_ IN_ KELIME_ BUT¨ UNL¨ UG¨ U¨ ANALIZ_ I_ G¨onen¸cErcan Bilgisayar M¨uhendisli˘gi,Doktora Tez Y¨oneticisi:Prof. Dr. Fazlı Can Aralık, 2012 Insanlar_ bir fikri veya hikayeyi anlatırken birbiriyle anlam olarak ili¸skilikelimeleri kullanmaktan ka¸camazlar. Bu fenomenden iki farklı bakı¸sa¸cısıylafaydalanmak m¨umk¨und¨ur. Kelimeler a¸cısındanbakıldı˜gında,anlam olarak ili¸skili kelimelerin istatistiksel da~gılımıve anlatımda kullanımlarına bakarak anlam olarak ili¸skili kelimeleri tanımlamak m¨umk¨unolabilir. Anlam b¨ut¨unl¨u~g¨uneanlatım a¸cısından baktı˜gımızdada kelimelerin anlam ili¸skilerindeki de~gi¸simebakarak bir metnin yapısını modellemek ve bu modeli farklı do~galdil i¸slemeproblemlerinde kullan- mak m¨umk¨und¨ur.Bu ara¸stırmadaanlam b¨ut¨unl¨u~g¨u,bu iki a¸cıdanda incelenmek- tedir. Once¨ kelimeler arası anlam ili¸sikli~ginin¨ol¸c¨ulmesii¸cin anlam b¨ut¨unl¨u~g¨ukul- lanılmı¸sdaha sonra bu kelime ili¸skilerikonu b¨ol¨umleme,¨ozet¸cıkarma ve anahtar kelime ¸cıkarma problemlerinde kullanılmı¸stır. Kelimelerin anlam ili¸sikli~ginin¨ol¸c¨ulmesii¸cinbir bilgi da~garcı˜gıgerekmektedir. Ara¸stırmakapsamında iki farklı bilgi da~garcı˜gındanfaydalanılmaya ¸calı¸sılmı¸stır. Birinci kelime da~garcı˜gıkelime ili¸skilerininelle girildi~gibir anlam a~gıdır. Ikinci_ y¨ontem ise kelimelerin d¨uzmetin derlemindeki kullanım da~gılımlarınıkullanmak- tadır. Ara¸stırmakapsamında bu y¨ontemlerin birbirine g¨oreba¸sarımı¨ol¸c¨ulmekte ve kapsamlı bir analiz yapılmaktadır. Konu b¨ol¨umlemeprobleminde kelime b¨ut¨unl¨u~g¨ukullanan farklı y¨ontemler lit- erat¨urdekullanılmaktadır. Bunların bazıları sadece kelime tekrarlarından fay- dalanırken, bazıları da e¸sanlam gibi g¨u¸cl¨uanlamsal ili¸skilerdenfaydalanmak- tadır. Fakat ¸suana kadar ¸cokdaha kapsamlı olan kelime ili¸sikli~giy¨ontemleri bu problemde kullanılmamı¸stır. Yapılan deneyler g¨ostermektedir ki konu b¨ol¨umlemeprobleminin ba¸sarımıkelime ili¸sikli~gi kullanılarak arttırılabilmektedir. vi vii Ayrıca deneyler farklı kelime ili¸sikli~gi¨ol¸c¨umy¨ontemlerini kar¸sıla¸stırmaki¸cinkul- lanılabilmektedir. Konulara g¨ore b¨ol¨umlenmi¸smetinler otomatik ¨ozet¸cıkarma probleminde kullanılmı¸sve kelime zinciri tabanlı y¨ontemlere g¨oredaha ba¸sarılı sonu¸clarelde etmi¸stir. Son olarak kelime b¨ut¨unl¨u~g¨u analizi anahtar kelime bulma probleminde ara¸stırılmaktadır. Ge¸cmi¸s ara¸stırmalar anahtar kelimelerin belge getirme ve navigasyon i¸cin ba¸sarılı ara¸clar oldu~gunu g¨ostermektedir. Her ne kadar bu ara¸stırmalaranahtar kelime ve belge getirme arasında bir ili¸skioldu~gunu g¨osterse de, ba¸ska bir ¸calı¸smada anahtar kelimeleri bulmak i¸cinonların belge getirme ba¸sarım tahmini kullanılmamı¸stır. Bu ara¸stırmada sorgu ba¸sarım tahmini y¨ontemlerinin anahtar kelime bulmada kullanımı incelenmi¸stir. Bunun i¸cin sorgu ba¸sarı tahmininde kullanılan ¨oznitelikleranahtar kelime bulma proble- minde Naive Bayes sınıflandırıcı ile birlikte kullanılmı¸stir. Yapılan deneyler bu ¨ozniteliklerin farklı boyuttaki belgelerde ba¸sarımıarttırdı˜gınıg¨ostermektedir. Daha da ¨onemlisibu ¨ozniteliklerinyaygın olarak kullanılan deyim ge¸cmefrekansı ve belgede ilk kullanım yeri ¨ozniteliklerinintersine kısa belgelerde daha ba¸sarılı oldu~gunu g¨ostermektedir. Anahtar s¨ozc¨ukler: Kelime b¨ut¨unl¨u~g¨u,Anlamsal ili¸siklilik, Konu B¨ol¨umleme, Ozetleme,¨ Anahtar Kelime C¸ıkarma. Acknowledgement I would like to thank Prof. Ilyas_ C¸i¸cekliwho has supervised and supported me since the day I started my graduate studies. I am indebted to Prof. Fazlı Can for all the encouragement and guidance he has provided me. Their support, expertise and invaluable recommendations were crucial in this long academic journey. I would like to thank all my committee members for spending their time and effort to read and comment on my dissertation. Many thanks to the former and current members of the Computer Engineering Department. It was both a pleasure and honour to get to know and work with you all. I would like to express my deepest gratitude to my family, Nurhayat-Sadık Ercan and Ilkay-G¨orkem_ Ercan and Berrin G¨ozenfor their patience and support throughout this long journey. Finally I thank and dedicate this dissertation to my wife Pınar who has given me both strength and courage to take on this challenge. viii Contents 1 Introduction 1 1.1 Goals and Contributions . 9 1.2 Outline . 11 2 Related Work 13 2.1 Related Linguistic Theories . 13 2.1.1 Semiotics and Meaning . 14 2.1.2 Analytic View . 15 2.1.3 Generative Grammar and Meaning . 15 2.2 Linguistic Background . 16 2.2.1 Morphology . 17 2.2.2 Syntax . 22 2.2.3 Coherence . 23 2.2.4 Cohesion . 24 2.2.5 Lexical Cohesion . 25 2.3 Literature Survey . 26 ix CONTENTS x 2.3.1 Semantic Relatedness . 27 2.3.2 Topic Segmentation . 31 2.3.3 Summarization . 36 2.3.4 Keyphrase Extraction . 41 3 Semantic Relatedness 44 3.1 Semantic Relatedness Methods . 45 3.1.1 WordNet Based Semantic Relatedness Measures . 46 3.1.2 WordNet based Semantic Relatedness Functions . 51 3.1.3 Corpora Based Semantic Relatedness Measures . 58 3.1.4 Building the Vocabulary . 59 3.1.5 Building the Co-occurrence Matrix .