Topic Segmentation: Algorithms and Applications
Total Page:16
File Type:pdf, Size:1020Kb
University of Pennsylvania ScholarlyCommons IRCS Technical Reports Series Institute for Research in Cognitive Science 8-1-1998 Topic Segmentation: Algorithms And Applications Jeffrey C. Reynar University of Pennsylvania Follow this and additional works at: https://repository.upenn.edu/ircs_reports Part of the Databases and Information Systems Commons Reynar, Jeffrey C., "Topic Segmentation: Algorithms And Applications" (1998). IRCS Technical Reports Series. 66. https://repository.upenn.edu/ircs_reports/66 University of Pennsylvania Institute for Research in Cognitive Science Technical Report No. IRCS-98-21. This paper is posted at ScholarlyCommons. https://repository.upenn.edu/ircs_reports/66 For more information, please contact [email protected]. Topic Segmentation: Algorithms And Applications Abstract Most documents are aboutmore than one subject, but the majority of natural language processing algorithms and information retrieval techniques implicitly assume that every document has just one topic. The work described herein is about clues which mark shifts to new topics, algorithms for identifying topic boundaries and the uses of such boundaries once identified. A number of topic shift indicators have been proposed in the literature. We review these features, suggest several new ones and test most of them in implemented topic segmentation algorithms. Hints about topic boundaries include repetitions of character sequences, patterns of word and word n-gram repetition, word frequency, the presence of cue words and phrases and the use of synonyms. The algorithms we present use cues singly or in combination to identify topic shifts in several kinds of documents. One algorithm tracks compression performance, which is an indicator of topic shift because self-similarity within topic segments should be greater than between-segment similarity. Another technique relies on word repetition and places boundaries by minimizing word repetitions across segment boundaries. A third method compares the performance of a language model with and without knowledge of the contents of preceding sentences to determine whether a topic shift has occurred. We use the output of this algorithm in a statistical model which incorporates synonymy, bigram repetition and other features for topic segmentation. We benchmark our algorithms and compare them to algorithms from the literature using concatenations of documents, and then perform further evaluation of our techniques using a collection of news broadcasts transcribed both by annotators and using a speech recognition system. We also test the effectiveness of our algorithms for identifying both chapter boundaries in works of literature and story boundaries in Spanish news broadcasts. We suggest ways to improve information retrieval, language modeling and various natural language processing algorithms by exploiting the topic segmentation. Disciplines Databases and Information Systems Comments University of Pennsylvania Institute for Research in Cognitive Science Technical Report No. IRCS-98-21. This thesis or dissertation is available at ScholarlyCommons: https://repository.upenn.edu/ircs_reports/66 TOPIC SEGMENTATION ALGORITHMS AND APPLICATIONS Jerey C Reynar A DISSERTATION in Computer and Information Science Presented to the Faculties of the UniversityofPennsylvania in Partial Fulllment of the Requirements for the Degree of Do ctor of Philosophy Mitchell P Marcus Adviser Mark Steedman Graduate Group Chair COPYRIGHT Jerey C Reynar In memory of Mom and Dad iii Acknowledgements Like all research the work describ ed in this dissertation was not conducted in isolation It is imp ossible to thank everyone who has in some way contributed to it and equally imp ossible to thank individuals for all of their contributions I am indebted for technical insights encouragement and friendship to manymemb ers of the vibrant research commu nity at the UniversityofPennsylvania as well as those I had the privilege of working with at the Summer Institute of Linguistics IBMs TJ Watson Research Center Fuji Xerox and Infonautics I rst want to thank my advisor Mitch Marcus for providing me with many good ideas inspiring others in me and helping me rene and develop my own Thanks also to Mitchforallowing me the freedom to pursue the MUC and LexisNexis pro jects and to sp end several summers working in industry Thanks to the memb ers of my thesis committee Julia Hirschb erg Aravind Joshi Mark Lib erman and Lyle Ungar for providing suggestions which improved the quality of this work Additional thanks to Aravind Mark and Mitch for serving on my WPE commit tee Ive b eneted from conversations with and used natural language pro cessing to ols de velop ed by the memb ers of the UniversityofPennsylvania MUC team Breck Baldwin Mike Collins Jason Eisner Adwait Ratnaparkhi Joseph Rosenzweig Ano op Sarkar and B Srinivas Adwait deserves sp ecial thanks for p ermitting me to use his maximum entropy Chandrasekar Christy Doran Beth Ann Ho ckey mo deling to ols Thanks also to Mickey Al Kim Dan Melamed Tom Morton Michael Niv Martha Palmer MikeSchultz iv Matthew Stone and David Yarowsky for helpful discussions and feedback My time at Penn was made more enjoyable by the friendship of each of these p eople as well I also greatly proted from discussions with researchers outside of Penn including Ken Church Lance Ramshaw Salim Roukos Penni Sibun Larry Spitz and Mark Wasson Thanks also to Kathy McCoy my de facto undergraduate advisor for intro ducing me to computational linguistics Thanks to Betsy Norman and Amy Dunn for regularly squeezing me into Mitchs schedule on short notice and for go o d conversation while waiting to see him Im grateful to MikeFelker for help getting over the administrativehurdles I faced as a graduate student and to the sta of the CIS business oce for helping me navigate the complexities of the UniversityofP ennsylvanias payroll and billing systems FinallyI owe the most gratitude to my parents who always encouraged me to pursue my dreams and strive to do my b est and to my wife Angela for picking up where they left o v Abstract TOPIC SEGMENTATION ALGORITHMS AND APPLICATIONS Jerey C Reynar Sup ervisor Mitchell P Marcus Most do cuments are about more than one sub ject but the ma jority of natural language pro cessing algorithms and information retrieval techniques implicitly assume that every do cument has just one topic The work describ ed herein is ab out clues which mark shifts to new topics algorithms for identifying topic b oundaries and the uses of such b oundaries once identied A number of topic shift indicators have been prop osed in the literature We review these features suggest several new ones and test most of them in implemented topic segmentation algorithms Hints ab out topic b oundaries include rep etitions of character sequences patterns of word and word ngram rep etition word frequency the presence of cue words and phrases and the use of synon yms The algorithms we present use cues singly or in combination to identify topic shifts in several kinds of do cuments One algorithm tracks compression p erformance which is an indicator of topic shift b ecause selfsimilarity within topic segments should b e greater than betweensegment similarity Another technique relies on word rep etition and places b oundaries by minimizing word rep etitions across segment b oundaries A third metho d compares the p erformance of a language mo del with and without knowledge of the contents of preceding sentences to determine whether a topic shift has o ccurred We use the output vi of this algorithm in a statistical mo del which incorp orates synonymy bigram rep etition and other features for topic segmentation We b enchmark our algorithms and compare them to algorithms from the literature using concatenations of do cuments and then p erform further evaluation of our techniques using a collection of news broadcasts transcrib ed b oth by annotators and using a sp eech recognition system We also test the eectiveness of our algorithms for identifying both chapter b oundaries in works of literature and story b oundaries in Spanish news broadcasts We suggest ways to improve information retrieval language mo deling and various nat ural language pro cessing algorithms by exploiting the topic segmentation vii Contents Acknowledgements iv Abstract vi Intro duction Motivation Segmentation Cues The Noisy Channel Mo del Goals Topic Segmentation not Discourse Segmentation Outline Theories of Discourse Structure Halliday and Hasan Grosz Sidner Skoro cho dko Structuring Clues Currently Uncomputable Features Grimes Nakhimovsky Summary Currently Computable Features Cue Words Phrases viii First Uses Word Rep etition Word ngram Rep etition W ord Frequency Synonymy Named Entities Pronoun Usage Character ngram Rep etition