Text Segmentation in Novels

Chapter Captor: Text Segmentation in Novels Charuta Pethe, Allen Kim, Steven Skiena Department of Computer Science, Stony Brook University, NY, USA fcpethe,allekim,[email protected] Abstract titles. This physical segmentation improves the readability of long texts for human readers, provid- Books are typically segmented into chap- ing transition cues for breaks in the story. ters and sections, representing coherent sub- In this paper, we investigate the task of identi- narratives and topics. We investigate the task fying chapter boundaries in literary works, as a of predicting chapter boundaries, as a proxy for the general task of segmenting long texts. proxy for that of large-scale text segmentation. The We build a Project Gutenberg chapter segmen- text of thousands of scanned books are available in tation data set of 9,126 English novels, using repositories such as Project Gutenberg (Gutenberg, a hybrid approach combining neural inference n.d.), making the chapter boundaries of these texts and rule matching to recognize chapter title an attractive source of annotations to study text headers in books, achieving an F1-score of segmentation. Unfortunately, the physical manifes- 0.77 on this task. Using this annotated data tations of the printed book have been lost in the as ground truth after removing structural cues, we present cut-based and neural methods for Gutenberg texts, limiting their usefulness for such chapter segmentation, achieving an F1-score studies. Chapter titles and numbers are retained in of 0.453 on the challenging task of exact break the texts but not systematically annotated: indeed prediction over book-length documents. Fi- they sit as hidden obstacles for most NLP analysis nally, we reveal interesting historical trends in of these texts. the chapter structure of novels. We develop methods for extracting ground truth chapter segmentation from Gutenberg texts, and 1 Introduction use this as training/evaluation data to build text Text segmentation (Hearst, 1994; Beeferman et al., segmentation systems to predict the natural bound- 1999) is a fundamental task in natural language aries of long narratives. Our primary contributions 1 processing, which seeks to partition texts into se- include: quences of coherent segments or episodes. Segmen- • Project Gutenberg Chapter Segmentation tation tasks differ widely in scale, from partitioning Resource: To create a ground-truth data set sentences into clauses to dividing large texts into for chapter segmentation, we developed a hy- coherent parts, where each segment is ideally an brid approach to recognizing chapter format- independent event occurring in the narrative. ting which is of independent interest. It com- Text segmentation plays an important role in bines a neural model with a regular expression many NLP applications including summarization, based rule matching system. Evaluation on information retrieval, and question answering. In a (noisy) silver-standard chapter partitioning the context of literary works, event detection is a yields a mean value F1 score of 0.77 of a test central concern in discourse analysis (Joty et al., set of 640 books, but manual investigation 2019). In order to obtain representations of events, shows this evaluation receives an artificially it is essential to identify narrative boundaries in the low recall score due to incorrect header tags text, where one event ends and another begins. in the silver-standard. In novels and related literary works, authors of- ten define such coherent segments by means of Our data set consists of 9,126 English fiction sections and chapters. Chapter boundaries are typ- books in the Project Gutenberg corpus. To ically denoted by formatting conventions such as 1All code and links to data are available at https:// page breaks, white-space, chapter numbers, and github.com/cpethe/chapter-captor. 8373 Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 8373–8383, November 16–20, 2020. c 2020 Association for Computational Linguistics encourage further work on text segmentation a steady decline. Second, an analysis of reg- for narratives, we make the annotated chapter ular expression patterns reveal the wide vari- boundaries data publicly available for future ety of chapter header conventions and which research. forms dominate. • Local Methods for Chapter Segmentation: 2 Previous Work By concatenating chapter text following the Many approaches have been developed in recent removal of all explicit signals of chapter years to address variants of the task of identifying boundaries (white space and header notations), structural elements in books. we create a natural test bed to develop and McConnaughey et al.(2017) attempt this task evaluate algorithms for large-document text at the page-level, by assigning a label (e.g. Pref- segmentation. We develop two distinct ap- ace, Index, Table of Contents, etc.) to each page proaches for predicting the location of chap- of the book. Wu et al.(2013) address the task ter breaks: an unsupervised weighted-cut of recognizing and extracting tables of contents approach minimizing cross-boundary cross- from book documents, with a focus on identify- references, and a supervised neural network ing its style. Participants of the Book Structure building on the BERT language model (Devlin Extraction competition at ICDAR 2013 (Doucet et al., 2019). Both prove effective at identify- et al., 2013) attempted to use various approaches ing likely boundary sites, with F1 scores of for the task. These include making use of the ta- 0.164 and 0.447 respectively on the test set. ble of contents, OCR information, whitespace, and • Global Break Prediction using Optimiza- indentation.D ejean´ and Meunier(2005) present tion: Social conventions encourage authors to approaches to identify a table of contents in a book, maintain chapters of modest yet roughly equal andD ejean´ and Meunier(2009) attempt to struc- length. By incorporating length criteria into ture a document according to its table of contents. the desired optimization criteria and using dy- However, our approach relies only on text, and namic programming to find the best global does not require positional information or OCR solution enables us to control how important coordinates to extract front matter and headings it is to keep the segments equal. We find that from book texts. a balance between equal segments and model- For text segmentation, many approaches have influenced segments gives us the best segmen- been developed over the past years, suitable for tation, with minimal error. Indeed, augment- different types of data, such as news articles, sci- ing the BERT-based local classifier with dy- entific article, Wikipedia pages, and conversation namic programming yielded an F1 score of transcripts. 0.453 on the challenging task of exact break The TextTiling algorithm (Hearst, 1994) makes prediction over book-length documents, while use of lexical frequency distributions across blocks simultaneously beating challenging baselines of a fixed number of words. Dotplotting (Reynar, on two other error metrics. 1994) is a graphical technique to locate discourse boundaries using lexical cohesion across the entire Incorporating chapter length criteria require document. an independent estimate of the number of Yamron et al.(1998) and Beeferman et al.(1999) chapters in a given text. We demonstrate that propose methods to identify story boundaries in there are approximately five times as many news transcripts. likely break candidates as there are chapter The C99 algorithm (Choi, 2000) uses a global breaks in the weighted cut approach, reflect- lexical similarity matrix and a ranking scheme for ing the number of sub-events within an aver- divisive clustering. Choi et al.(2001) further pro- age book chapter. posed the use of Latent Semantic Analysis (LSA) • Historical Analysis of Segmentation Con- to compute inter-sentence similarity. ventions – We exploit our data analysis of seg- Utiyama and Isahara(2001) proposed a statis- mented books in two directions. We demon- tical model to find the maximum probability seg- strate that novels grew in length to an average mentation. The Minimum Cut model (Barzilay of roughly 30 chapters/book by 1800, and re- and Malioutov, 2006) addresses segmentation as a tained this length until 1875 before beginning graph partitioning task. 8374 This problem has also been addressed in a as ‘Preface’, ‘Table of contents’ etc. to identify Bayesian setting (Eisenstein and Barzilay, 2008; front matter. We tag all such content up to the first Eisenstein, 2009). TopicTiling (Riedl and Biemann, chapter heading as the front matter, and identify 2012) is a modification of the TextTiling algorithm, the remaining content as body. and makes use of LDA for topic modeling. 3.2.1 BERT Inference Segmentation using sentence similarity has been extensively explored using affinity propagation We fine-tune a pretrained BERT model (Devlin (Kazantseva and Szpakowicz, 2011; Sakahara et al., 2019) with a token classification head, to et al., 2014). More recent approaches (Alemi and identify the lines which are likely to be headers. Ginsparg, 2015; Glavasˇ et al., 2016) involve the use Training: For each header extracted from the of semantic representations of words to compute Project Gutenberg HTML files, we append con- sentence similarities. Koshorek et al.(2018) and tent from before and after the header, to generate Badjatiya et al.(2018) propose neural models to training sequences of fixed length. We empirically

Load more