Text Segmentation as a Supervised Learning Task

Omri Koshorek∗  Adir Cohen∗  Noam Mor  Michael Rotman  Jonathan Berant
School of Computer Science, Tel-Aviv University, Israel
{omri.koshorek,adir.cohen,noam.mor,michael.rotman,joberant}@cs.tau.ac.il

∗ Both authors contributed equally to this paper and the order of authorship was determined randomly.

Abstract

Text segmentation, the task of dividing a document into contiguous segments based on its semantic structure, is a longstanding challenge in language understanding. Previous work on text segmentation focused on unsupervised methods such as clustering or graph search, due to the paucity of labeled data. In this work, we formulate text segmentation as a supervised learning problem, and present a large new dataset for text segmentation that is automatically extracted and labeled from Wikipedia. Moreover, we develop a segmentation model based on this dataset and show that it generalizes well to unseen natural text.

1 Introduction

Text segmentation is the task of dividing text into segments, such that each segment is topically coherent, and cutoff points indicate a change of topic (Hearst, 1994; Utiyama and Isahara, 2001; Brants et al., 2002). This provides basic structure to a document in a way that can later be used by downstream applications such as summarization and information extraction.

Existing datasets for text segmentation are small in size (Choi, 2000; Glavaš et al., 2016), and are used mostly for evaluating the performance of segmentation algorithms. Moreover, some datasets (Choi, 2000) were synthesized automatically and thus do not represent the natural distribution of text in documents. Because no large labeled dataset exists, prior work on text segmentation tried to either come up with heuristics for identifying whether two sentences discuss the same topic (Choi, 2000; Glavaš et al., 2016), or to model topics explicitly with methods such as LDA (Blei et al., 2003) that assign a topic to each paragraph or sentence (Chen et al., 2009).

Recent developments in Natural Language Processing have demonstrated that casting problems as supervised learning tasks over large amounts of labeled data is highly effective compared to heuristic-based systems or unsupervised algorithms (Mikolov et al., 2013; Pennington et al., 2014). Therefore, in this work we (a) formulate text segmentation as a supervised learning problem, where a label for every sentence in the document denotes whether it ends a segment, and (b) describe a new dataset, WIKI-727K, intended for training text segmentation models.

WIKI-727K comprises more than 727,000 documents from English Wikipedia, where the table of contents of each document is used to automatically segment the document. Since this dataset is large, natural, and covers a variety of topics, we expect it to generalize well to other natural texts. Moreover, WIKI-727K provides a better benchmark for evaluating text segmentation models compared to existing datasets. We make WIKI-727K and our code publicly available at https://github.com/koomri/text-segmentation.

To demonstrate the efficacy of this dataset, we develop a hierarchical neural model in which a lower-level bidirectional LSTM creates sentence representations from word tokens, and then a higher-level LSTM consumes the sentence representations and labels each sentence. We show that our model outperforms prior methods, demonstrating the importance of our dataset for future progress in text segmentation.

2 Related Work

2.1 Existing Text Segmentation Datasets

The most common dataset for evaluating performance on text segmentation was created by Choi (2000). It is a synthetic dataset containing 920 documents, where each document is a concatenation of 10 random passages from the Brown corpus.

Glavaš et al. (2016) created a dataset of their own, which consists of 5 manually segmented political manifestos from the Manifesto project (https://manifestoproject.wzb.eu). Chen et al. (2009) also used English Wikipedia documents to evaluate text segmentation. They defined two datasets, one with 100 documents about major cities and one with 118 documents about chemical elements. Table 1 provides additional statistics on each dataset.

Thus, all existing datasets for text segmentation are small and cannot benefit from the advantages of training supervised models over labeled data.

2.2 Previous Methods

Bayesian text segmentation methods (Chen et al., 2009; Riedl and Biemann, 2012) employ a generative probabilistic model for text. In these models, a document is represented as a set of topics, which are sampled from a topic distribution, and each topic imposes a distribution over the vocabulary. Riedl and Biemann (2012) perform best among this family of methods: they define a coherence score between pairs of sentences, and compute a segmentation by finding drops in coherence scores between pairs of adjacent sentences.

Another noteworthy approach for text segmentation is GRAPHSEG (Glavaš et al., 2016), an unsupervised graph method, which performs competitively on synthetic datasets and outperforms Bayesian approaches on the Manifesto dataset. GRAPHSEG works by building a graph where nodes are sentences, and an edge between two sentences signifies that the sentences are semantically similar. The segmentation is then determined by finding maximal cliques of adjacent sentences, and heuristically completing the segmentation.

3 The WIKI-727K Dataset

For this work we have created a new dataset, which we name WIKI-727K. It is a collection of 727,746 English Wikipedia documents, together with their hierarchical segmentation as it appears in their table of contents. We randomly partitioned the documents into train (80%), development (10%), and test (10%) sets.

Different text segmentation use-cases require different levels of granularity. For example, for segmenting text by overarching topic it makes sense to train a model that predicts only top-level segments, which typically vary in topic – for example, "History", "Geography", and "Demographics". For segmenting a radio broadcast into separate news stories, which requires finer granularity, it makes sense to train a model to predict sub-segments. Our dataset provides the entire segmentation information, and an application may choose the appropriate level of granularity.

To generate the data, we performed the following preprocessing steps for each Wikipedia document (a code sketch of this pipeline appears at the end of this section):

• Removed all photos, tables, Wikipedia template elements, and other non-text elements.

• Removed single-sentence segments, documents with fewer than three segments, and documents where most segments were filtered.

• Divided each segment into sentences using the PUNKT tokenizer of the NLTK library (Bird et al., 2009). This is necessary for the use of our dataset as a benchmark, as without a well-defined sentence segmentation, it is impossible to evaluate different models.

We view WIKI-727K as suitable for text segmentation because it is natural, open-domain, and has a well-defined segmentation. Moreover, neural network models often benefit from a wealth of training data, and our dataset can easily be further expanded at very little cost.
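The following is a minimal sketch of the kind of preprocessing pipeline described above. It assumes each raw document has already been split into titled sections via its table of contents; the function name and the "most segments filtered" threshold are illustrative assumptions, not the released code.

import nltk

nltk.download("punkt", quiet=True)  # the PUNKT sentence tokenizer (Bird et al., 2009)

def preprocess_document(sections):
    # sections: list of (title, raw_text) pairs for one Wikipedia document,
    # already stripped of photos, tables, templates, and other non-text elements.
    kept = []
    for _title, text in sections:
        sents = nltk.sent_tokenize(text)   # divide the segment into sentences
        if len(sents) < 2:                 # drop single-sentence segments
            continue
        kept.append(sents)

    # Drop documents with fewer than three remaining segments, or documents
    # where most of the original segments were filtered (threshold is illustrative).
    if len(kept) < 3 or len(kept) < len(sections) / 2:
        return None

    sentences, labels = [], []
    for segment in kept:
        sentences.extend(segment)
        labels.extend([0] * (len(segment) - 1) + [1])  # 1 iff the sentence ends a segment
    return sentences, labels

The emitted labels match the formulation used in Section 4 below: a sentence is labeled 1 if and only if it ends a segment.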

4 Neural Model for Text Segmentation

We treat text segmentation as a supervised learning task, where the input x is a document, represented as a sequence of n sentences s_1, ..., s_n, and the label y = (y_1, ..., y_{n-1}) is a segmentation of the document, represented by n-1 binary values, where y_i denotes whether s_i ends a segment.

We now describe our model for text segmentation. Our neural model is composed of a hierarchy of two sub-networks, both based on the LSTM architecture (Hochreiter and Schmidhuber, 1997). The lower-level sub-network is a two-layer bidirectional LSTM that generates sentence representations: for each sentence s_i, the network consumes the words w_1^(i), ..., w_k^(i) of s_i one by one, and the final sentence representation e_i is computed by max-pooling over the LSTM outputs.

The higher-level sub-network is the segmentation prediction network. This sub-network takes a sequence of sentence embeddings e_1, ..., e_n as input, and feeds them into a two-layer bidirectional LSTM. We then apply a fully-connected layer on each of the LSTM outputs to obtain a sequence of n vectors in R^2. We ignore the last vector (for e_n), and apply a softmax function to obtain n-1 segmentation probabilities. Figure 1 illustrates the overall neural network architecture.

Figure 1: Our model contains a sentence embedding sub-network, followed by a segmentation prediction sub-network which predicts a cut-off probability for each sentence.
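To make the architecture concrete, here is a minimal PyTorch sketch of the two-level model described above. The hidden size, the 300-dimensional input embeddings, and the class name are illustrative assumptions; this is not the released implementation.

import torch
import torch.nn as nn

class HierarchicalSegmenter(nn.Module):
    # Sentence-encoder BiLSTM with max-pooling, followed by a
    # segmentation-prediction BiLSTM over the sentence embeddings.
    def __init__(self, emb_dim=300, hidden=256):
        super().__init__()
        self.sent_encoder = nn.LSTM(emb_dim, hidden, num_layers=2,
                                    bidirectional=True, batch_first=True)
        self.doc_encoder = nn.LSTM(2 * hidden, hidden, num_layers=2,
                                   bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, 2)  # two logits per sentence

    def forward(self, sentences):
        # sentences: list of (num_words, emb_dim) tensors of pre-trained word embeddings
        sent_embs = []
        for words in sentences:
            out, _ = self.sent_encoder(words.unsqueeze(0))   # (1, num_words, 2*hidden)
            sent_embs.append(out.max(dim=1).values)          # max-pool over words
        doc = torch.cat(sent_embs, dim=0).unsqueeze(0)       # (1, n, 2*hidden)
        out, _ = self.doc_encoder(doc)
        logits = self.classifier(out.squeeze(0))             # (n, 2)
        # Drop the last sentence and softmax to obtain n-1 cut-off probabilities.
        return torch.softmax(logits[:-1], dim=-1)[:, 1]

A forward pass consumes pre-trained word embeddings for each sentence and returns n-1 cut-off probabilities, one per sentence except the last.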

4.1 Training

Our model predicts for each sentence s_i the probability p_i that it ends a segment. For an n-sentence document, we minimize the sum of cross-entropy errors over each of the n-1 relevant sentences:

J(Θ) = − Σ_{i=1}^{n−1} [ y_i log p_i + (1 − y_i) log(1 − p_i) ]

Training is done by stochastic gradient descent in an end-to-end manner. For word embeddings, we use the GoogleNews pre-trained word embeddings.

We train our system to only predict the top-level segmentation (other granularities are possible). In addition, at training time we removed from each document the first segment, since in Wikipedia it is often a summary that touches many different topics, and is therefore less useful for training a segmentation model. We also omitted list and code-snippet tokens.

4.2 Inference

At test time, the model takes a sequence of word embeddings divided into sentences, and returns a vector p of cutoff probabilities between sentences. We use greedy decoding, i.e., we create a new segment whenever p_i is greater than a threshold τ. We optimize the parameter τ on our validation set, and use the optimal value while testing.
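Continuing the sketch above, the training objective and the greedy inference rule can be written as follows; the sum reduction matches the equation in Section 4.1, and the threshold value shown is only a placeholder for the τ tuned on the validation set.

import torch.nn.functional as F

def segmentation_loss(probs, labels):
    # probs: (n-1,) cut-off probabilities from the model sketch above.
    # labels: (n-1,) tensor with labels[i] == 1 iff sentence i ends a segment.
    return F.binary_cross_entropy(probs, labels.float(), reduction="sum")

def greedy_decode(probs, tau=0.4):
    # Start a new segment after every sentence whose probability exceeds tau.
    # The value 0.4 is a placeholder; tau is optimized on the validation set.
    return [i for i, p in enumerate(probs.tolist()) if p > tau]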

5 Experimental Details

We evaluate our method on the WIKI-727K test set, Choi's synthetic dataset, and the two small Wikipedia datasets (CITIES, ELEMENTS) introduced by Chen et al. (2009). We compare our model's performance with the results reported by Chen et al. (2009) and with GRAPHSEG. In addition, we evaluate the performance of a random baseline model, which starts a new segment after every sentence with probability 1/k, where k is the average segment size in the dataset.

Because our test set is large, it is difficult to evaluate some of the existing methods, which are computationally demanding. Thus, we introduce WIKI-50, a set of 50 randomly sampled test documents from WIKI-727K. We use WIKI-50 to evaluate systems that are too slow to evaluate on the entire test set. We also provide human segmentation performance results on WIKI-50.

We use the P_k metric as defined in Beeferman et al. (1999) to evaluate the performance of our model. P_k is the probability that when passing a sliding window of size k over sentences, the sentences at the boundaries of the window will be incorrectly classified as belonging to the same segment (or vice versa). To match the setup of Chen et al. (2009), we also provide the P_k metric for a sliding window over words when evaluating on the datasets from their paper. Following Glavaš et al. (2016), we set k to half of the average segment size in the ground-truth segmentation. For evaluation we used the SEGEVAL package (Fournier, 2013).
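For completeness, here is a simplified illustration of the P_k computation over sentence positions; the numbers reported in this paper were produced with the SEGEVAL package, whose boundary conventions may differ slightly.

def pk(reference, hypothesis, k):
    # reference, hypothesis: lists of 0/1 labels, where label[i] == 1 iff
    # sentence i ends a segment; k: window size (half the average reference
    # segment size, following Glavas et al., 2016).
    errors, total = 0, 0
    for i in range(len(reference) - k):
        # Sentences i and i+k fall in the same segment iff no sentence
        # from i up to i+k-1 ends a segment.
        same_ref = sum(reference[i:i + k]) == 0
        same_hyp = sum(hypothesis[i:i + k]) == 0
        errors += int(same_ref != same_hyp)
        total += 1
    return errors / total if total else 0.0

The random baseline from above can be simulated by drawing each hypothesis label independently with probability equal to one over the average segment size.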

In addition to segmentation accuracy, we also report runtime when running on a mid-range laptop CPU.

We note that segmentation results are not always directly comparable. For example, Chen et al. (2009) require that all documents in the dataset discuss the same topic, and so their method is not directly applicable to WIKI-50. Nevertheless, we attempt a comparison in Table 2.

                          WIKI-727K     CHOI          MANIFESTO     CITIES        ELEMENTS
Documents                 727,746       920           5             100           118
Segment length²           13.6 ± 20.3   7.4 ± 2.96    8.99 ± 10.8   5.15 ± 4.57   3.33 ± 3.05
Segments per document²    3.48 ± 2.23   9.98 ± 0.12   127 ± 42.9    12.2 ± 2.79   6.82 ± 2.57
Real-world                yes           no            yes           yes           yes
Large variety of topics   yes           yes           no            no            no

Table 1: Statistics on various text segmentation datasets.
² Statistics on top-level segments.

                      WIKI-727K    WIKI-50      CHOI         CITIES                  ELEMENTS
P_k variant           sentences    sentences    sentences    sentences    words      sentences    words
Chen et al. (2009)    -            -            -            -            22.1       -            20.1
GRAPHSEG              -            63.56        5.6-7.2      39.95        -          49.12        -
Our model             22.13        18.24        26.26³       19.68        18.14      41.63        33.82
Random baseline       53.09        52.65        49.43        47.14        44.14      50.08        42.80
Human performance     -            14.97        -            -            -          -            -

Table 2: P_k results on the test set (lower is better).
³ We optimized τ by cross-validation on the CHOI dataset.

5.1 Accuracy

Comparing our method to GRAPHSEG, we can see that GRAPHSEG gives better results on the synthetic Choi dataset, but this success does not carry over to the natural Wikipedia data, where it underperforms the random baseline. We explain this by noting that since the dataset is synthetic, and was created by concatenating unrelated documents, even the simple word counting method in Choi (2000) can achieve reasonable success. GRAPHSEG uses a similarity measure between word vectors to improve over the word counting method, but in a natural document, word similarity may not be enough to detect a change of topic within a single document. At the word level, two documents concerning completely different topics are much easier to differentiate than two sections in one document.

We compare our method to Chen et al. (2009) on the two small Wikipedia datasets from their paper. Our method outperforms theirs on CITIES and obtains worse results on ELEMENTS, where presumably our word embeddings were of lower quality, having been trained on Google News, where one might expect that few technical words from the domain of chemistry are used. We consider this result convincing, since we did not exploit the fact that all documents have similar structure, as Chen et al. (2009) did, and did not train specifically for these datasets, but were still able to demonstrate competitive performance.

Interestingly, human performance on WIKI-50 is only slightly better than our model. We assume that because annotators annotated only a small number of documents, they still lack familiarity with the right level of granularity for segmentation, and are thus at a disadvantage compared to the model, which has seen many documents.

5.2 Run Time

Our method's runtime is linear in the number of words and the number of sentences in a document. Conversely, GRAPHSEG has a much worse asymptotic complexity of O(N^3 + V^k), where N is the length of the longest sentence, V the number of sentences, and k the largest clique size. Moreover, neural network models are highly parallelizable and benefit from running on GPUs.

In practice, our method is much faster than GRAPHSEG. In Table 3 we report the average run time per document on WIKI-50 on a CPU.

                    WIKI-50
Our model (CPU)     1.6
GRAPHSEG (CPU)      23.6

Table 3: Average run time in seconds per document.

6 Conclusions

In this work, we present WIKI-727K, a large labeled dataset for text segmentation that enables training neural models using supervised learning methods. This closes an existing gap in the literature, where thus far text segmentation models were trained in an unsupervised fashion.

Our text segmentation model outperforms prior methods on Wikipedia documents, and performs competitively on prior benchmarks. Moreover, our system has linear runtime in the text length, and can be run on modern GPU hardware. We argue that for text segmentation systems to be useful in the real world, they must be able to segment arbitrary natural text, and this work provides a path towards achieving that goal.

In future work, we will explore richer neural models at the sentence level. Another important direction is developing a structured global model that will take all local predictions into account and then perform a global segmentation decision.

Acknowledgements

We thank the anonymous reviewers for their constructive feedback. This work was supported by the Israel Science Foundation, grant 942/16.

References

Doug Beeferman, Adam Berger, and John Lafferty. 1999. Statistical models for text segmentation. Machine Learning 34(1):177–210.

S. Bird, E. Loper, and E. Klein. 2009. Natural Language Processing with Python. O'Reilly Media Inc.

D. Blei, A. Ng, and M. I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research (JMLR) 3:993–1022.

Thorsten Brants, Francine Chen, and Ioannis Tsochantaridis. 2002. Topic-based document segmentation with probabilistic latent semantic analysis. In Proceedings of the Eleventh International Conference on Information and Knowledge Management. ACM, pages 211–218.

Harr Chen, S.R.K. Branavan, Regina Barzilay, and David R. Karger. 2009. Global models of document structure using latent permutations. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, pages 371–379.

Freddy Y.Y. Choi. 2000. Advances in domain independent linear text segmentation. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference. Association for Computational Linguistics, pages 26–33.

Chris Fournier. 2013. Evaluating text segmentation using boundary edit distance. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, USA.

Goran Glavaš, Federico Nanni, and Simone Paolo Ponzetto. 2016. Unsupervised text segmentation using semantic relatedness graphs. Association for Computational Linguistics.

Marti A. Hearst. 1994. Multi-paragraph segmentation of expository text. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pages 9–16.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

T. Mikolov, W. Yih, and G. Zweig. 2013. Linguistic regularities in continuous space word representations. In HLT-NAACL. volume 13, pages 746–751.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP). pages 1532–1543. http://www.aclweb.org/anthology/D14-1162.

Martin Riedl and Chris Biemann. 2012. TopicTiling: A text segmentation algorithm based on LDA. In Proceedings of the ACL 2012 Student Research Workshop. Association for Computational Linguistics, pages 37–42.

Masao Utiyama and Hitoshi Isahara. 2001. A statistical model for domain-independent text segmentation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pages 499–506.