A Generic Neural Text Segmentation Model with Pointer Network

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18) SEGBOT: A Generic Neural Text Segmentation Model with Pointer Network Jing Li, Aixin Sun and Shafiq Joty School of Computer Science and Engineering, Nanyang Technological University, Singapore [email protected], {axsun,srjoty}@ntu.edu.sg Abstract [A person]EDU [who never made a mistake]EDU [never tried any- thing new]EDU Text segmentation is a fundamental task in natural language processing that comes in two levels of Figure 1: A sentence with three elementary discourse units (EDUs). granularity: (i) segmenting a document into a se- and Thompson, 1988]. quence of topical segments (topic segmentation), Both topic and EDU segmentation tasks have received a and (ii) segmenting a sentence into a sequence of lot of attention in the past due to their utility in many NLP elementary discourse units (EDU segmentation). tasks. Although related, these two tasks have been addressed Traditional solutions to the two tasks heavily rely separately with different sets of approaches. Both supervised on carefully designed features. The recently propo- and unsupervised methods have been proposed for topic seg- sed neural models do not need manual feature en- mentation. Unsupervised topic segmentation models exploit gineering, but they either suffer from sparse boun- the strong correlation between topic and lexical usage, and dary tags or they cannot well handle the issue of can be broadly categorized into two classes: similarity-based variable size output vocabulary. We propose a ge- models and probabilistic generative models. The similarity- neric end-to-end segmentation model called SEG- based models are based on the key intuition that sentences BOT.SEGBOT uses a bidirectional recurrent neural in a segment are more similar to each other than to senten- network to encode input text sequence. The model ces in the preceding or the following segment. Examples then uses another recurrent neural network toget- of this category are TextTiling [Hearst, 1997], C99 [Choi, her with a pointer network to select text bounda- 2000], and LCSeg [Galley et al., 2003]. Probabilistic gene- ries in the input sequence. In this way, SEGBOT rative models are based on the intuition that a discourse is a does not require hand-crafted features. More im- hidden sequence of topics, each of which has its own charac- portantly, our model inherently handles the issue teristic word distribution. Variants of Hidden Markov Mo- of variable size output vocabulary and the issue of dels (HMMs) and Latent Dirichlet Allocations (LDAs) fall sparse boundary tags. In our experiments, SEGBOT into this class. Supervised topic segmentation models are outperforms state-of-the-art models on both topic more flexible in using more features (e.g., cue phrases, length and EDU segmentation tasks. and similarity scores) and generally perform better than unsupervised models, however, come with the price of efforts to 1 Introduction manually design informative features and to annotate large Text segmentation has been a fundamental task in natural lan- amount of data. Successful methods for EDU segmentation guage processing (NLP) that has been addressed at different are mostly supervised, and they use lexical and syntactic fea- levels of granularity. At a coarser level, text segmentation tures [Soricut and Marcu, 2003]. generally refers to breaking a document into a sequence of While most existing topic segmentation methods use lexi- topically coherent segments, often known as topic segmenta- cal similarity based on surface terms (i.e., words), it is now tion [Hearst, 1997; Choi, 2000]. Topic segmentation is often generally admitted that lexical semantics are better captured considered as a pre-requisite for other higher level discourse with distributed representations [Goldberg, 2017]. Further- analysis tasks like discourse parsing [Joty et al., 2015], and more, existing supervised models, for both topic and EDU has been shown to support a number of downstream NLP ap- segmentation, require a large set of features manually desig- plications including text summarization and passage retrie- ned for each task and domain, which demands task and dom- val [Riedl and Biemann, 2012]. At a finer level, text seg- ain expertise. We would envision for a system that is based mentation refers to breaking each sentence into a sequence on distributed representation, and that can learn informative of elementary discourse units (EDUs), often known as EDU features for each task and domain by itself without requiring segmentation [Marcu, 2000]. As exemplified in Figure 1, human effort. In this paper, we propose a generic neural ar- EDUs are clause-like units that serve as building blocks for chitecture that can achieve this goal. discourse parsing in Rhetorical Structure Theory [William Both topic and EDU segmentation can be treated as a se- 4166 Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18) &# &$ , - "# "$ "% One branch of unsupervised methods is based on the idea of /(01234 lexical cohesion, which states that similar vocabulary tends !! !! !! !! !! !!&'()*+ !! !! !! to be in a coherent segment. Hearst et al. [1997] introdu- 5301234 ced TextTiling, based on the fact that high vocabulary inter- ./012&3/425 "# "$ "% 3# 3$ 3% '()*+,"(-+. section between two adjacent blocks means high coherence (a) RNN-CRF (b) Sequence-to-sequence and vice versa. C99 [Choi, 2000] is an algorithm based on di- visive clustering with a matrix-ranking schema. LCSeg [Gal- Figure 2: Two classical models for sequence labeling. ley et al., 2003] uses a lexical chain to identify and weight quence labeling problem, where the task is to predict a se- word repetitions. U00 [Utiyama and Isahara, 2001] is a pro- quence of ‘yes/no’ boundary tags at the level of sentences babilistic approach using dynamic programming to find a in a document (for topic segmentation) or words in a sen- segmentation with minimum cost. Many unsupervised met- tence (for EDU segmentation). Conditional random fields hods are based on topic modeling, including TopSeg [Brants (CRFs) have been the traditional models for such sequence et al., 2002] and LDA based models [Misra et al., 2009; labeling tasks in NLP. More recently, recurrent neural net- Riedl and Biemann, 2012]. The idea is to induce semantic works with CRF output layer (RNN-CRF) as shown in Fi- relationship between words, and use topics assigned by topic gure 2(a) have provided state-of-art results in many sequence models to build sentence vector. tagging tasks in NLP [Lample et al., 2016]. However, due Supervised methods have also been proposed for text to the sparsity of ‘yes’ boundary tags in EDU and topic seg- segmentation. Hsueh et al. [2006] integrate lexical and mentation tasks, CRFs did not provide any additional gain conversation-based features for topic and sub-topic segmen- over simple classifiers like MaxEnt [Fisher and Roark, 2007; tation. Hernault et al. [2010] use CRF to train a discourse Joty et al., 2015]. segmenter with a set of lexical and syntactic features. Joty Instead, we cast our segmentation problems as sequence et al. [2015] train a binary classifier to decide whether to prediction tasks with seq2seq encoder-decoder models, which place an EDU boundary by using lexico-syntactic, shallow have been quite successful in machine translation and sum- syntactic and contextual features. Different from existing stu- marization. Figure 2(b) shows a toy seq2seq model, which dies where features have to be carefully hand-crafted, our ap- uses an RNN to encode an input sequence (U1;U2;U3), and proach does not need feature engineering. then uses another RNN as a language model to generate the Sequence Labeling. Sequence labeling is a fundamental output sequence (O1;O2). However, one limitation of this task in NLP. Traditional methods employ machine learning basic model is that the output vocabulary (i.e., from which models like hidden markov models [Qiao et al., 2015] and O1 and O2 are drawn) is fixed so that it needs to train diffe- CRFs [Liu et al., 2011], and have achieved relatively high rent models with respect to different vocabularies. Whereas performance. The drawback of these methods is that they re- in our tasks, the segmentation positions depend on the input quire domain-specific knowledge in the form of hand-crafted sequence. For example, in the input sequence in Figure 3, features and data pre-processing. U3 U6 U8 there are three segment boundaries – units , , and . Recently, neural network models have been successfully To alleviate these issues, we propose SEGBOT, a generic applied to sequence labeling. For instance, Long Short-Term end-to-end neural model for text segmentation at various le- Memory (LSTM) network has been used to encode the in- vels of granularity. SEGBOT uses distributed representations put sequence, and a CRF layer is used to decode tag se- to better capture lexical semantics, and employs a bidirecti- quence [Lample et al., 2016]. However, in text segmenta- onal RNN to model sequential dependencies while encoding tion tasks, segment boundaries are sparse making CRFs less a text. The decoder, which is an unidirectional RNN, uses a effective [Fisher and Roark, 2007; Joty et al., 2015]. Vas- [ ] pointer mechanism Vinyals et al., 2015 to infer the segment wani et al. [2016] use seq2seq model to output tag sequence boundaries. In addition, SEGBOT can effectively handle vari- through a LSTM layer. The drawback of these approaches is able size vocabulary in the output to produce segment boun- that the output dictionary is fixed and is not dependent on the daries depending on the input sequence. input sequence. By using a pointer-based neural network, our In summary, we make the following contributions: model overcomes both shortcomings simultaneously. • We propose SEGBOT– a generic end-to-end model for text segmentation at various levels of granularity. SEG- 3 SEGBOT: Neural Text Segmentation Model BOT learns informative features automatically while al- leviating the problem of tag sparsity in output sequence To address the problem of sparse boundary tags and variable and the problem of variable size output vocabulary.

A Generic Neural Text Segmentation Model with Pointer Network

Sentence Boundary Detection for Handwritten Text Recognition Matthias Zimmermann

Multiple Segmentations of Thai Sentences for Neural Machine Translation

A Clustering-Based Algorithm for Automatic Document Separation

An Incremental Text Segmentation by Clustering Cohesion

A Text Denormalization Algorithm Producing Training Data for Text Segmentation

Topic Segmentation: Algorithms and Applications

Text Segmentation Techniques: a Critical Review

Text Segmentation Based on Semantic Word Embeddings

Steps Involved in Text Recognition and Recent Research in OCR; a Study

Word Sense Disambiguation and Text Segmentation Based on Lexical

Multiple Segmentations of Thai Sentences for Neural Machine

A Journey Through Existing Procedures on Ocr Recognition and Text