Text Segmentation by Cross Segment Attention

Text Segmentation by Cross Segment Attention Michal Lukasik, Boris Dadachev, Kishore Papineni, Gonçalo Simoes˜ Google Research fmlukasik,bdadachev,papineni,[email protected] Abstract Early life and marriage: Franklin Delano Roosevelt was born on January 30, 1882, in the Document and discourse segmentation are two Hudson Valley town of Hyde Park, New York, to businessman James Roosevelt I and his second wife, Sara Ann Delano. (...) fundamental NLP tasks pertaining to breaking Aides began to refer to her at the time as “the president’s girl- up text into constituents, which are commonly friend”, and gossip linking the two romantically appeared in the used to help downstream tasks such as infor- newspapers. mation retrieval or text summarization. In this (...) work, we propose three transformer-based ar- Legacy: Roosevelt is widely considered to be one of the most important chitectures and provide comprehensive com- figures in the history of the United States, as well as one of the parisons with previously proposed approaches most influential figures of the 20th century. (...) Roosevelt has on three standard datasets. We establish a new also appeared on several U.S. Postage stamps. state-of-the-art, reducing in particular the er- ror rates by a large margin in all cases. We Figure 1: Illustration of text segmentation on the ex- further analyze model sizes and find that we ample of the Wikipedia page of President Roosevelt. can build models with many fewer parameters The aim of document segmentation is breaking the raw while keeping good performance, thus facili- text into a sequence of logically coherent sections (e.g., tating real-world applications. “Early life and marriage” and “Legacy” in our example). 1 Introduction Text segmentation is a traditional NLP task that A related task called discourse segmentation breaks up text into constituents, according to prede- breaks up pieces of text into sub-sentence elements fined requirements. It can be applied to documents, called Elementary Discourse Units (EDUs). EDUs in which case the objective is to create logically are the minimal units in discourse analysis accord- coherent sub-document units. These units, or seg- ing to the Rhetorical Structure Theory (Mann and ments, can be any structure of interest, such as Thompson, 1988). In Figure2 we show examples paragraphs or sections. This task is often referred of EDU segmentations of sentences. For example, to as document segmentation or sometimes simply the sentence “Annuities are rarely a good idea at the text segmentation. In Figure1 we show one ex- age 35 because of withdrawal restrictions” decom- ample of document segmentation from Wikipedia, poses into the following two EDUs: “Annuities are on which the task is typically evaluated (Koshorek rarely a good idea at the age 35” and “because of et al., 2018; Badjatiya et al., 2018). withdrawal restrictions”, the first one being a state- Documents are often multi-modal, in that they ment and the second one being a justification in the cover multiple aspects and topics; breaking a doc- discourse analysis. In addition to being a key step ument into uni-modal segments can help improve in discourse analysis (Joty et al., 2019), discourse and/or speed up down stream applications. For segmentation has been shown to improve a number example, document segmentation has been shown of downstream tasks, such as text summarization, to improve information retrieval by indexing sub- by helping to identify fine-grained sub-sentence document units instead of full documents (Llopis units that may have different levels of importance et al., 2002; Shtekh et al., 2018). Other applications when creating a summary (Li et al., 2016). such as summarization and information extraction Multiple neural approaches have been recently can also benefit from text segmentation (Koshorek proposed for document and discourse segmenta- et al., 2018). tion. Koshorek et al.(2018) proposed the use of 4707 Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 4707–4716, November 16–20, 2020. c 2020 Association for Computational Linguistics Sentence 1: 2 Literature review Annuities are rarely a good idea at the age 35 k because of withdrawal restrictions Sentence 2: Document segmentation Many early research Wanted: k An investment k that’s as simple and secure as a efforts were focused on unsupervised text segmen- certificate of deposit k but offers a return k worth getting excited tation, doing so by quantifying lexical cohesion about. within small text segments (Hearst, 1997; Choi, Figure 2: Example discourse segmentations from the 2000). Being hard to precisely define and quan- RST-DT dataset (Carlson et al., 2001). In the segmentations, the EDUs are separated by the k character. tify, lexical cohesion has often been approximated by counting word repetitions. Although compu- tationally expensive, unsupervised Bayesian approaches have also been popular (Utiyama and Isa- hierarchical Bi-LSTMs for document segmenta- hara, 2001; Eisenstein, 2009; Mota et al., 2019). tion. Simultaneously, Li et al.(2018) introduced However, unsupervised algorithms suffer from two an attention-based model for both document seg- main drawbacks: they are hard to specialize for a mentation and discourse segmentation, and Wang given domain and in most cases do not naturally et al.(2018) obtained state of the art results on dis- deal with multi-scale issues. Indeed, the desired course segmentation using pretrained contextual segmentation granularity (paragraph, section, chap- embeddings (Peters et al., 2018). Also, a new large- ter, etc.) is necessarily task dependent and super- scale dataset for document segmentation based vised learning provides a way of addressing this on Wikipedia was introduced by Koshorek et al. property. Therefore, supervised algorithms have (2018), providing a much more realistic setup for been a focus of many recent works. evaluation than the previously used small scale and often synthetic datasets such as the Choi dataset In particular, multiple neural approaches have (Choi, 2000). been proposed for the task. In one, a sequence label- ing algorithm is proposed where each sentence is However, these approaches are evaluated on dif- encoded using a Bi-LSTM over tokens, and then a ferent datasets and as such have not been compared Bi-LSTM over sentence encodings is used to label against one another. Furthermore they mostly rely each sentence as ending a segment or not (Koshorek on RNNs instead of the more recent transformers et al., 2018). Authors consider a large dataset based (Vaswani et al., 2017) and in most cases do not on Wikipedia, and report improvements over un- make use of contextual embeddings which have supervised text segmentation methods. In another been shown to help in many classical NLP tasks work, a sequence-to-sequence model is proposed (Devlin et al., 2018). (Li et al., 2018), where the input is encoded using a In this work we aim at addressing these limita- BiGRU and segment endings are generated using a tions and make the following contributions: pointer network (Vinyals et al., 2015). The authors report significant improvements over sequence la- 1. We compare recent approaches that were pro- beling approaches, however on a dataset composed posed independently for text and/or discourse of 700 artificial documents created by concatenat- segmentation (Li et al., 2018; Koshorek et al., ing segments from random articles from the Brown 2018; Wang et al., 2018) on three public corpus (Choi, 2000). Lastly, Badjatiya et al.(2018) datasets. consider an attention-based CNN-Bi-LSTM model 2. We introduce three new model architectures and evaluate it on three small-scale datasets. based on transformers and BERT-style contextual embeddings to the document and dis- Discourse Segmentation Contrary to document course segmentation tasks. We analyze the segmentation, discourse segmentation has histor- strengths and weaknesses of each architecture ically been framed as a supervised learning task. and establish a new state-of-the-art. However, a challenge of applying supervised ap- 3. We show that a simple paradigm argued for proaches for this type of segmentation is the fact by some of the earliest text segmentation algo- that the available dataset for the task is limited rithms can achieve competitive performance (Carlson et al., 2001). For this reason, approaches in the current neural era. for discourse segmentation usually rely on exter- 4. We conduct ablation studies analyzing the im- nal annotations and resources to help the models portance of context size and model size. generalize. Early approaches to discourse segmen- 4708 tation were based on features from linguistic anno- In Figure 3(a) we illustrate the model. The input tations such as POS tags and parsing trees (Soricut is composed of a [CLS] token, followed by the two and Marcu, 2003; Xuan Bach et al., 2012; Joty contexts concatenated together, and separated by a et al., 2015). The performance of these systems [SEP] token. When necessary, short contexts are was highly dependent on the quality of the annota- padded to the left or to the right with [PAD] tokens. tions. [CLS], [SEP] and [PAD] are special tokens intro- Recent approaches started to rely on end-to-end duced by BERT (Devlin et al., 2018). They stand neural network models that do not need linguistic for, respectively, ”classification token” (since it is annotations to obtain high-quality results, relying typically for classification tasks, as a representation instead on pretrained models to obtain word or of the entire input sequence), ”separator token” and sentence representations. An example of such work ”padding token”. The input is then fed into a trans- is by Li et al.(2018), which proposes a sequence- former encoder (Vaswani et al., 2017), which is ini- to-sequence model getting a sequence of GloVe tialized with the publicly available BERTLARGE (Pennington et al., 2014) word embeddings as input model.

Text Segmentation by Cross Segment Attention

Sentence Boundary Detection for Handwritten Text Recognition Matthias Zimmermann

Multiple Segmentations of Thai Sentences for Neural Machine Translation

A Clustering-Based Algorithm for Automatic Document Separation

An Incremental Text Segmentation by Clustering Cohesion

A Text Denormalization Algorithm Producing Training Data for Text Segmentation

Topic Segmentation: Algorithms and Applications

A Generic Neural Text Segmentation Model with Pointer Network

Text Segmentation Techniques: a Critical Review

Text Segmentation Based on Semantic Word Embeddings

Steps Involved in Text Recognition and Recent Research in OCR; a Study

Word Sense Disambiguation and Text Segmentation Based on Lexical

Multiple Segmentations of Thai Sentences for Neural Machine