A Neural Model for Coherent Topic Segmentation and Classification

SECTOR: A Neural Model for Coherent Topic Segmentation and Classification Sebastian Arnold Philippe Cudre-Mauroux´ Felix A. Gers Rudolf Schneider University of Fribourg Alexander Loser¨ Beuth University of Applied Fribourg, Switzerland Beuth University of Applied Sciences Berlin, Germany [email protected] Sciences Berlin, Germany fsarnold, ruschneiderg@ fgers, aloeserg@ beuth-hochschule.de beuth-hochschule.de Abstract words or full sentences (Hirschberg and Manning, 2015). Recent neural architectures build upon pre- When searching for information, a human trained word or sentence embeddings (Mikolov reader first glances over a document, spots et al., 2013; Le and Mikolov, 2014), which focus relevant sections, and then focuses on a few on semantic relations that can be learned from sentences for resolving her intention. How- large sets of paradigmatic examples, even from ever, the high variance of document structure long ranges (Dieng et al., 2017). complicates the identification of the salient From a human perspective, however, it is topic of a given section at a glance. To mostly the authors themselves who help best to tackle this challenge, we present SECTOR, a understand a text. Especially in long documents, model to support machine reading systems by an author thoughtfully designs a readable structure segmenting documents into coherent sections and guides the reader through the text by arranging and assigning topic labels to each section. topics into coherent passages (Glavasˇ et al., 2016). Our deep neural network architecture learns a In many cases, this structure is not formally ex- latent topic embedding over the course of a pressed as section headings (e.g., in news articles, document. This can be leveraged to classify reviews, discussion forums) or it is structured local topics from plain text and segment a according to domain-specific aspects (e.g., health document at topic shifts. In addition, we reports, research papers, insurance documents). contribute WikiSection, a publicly available Ideally, systems for text analytics, such as topic data set with 242k labeled sections in English detection and tracking (TDT) (Allan, 2002), text and German from two distinct domains: dis- summarization (Huang et al., 2003), information eases and cities. From our extensive evaluation retrieval (IR) (Dias et al., 2007), or question an- of 20 architectures, we report a highest score swering (QA) (Cohen et al., 2018), could access of 71.6% F1 for the segmentation and a document representation that is aware of both classification of 30 topics from the English topical (i.e., latent semantic content) and struc- city domain, scored by our SECTOR long tural information (i.e., segmentation) in the text short-term memory model with Bloom filter (MacAvaney et al., 2018). The challenge in embeddings and bidirectional segmentation. building such a representation is to combine these This is a significant improvement of 29.5 two dimensions that are strongly interwoven in points F1 over state-of-the-art CNN classifiers the author’s mind. It is therefore important to with baseline segmentation. understand topic segmentation and classification as a mutual task that requires encoding both topic 1 Introduction information and document structure coherently. 1 Today’s systems for natural language understand- In this paper, we present SECTOR, an end-to-end ing are composed of building blocks that extract model that learns an embedding of latent topics semantic information from the text, such as named entities, relations, topics, or discourse structure. 1Our source code is available under the Apache License In traditional natural language processing (NLP), 2.0 at https://github.com/sebastianarnold/ these extractors are typically applied to bags of SECTOR. 169 Transactions of the Association for Computational Linguistics, vol. 7, pp. 169–184, 2019. Action Editor: Radu Florian. Submission batch: 11/2018; Revision batch: 1/2019; Published 4/2019. c 2019 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license. from potentially ambiguous headings and can be mentary: Diseases is a typical scientific domain applied to entire documents to predict local topics with low entropy (i.e., very narrow topics, precise on sentence level. Our model encodes topical in- language, and low word ambiguity). In contrast, formation on a vertical dimension and structural cities resembles a diversified domain, with high information on a horizontal dimension. We show entropy (i.e., broader topics, common language, that the resulting embedding can be leveraged in and higher word ambiguity) and will be more a downstream pipeline to segment a document applicable to for example, news, risk reports, or into coherent sections and classify the sections travel reviews. into one of up to 30 topic categories reaching We compare SECTOR to existing segmenta- 71.6% F1—or alternatively, attach up to 2.8k topic tion and classification methods based on latent labels with 71.1% mean average precision (MAP). Dirichlet allocation (LDA), paragraph embed- We further show that segmentation performance dings, convolutional neural networks (CNNs), and of our bidirectional long short-term memory recurrent neural networks (RNNs). We show that (LSTM) architecture is comparable to specialized SECTOR significantly improves these methods in state-of-the-art segmentation methods on various a combined task by up to 29.5 points F1 when real-world data sets. applied to plain text with no given segmentation. To the best of our knowledge, the combined task The rest of this paper is structured as follows: of segmentation and classification has not been We introduce related work in Section 2. Next, we approached on the full document level before. describe the task and data set creation process in There exist a large number of data sets for text Section 3. We formalize our model in Section 4. segmentation, but most of them do not reflect We report results and insights from the evaluation real-world topic drifts (Choi, 2000; Sehikh et al., in Section 5. Finally, we conclude in Section 6. 2017), do not include topic labels (Eisenstein and Barzilay, 2008; Jeong and Titov, 2010; Glavasˇ 2 Related Work et al., 2016), or are heavily normalized and too small to be used for training neural networks (Chen The analysis of emerging topics over the course et al., 2009). We can utilize a generic segmentation of a document is related to a large number of data set derived from Wikipedia that includes research areas. In particular, topic modeling (Blei headings (Koshorek et al., 2018), but there is also et al., 2003) and TDT (Jin et al., 1999) focus on a need in IR and QA for supervised structural representing and extracting the semantic topical con- topic labels (Agarwal and Yu, 2009; MacAvaney tent of text. Text segmentation (Beeferman et al. et al., 2018), different languages and more specific 1999) is used to split documents into smaller co- domains, such as clinical or biomedical research herent chunks. Finally, text classification (Joachims (Tepper et al., 2012; Tsatsaronis et al., 2012), 1998) is often applied to detect topics on text and news-based TDT (Kumaran and Allan, 2004; chunks. Our method unifies those strongly inter- Leetaru and Schrodt, 2013). woven tasks and is the first to evaluate the com- 2 Therefore we introduce WIKISECTION, a large bined topic segmentation and classification task novel data set of 38k articles from the English using a corresponding data set with long structured and German Wikipedia labeled with 242k sec- documents. tions, original headings, and normalized topic Topic modeling is commonly applied to entire labels for up to 30 topics from two domains: dis- documents using probabilistic models, such as eases and cities. We chose these subsets to cover LDA (Blei et al., 2003). AlSumait et al. (2008) both clinical/biomedical aspects (e.g., symptoms, introduced an online topic model that captures treatments, complications) and news-based topics emerging topics when new documents appear. (e.g., history, politics, economy, climate). Both Gabrilovich and Markovitch (2007) proposed article types are reasonably well-structured ac- the Explicit Semantic Analysis method in which cording to Wikipedia guidelines (Piccardi et al., concepts from Wikipedia articles are indexed and 2018), but we show that they are also comple- assigned to documents. Later, and to overcome the vocabulary mismatch problem, Cimiano et al. (2009) introduced a method for assigning latent 2The data set is available under the CC BY-SA 3.0 license at https://github.com/sebastianarnold/ concepts to documents. More recently, Liu et al. WikiSection. (2016) represented documents with vectors of 170 closely related domain keyphrases. Yeh et al. that includes section headings. The authors intro- (2016) proposed a conceptual dynamic LDA duced a neural architecture for segmentation that model for tracking topics in conversations. Bhatia is based on sentence embeddings and four layers et al. (2016) utilized Wikipedia document titles of bidirectional LSTM. Similar to Sehikh et al. to learn neural topic embeddings and assign doc- (2017), the authors used a binary segmentation ument labels. Dieng et al. (2017) focused on the objective on the sentence level, but trained on issue of long-range dependencies and proposed a entire documents. Our work takes up this idea of latent topic model based on RNNs. However, the end-to-end training and enriches the neural model authors did not apply the RNN to predict local with a layer of latent topic embeddings that can topics. be utilized for topic classification. Text segmentation has been approached Text classification is mostly applied at the with a wide variety of methods. Early unsuper- paragraph or sentence level using machine vised methods utilized lexical overlap statistics learning methods such as support vector machines (Hearst 1997; Choi 2000), dynamic programming (Joachims, 1998) or, more recently, shallow and (Utiyama and Isahara, 2001), Bayesian models deep neural networks (Le et al., 2018; Conneau (Eisenstein and Barzilay, 2008), or pointwise et al., 2017). Notably, paragraph vectors (Le and boundary sampling (Du et al., 2013) on raw terms.

Load more