TWAG: A Topic-guided Wikipedia Abstract Generator

Fangwei Zhu1,2, Shangqing Tu3, Jiaxin Shi1,2, Juanzi Li1,2, Lei Hou1,2∗ and Tong Cui4
1Dept. of Computer Sci. & Tech., BNRist, Tsinghua University, Beijing 100084, China
2KIRC, Institute for Artificial Intelligence, Tsinghua University
3School of Computer Science and Engineering, Beihang University
4Noah’s Ark Lab, Huawei Inc.
{zfw19@mails.,shijx16@mails.,lijuanzi@,houlei@}tsinghua.edu.cn
[email protected], [email protected]
∗ Corresponding Author

Abstract

Wikipedia abstract generation aims to distill a Wikipedia abstract from web sources and has met significant success by adopting multi-document summarization techniques. However, previous works generally view the abstract as plain text, ignoring the fact that it is a description of a certain entity and can be decomposed into different topics. In this paper, we propose a two-stage model TWAG that guides the abstract generation with topical information. First, we detect the topic of each input paragraph with a classifier trained on existing Wikipedia articles to divide input documents into different topics. Then, we predict the topic distribution of each abstract sentence, and decode the sentence from topic-aware representations with a Pointer-Generator network. We evaluate our model on the WikiCatSum dataset, and the results show that TWAG outperforms various existing baselines and is capable of generating comprehensive abstracts. Our code and dataset can be accessed at https://github.com/THU-KEG/TWAG

1 Introduction

Wikipedia, one of the most popular crowd-sourced online knowledge bases, has been widely used as a valuable resource in natural language processing tasks such as knowledge acquisition (Lehmann et al., 2015) and question answering (Hewlett et al., 2016; Rajpurkar et al., 2016) due to its high quality and wide coverage. Within a Wikipedia article, the abstract is the overview of the whole content, and thus becomes the most frequently used part in various tasks. However, the abstract is often contributed by experts, which is labor-intensive and prone to be incomplete.

In this paper, we aim to automatically generate Wikipedia abstracts based on related documents collected from referred websites or search engines, which is essentially a multi-document summarization problem. This problem has been studied in both extractive and abstractive manners.

Extractive models attempt to select relevant textual units from input documents and combine them into a summary. Graph-based representations are widely exploited to capture the most salient textual units and enhance the quality of the final summary (Erkan and Radev, 2004; Mihalcea and Tarau, 2004; Wan, 2008). Recently, neural extractive models have also emerged (Yasunaga et al., 2017; Yin et al., 2019), utilizing the graph convolutional network (Kipf and Welling, 2017) to better capture inter-document relations. However, these models are not suitable for Wikipedia abstract generation: the input documents collected from various sources are often noisy and lack intrinsic relations (Sauper and Barzilay, 2009), which makes the relation graph hard to build.

Abstractive models aim to distill an informative and coherent summary via sentence fusion and paraphrasing (Filippova and Strube, 2008; Banerjee et al., 2015; Bing et al., 2015), but have achieved little success due to the limited scale of datasets. Liu et al. (2018) propose an extractive-then-abstractive model and contribute WikiSum, a large-scale dataset for Wikipedia abstract generation, inspiring a branch of further studies (Perez-Beltrachini et al., 2019; Liu and Lapata, 2019; Li et al., 2020).

The above models generally view the abstract as plain text, ignoring the fact that Wikipedia abstracts describe certain entities, and that the structure of Wikipedia articles could help generate comprehensive abstracts. We observe that humans tend to describe entities in a certain domain from several topics when writing Wikipedia abstracts. As illustrated in Figure 1, the abstract of the Arctic Fox covers its adaptations, biological taxonomy and geographical distribution, which is consistent with the content table. Therefore, given an entity in a specific domain, generating abstracts from corresponding topics would reduce redundancy and produce a more complete summary.

Figure 1: An example of the Wikipedia article Arctic Fox. The abstract contains three orthogonal topics about an animal: Description, Taxonomy and Distribution. The right half is part of the article's content table, showing section labels related to different topics.

In this paper, we try to utilize the topical information of entities within their domain (Wikipedia categories) to improve the quality of the generated abstract. We propose a novel two-stage Topic-guided Wikipedia Abstract Generation model (TWAG). TWAG first divides input documents by paragraph and assigns a topic to each paragraph with a classifier-based topic detector. Then, it generates the abstract in a sentence-wise manner, i.e., it predicts the topic distribution of each abstract sentence to determine its topic-aware representation, and decodes the sentence with a Pointer-Generator network (See et al., 2017). We evaluate TWAG on the WikiCatSum (Perez-Beltrachini et al., 2019) dataset, a subset of WikiSum containing three distinct domains. Experimental results show that it significantly improves the quality of abstracts compared with several strong baselines.

In conclusion, the contributions of our work are as follows:

• We propose TWAG, a two-stage neural abstractive Wikipedia abstract generation model utilizing the topic information in Wikipedia, which is capable of generating comprehensive abstracts.

• We simulate the way humans recognize entities, using a classifier to divide input documents into topics, and then perform topic-aware abstract generation upon the predicted topic distribution of each abstract sentence.

• Our experiment results against 4 distinct baselines prove the effectiveness of TWAG.

2 Related Work

2.1 Multi-document Summarization

Multi-document summarization is a classic and challenging problem in natural language processing, which aims to distill an informative and coherent summary from a set of input documents. Compared with single-document summarization, the input documents may contain redundant or even contradictory information (Radev, 2000).

Early high-quality multi-document summarization datasets were annotated by humans, e.g., the datasets for the Document Understanding Conference (DUC) and the Text Analysis Conference (TAC). These datasets are too small to build neural models on, and most of the early works take an extractive approach, attempting to build graphs with inter-paragraph relations and choose the most salient textual units. The graph can be built with various kinds of information, e.g., TF-IDF similarity (Erkan and Radev, 2004), discourse relations (Mihalcea and Tarau, 2004), document-sentence two-layer relations (Wan, 2008), multi-modality (Wan and Xiao, 2009) and query information (Cai and Li, 2012). Recently, there have been attempts to incorporate neural models, e.g., Yasunaga et al. (2017) build a discourse graph and represent textual units with a graph convolutional network (GCN) (Kipf and Welling, 2017), and Yin et al. (2019) adopt an entity linking technique to capture global dependencies between sentences and rank the sentences with a neural graph-based model.

In contrast, early abstractive models using sentence fusion and paraphrasing (Filippova and Strube, 2008; Banerjee et al., 2015; Bing et al., 2015) achieved less success. Inspired by the recent success of single-document abstractive models (See et al., 2017; Paulus et al., 2018; Gehrmann et al., 2018; Huang et al., 2020), some works (Liu et al., 2018; Zhang et al., 2018) try to transfer single-document models to multi-document settings to alleviate the limitations of small-scale datasets. Specifically, Liu et al. (2018) define the Wikipedia generation problem and contribute the large-scale WikiSum dataset. Fabbri et al. (2019) construct a middle-scale dataset named Multi-News and propose an extractive-then-abstractive model by appending a sequence-to-sequence model after the extractive step. Li et al. (2020) model inter-document relations with explicit graph representations, and incorporate pre-trained language models to better handle long input documents.

2.2 Wikipedia-related Text Generation

Sauper and Barzilay (2009) is the first work focusing on Wikipedia generation, which uses Integer Linear Programming (ILP) to select useful sentences for Wikipedia abstracts. Banerjee and Mitra (2016) further evaluate the coherence of the selected sentences to improve linguistic quality.

Liu et al. (2018) propose a two-stage extractive-then-abstractive model, which first picks paragraphs from web sources according to TF-IDF weights, then generates the summary with a Transformer model by viewing the input as a long flat sequence. Inspired by this work, Perez-Beltrachini et al. (2019) use a convolutional encoder and a hierarchical decoder, and utilize the Latent Dirichlet Allocation (LDA) model to render the decoder topic-aware. HierSumm (Liu and Lapata, 2019) adopts a learning-based model for the extractive stage, and computes the attention between paragraphs to model the dependencies across multiple paragraphs. However, these works view Wikipedia abstracts as plain text and do not explore the underlying topical information in Wikipedia articles.

There are also works that focus on generating other aspects of Wikipedia text. Biadsy et al. (2008) utilize the key-value pairs in Wikipedia infoboxes to generate high-quality biographies. Hayashi et al. (2021) investigate the structure of Wikipedia and build an aspect-based summarization dataset by manually labeling aspects and identifying the aspect of input paragraphs with a fine-tuned RoBERTa model (Liu et al., 2019). Our model also utilizes the structure of Wikipedia, but we generate the compact abstract rather than individual aspects, which requires the fusion of aspects and poses a greater challenge of understanding the connections and differences among topics.

3 Problem Definition

Definition 1 Wikipedia abstract generation accepts a set of paragraphs D = {d_1, d_2, ..., d_n} of size n as input,¹ and outputs a Wikipedia abstract S = (s_1, s_2, ..., s_m) with m sentences. The goal is to find an optimal abstract S* that best concludes the input, i.e.,

S* = arg max_S P(S | D).   (1)

¹The input documents can be represented by textual units of different granularity, and we choose the paragraph as it normally expresses relatively complete and compact semantics.

Previous works generally view S as plain text, ignoring the semantics in Wikipedia articles. Before introducing our idea, let us review how Wikipedia organizes articles.

Wikipedia employs a hierarchical open category system to organize millions of articles, and we name the top-level category the domain. For a Wikipedia article, we are concerned with three parts, i.e., the abstract, the content table, and the textual contents. Note that the content table is composed of several section labels {l}, paired with corresponding textual contents {p}. As illustrated in Figure 1, the content table indicates different aspects (we call them topics) of the article, and the abstract semantically corresponds to these topics, telling us that topics could benefit abstract generation.

However, general domains like Person or Animal consist of millions of articles with diverse content tables, making it infeasible to simply treat section labels as topics. Considering that articles in specific domains often share several salient topics, we manually merge similar section labels to convert the section titles into a set of topics. Formally, the topic set is denoted as T = {T_1, T_2, ..., T_{n_t}} of size n_t, where each topic T_i = {l_i^1, l_i^2, ..., l_i^m} is a group of section labels.

Now, our task can be expressed with a topical objective, i.e.,

Definition 2 Given the input paragraphs D, we introduce the latent topics Z = {z_1, z_2, ..., z_n}, where z_i ∈ T is the topic of the i-th input paragraph d_i, and our objective of Wikipedia abstract generation is re-written as

S* = arg max_Z P(Z | D) arg max_S P(S | D, Z).   (2)

Therefore, the abstract generation can be completed with two sub-tasks, i.e., topic detection to optimize arg max_Z P(Z | D), and topic-aware abstract generation to optimize arg max_S P(S | D, Z).

4 The Proposed Method

As shown in Figure 2, our proposed TWAG adopts a two-stage structure. First, we train a topic detector based on existing Wikipedia articles to predict the topic of input paragraphs. Second, we group the input paragraphs by detected topics to encode them separately, and generate the abstract in a sentence-wise manner. In each step, we predict the topic distribution of the current sentence, fuse it with the global hidden state to get the topic-aware representation, and generate the sentence with a copy-based decoder. Next, we will detail each module.

Figure 2: The TWAG framework. We use an example domain with 3 topics for illustration. The left half is the topic detector, which attempts to find a topic for each input paragraph, and the right half is the topic-aware abstract generator, which generates the abstract sentence by sentence based on the input paragraphs and their predicted topics.

4.1 Topic Detection

The topic detector aims to annotate input paragraphs with their optimal corresponding topics. Formally, given the input paragraphs D, the detector Det returns their corresponding topics Z = {z_1, z_2, ..., z_n}, i.e.,

Z = Det(D).   (3)

We view topic detection as a classification problem. For each paragraph d ∈ D, we encode it with ALBERT (Lan et al., 2019) and then predict its topic z with a fully-connected layer, i.e.,

d = ALBERT(d),   (4)
z = arg max(linear(d)),   (5)

where d is the vector representation of d, and the ALBERT model is fine-tuned from a pretrained version.

4.2 Topic-aware Abstract Generation

The topic-aware abstract generator utilizes the input paragraphs D and the detected topics Z to generate the abstract. It consists of a topic encoder that encodes the paragraphs grouped under each topic, a topic predictor that predicts the topic distribution of abstract sentences and generates the topic-aware sentence representation, and a sentence decoder that generates abstract sentences based on the topic-aware representations.

4.2.1 Topic Encoder

Given the input paragraphs D and the detected topics Z, we concatenate all paragraphs belonging to the same topic T_k to form a topic-specific text group (TTG) G_k, which contains salient information about a certain topic of an entity:

G_k = concat({d_i | z_i = T_k}).   (6)

To further capture hidden semantics, we use a bidirectional GRU to encode the TTGs:

g_k, U_k = BiGRU(G_k),   (7)

where g_k is the final hidden state of G_k, and U_k = (u_1, u_2, ..., u_{n_{G_k}}) represents the hidden states of the tokens in G_k, with n_{G_k} denoting the number of tokens in G_k.
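To make these two steps concrete, below is a minimal sketch of a paragraph-level topic classifier and of the grouping into TTGs, assuming the Hugging Face transformers implementation of ALBERT (albert-base-v2, as used in Section 5.1). The class and function names, the use of the pooled [CLS] vector as the paragraph representation d, and the truncation length are illustrative choices for this sketch, not details taken from the released code.

```python
import torch
from torch import nn
from transformers import AlbertModel, AlbertTokenizer

class TopicDetector(nn.Module):
    """Paragraph-level topic classifier: ALBERT encoding + one linear layer (Eqs. 4-5)."""
    def __init__(self, num_topics, pretrained="albert-base-v2"):
        super().__init__()
        self.encoder = AlbertModel.from_pretrained(pretrained)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_topics)

    def forward(self, input_ids, attention_mask):
        # Use the pooled [CLS] representation as the paragraph vector d (an assumption).
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        d = out.pooler_output
        return self.classifier(d)          # unnormalized scores over topics

def detect_and_group(paragraphs, detector, tokenizer, num_topics, device="cpu"):
    """Assign each paragraph a topic by argmax and concatenate paragraphs of the
    same topic into a topic-specific text group (TTG), as in Eq. 6."""
    groups = {k: [] for k in range(num_topics)}
    detector.eval()
    with torch.no_grad():
        for p in paragraphs:
            enc = tokenizer(p, truncation=True, max_length=512, return_tensors="pt").to(device)
            logits = detector(enc["input_ids"], enc["attention_mask"])
            z = int(logits.argmax(dim=-1))
            groups[z].append(p)
    return {k: " ".join(ps) for k, ps in groups.items() if ps}

# usage (downloads pretrained weights on first call):
# tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
# detector  = TopicDetector(num_topics=5)
# ttgs      = detect_and_group(paragraphs, detector, tokenizer, num_topics=5)
```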

4.2.2 Topic Predictor

After encoding the topics into hidden states, TWAG tackles the decoding process in a sentence-wise manner:

arg max_S P(S | D, Z) = arg max_S ∏_{i=1}^{m} P(s_i | D, Z, s_{<i}).   (8)

At step t, a GRU updates the global hidden state h_t from the topical information e_{t-1} of the previous sentence, and the topic distribution q_t of the current sentence is then predicted with a fully-connected layer followed by a softmax:

h_t = GRU(h_{t-1}, e_{t-1}),   (9)
q_t = softmax(linear(h_t)),   (10)

where e_{t-1} denotes the topical information in the last step. e_0 is initialized as an all-zero vector, and e_t can be derived from q_t in two ways.

The first way, named hard topic, is to directly select the topic with the highest probability and take its corresponding representation, i.e.,

e_t^hard = g_{arg max_i(q_t^i)}.   (11)

The second way, named soft topic, is to view every sentence as a mixture of different topics and take the weighted sum over topic representations, i.e.,

e_t^soft = q_t · G,   (12)

where G = (g_1, g_2, ..., g_{n_t}) is the matrix of topic representations. With the observation that Wikipedia abstract sentences normally contain mixed topics, we choose the soft topic mechanism for our model (see Section 5.3 for details).

Finally, we compute the topic-aware hidden state r_t by adding up h_t and e_t, which serves as the initial hidden state of the sentence decoder:

r_t = h_t + e_t.   (13)

Additionally, a stop confirmation is executed at each time step:

p_stop = σ(linear(h_t)),   (14)

where σ represents the sigmoid function. If p_stop > 0.5, TWAG terminates the decoding process and no more abstract sentences are generated.
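The sketch below illustrates one prediction step of Eqs. 9-14 with the soft-topic mechanism. The module and variable names, the zero initialization of h_0, and the assumption that e_t and h_t share the TTG hidden size are made for illustration only; the released implementation may differ.

```python
import torch
from torch import nn

class TopicPredictor(nn.Module):
    """One sentence-level step: update the global state, predict a topic
    distribution, build the topic-aware representation r_t, and decide whether
    to stop (Eqs. 9-14, soft-topic variant)."""
    def __init__(self, hidden_size, num_topics):
        super().__init__()
        self.gru = nn.GRUCell(hidden_size, hidden_size)
        self.topic_proj = nn.Linear(hidden_size, num_topics)
        self.stop_proj = nn.Linear(hidden_size, 1)

    def forward(self, h_prev, e_prev, topic_matrix):
        # topic_matrix G: (num_topics, hidden_size), one row g_k per TTG.
        h_t = self.gru(e_prev, h_prev)                            # Eq. 9
        q_t = torch.softmax(self.topic_proj(h_t), dim=-1)         # Eq. 10
        e_t = q_t @ topic_matrix                                  # Eq. 12 (soft topic)
        r_t = h_t + e_t                                           # Eq. 13
        p_stop = torch.sigmoid(self.stop_proj(h_t)).squeeze(-1)   # Eq. 14
        return h_t, e_t, r_t, p_stop

# usage: e_0 is all zeros per the paper; zero h_0 is an assumption of this sketch
hidden, n_topics = 512, 4
predictor = TopicPredictor(hidden, n_topics)
G = torch.randn(n_topics, hidden)            # stand-in for the encoded TTG states g_k
h, e = torch.zeros(1, hidden), torch.zeros(1, hidden)
h, e, r, p_stop = predictor(h, e, G)
```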

4.2.3 Sentence Decoder

Our sentence decoder adopts the Pointer-Generator network (See et al., 2017), which picks tokens both from the input paragraphs and from the vocabulary.

To copy a token from the input paragraphs, the decoder requires the token-wise hidden states U = (u_1, u_2, ..., u_{n_u}) of all n_u input tokens, which are obtained by concatenating the token-wise hidden states of all TTGs, i.e.,

U = [U_1, U_2, ..., U_{n_t}].   (15)

At decoding step k, the attention distribution a_k over the input tokens is computed, as in the Pointer-Generator network, from the token states u_i and the decoder hidden state s_k with trainable parameters W_u, W_s and b_a (Eq. 16), where s_k denotes the decoder hidden state, with s_0 = r_t to incorporate the topic-aware representation.

To generate a token from the vocabulary, we first use the attention mechanism to calculate the weighted sum of encoder hidden states, known as the context vector:

c_k* = Σ_i a_k^i u_i,   (17)

which is further fed into a two-layer network to obtain the probability distribution over the vocabulary:

P_voc = softmax(linear(linear([s_k, c_k*]))).   (18)

To switch between the two mechanisms, p_gen is computed from the context vector c_k*, the decoder hidden state s_k and the decoder input x_k:

p_gen = σ(W_c^T c_k* + W_s^T s_k + W_x^T x_k + b_p),   (19)

where σ represents the sigmoid function and W_c, W_s, W_x and b_p are trainable parameters. The final probability distribution over words is²

P(w) = p_gen P_voc(w) + (1 − p_gen) Σ_{i: w_i = w} a_k^i.   (20)

²w_i denotes the input token corresponding to u_i.
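Below is a compact, single-step sketch of the copy mechanism in Eqs. 17-20, following the standard Pointer-Generator formulation of See et al. (2017). The attention parameterization, tensor shapes and layer names are assumptions made for this example, not the exact layers of TWAG.

```python
import torch
from torch import nn

class CopyGenerator(nn.Module):
    """Single-step pointer-generator head: attend over the encoder states,
    produce a vocabulary distribution, and mix it with the copy distribution
    through the switch p_gen (Eqs. 17-20)."""
    def __init__(self, hidden_size, vocab_size, emb_size):
        super().__init__()
        self.attn_u = nn.Linear(hidden_size, hidden_size, bias=False)
        self.attn_s = nn.Linear(hidden_size, hidden_size)
        self.attn_v = nn.Linear(hidden_size, 1, bias=False)
        # two stacked linear layers, mirroring linear(linear([s_k, c_k*])) in Eq. 18
        self.out = nn.Sequential(nn.Linear(2 * hidden_size, hidden_size),
                                 nn.Linear(hidden_size, vocab_size))
        self.switch = nn.Linear(2 * hidden_size + emb_size, 1)

    def forward(self, s_k, x_k, U, src_ids):
        # s_k: (hidden,) decoder state; x_k: (emb,) decoder input embedding
        # U: (src_len, hidden) token-wise encoder states; src_ids: (src_len,) token ids
        # (assumes source tokens are inside the vocabulary; OOV extension omitted)
        scores = self.attn_v(torch.tanh(self.attn_u(U) + self.attn_s(s_k))).squeeze(-1)
        a_k = torch.softmax(scores, dim=-1)                                  # attention
        c_k = a_k @ U                                                        # Eq. 17
        p_vocab = torch.softmax(self.out(torch.cat([s_k, c_k], -1)), -1)     # Eq. 18
        p_gen = torch.sigmoid(self.switch(torch.cat([c_k, s_k, x_k], -1)))   # Eq. 19
        p_final = p_gen * p_vocab                                            # Eq. 20
        p_final = p_final.index_add(0, src_ids, (1 - p_gen) * a_k)
        return p_final, a_k
```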

4.3 Training

The modules for topic detection and abstract generation are trained separately.

4.3.1 Topic Detector Training

Since there are no public benchmarks for assigning input paragraphs to Wikipedia topics, we construct the dataset from existing Wikipedia articles. In each domain, we collect all the label-content pairs {(l, p)} (defined in Section 3), and split the content into paragraphs p = (d_1, d_2, ..., d_{n_p}) to form a set of label-paragraph pairs {(l, d)}. Afterwards, we choose all pairs (l, d) whose section label l belongs to a particular topic T ∈ T to complete the dataset construction, i.e., the topic-paragraph set {(T, d)}. Besides, a NOISE topic is set up in each domain, which refers to meaningless text like scripts and advertisements; the corresponding paragraphs are obtained by utilizing regular expressions to match obvious noisy texts. The details are reported in Appendix A.

Note that the dataset for abstract generation is collected from non-Wikipedia websites (refer to Section 5 for details). These two datasets are independent of each other, which prevents potential data leakage.

In the training step, we use the negative log-likelihood loss to optimize the topic detector.
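As a small illustration of how the (topic, paragraph) training pairs described here might be assembled, including the regex-matched NOISE examples of Appendix A, consider the sketch below. The partial label-to-topic map and the noise patterns are illustrative only; the full allocation used in the paper is given in Table 8.

```python
import re

# A slice of the merged section-label -> topic map for the Company domain (cf. Table 8).
LABEL_TO_TOPIC = {
    "History": "History", "Company history": "History", "Ownership": "History",
    "Products": "Product", "Services": "Product", "Products and services": "Product",
    "Fleet": "Location", "Operations": "Location", "Subsidiaries": "Location",
    "Awards": "Reception", "Controversies": "Reception", "Criticism": "Reception",
}

# Obvious web noise (cookie banners, markup, scripts) is routed to the NOISE topic;
# the exact patterns here are an assumption, not the paper's regular expressions.
NOISE_RE = re.compile(r"cookie|href|<\s*script", re.IGNORECASE)

def build_topic_paragraph_pairs(label_content_pairs):
    """label_content_pairs: iterable of (section_label, [paragraph, ...]) tuples."""
    examples = []
    for label, paragraphs in label_content_pairs:
        topic = LABEL_TO_TOPIC.get(label)
        for d in paragraphs:
            if NOISE_RE.search(d):
                examples.append(("NOISE", d))
            elif topic is not None:          # labels outside the topic set are discarded
                examples.append((topic, d))
    return examples
```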

4.3.2 Abstract Generator Training

The loss of the topic-aware abstract generation step consists of two parts: the first part is the average loss of the sentence decoder over the abstract sentences, L_sent, and the second part is the cross-entropy loss of the stop confirmation, L_stop.

Following See et al. (2017), we compute the loss of an abstract sentence by averaging the negative log-likelihood of every target word in that sentence, and obtain L_sent by averaging over all m sentences:

L_sent = - (1/m) Σ_{t=1}^{m} ( (1/n_{s_t}) Σ_{i=1}^{n_{s_t}} log P(w_i) ),   (21)

where n_{s_t} is the length of the t-th sentence of the abstract. As for L_stop, we adopt the cross-entropy loss, i.e.,

L_stop = -y_s log(p_stop) - (1 - y_s) log(1 - p_stop),   (22)

where y_s = 1 when t > m and y_s = 0 otherwise.
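A sketch of the two losses in Eqs. 21-22 is given below, assuming the per-token probabilities P(w_i) of the target words and the per-step stop probabilities have already been collected during decoding. Combining the two terms by a plain sum is an assumption of this sketch, since the paper does not state how they are weighted.

```python
import torch

def generation_loss(word_probs_per_sentence, p_stops, num_target_sentences):
    """word_probs_per_sentence: list of 1-D tensors, the model probabilities P(w_i)
    of every target token in each abstract sentence (Eq. 21).
    p_stops: 1-D tensor of stop probabilities, one per decoding step (Eq. 22)."""
    eps = 1e-12
    # L_sent: average over sentences of the average per-token negative log-likelihood.
    l_sent = torch.stack([-(p + eps).log().mean() for p in word_probs_per_sentence]).mean()
    # L_stop: binary cross-entropy against y_s, which is 1 only once all m sentences
    # have been produced (t > m) and 0 otherwise.
    steps = torch.arange(1, p_stops.numel() + 1, device=p_stops.device)
    y_s = (steps > num_target_sentences).float()
    l_stop = torch.nn.functional.binary_cross_entropy(p_stops.clamp(eps, 1 - eps), y_s)
    return l_sent + l_stop   # plain sum: an illustrative choice
```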

5 Experiments

5.1 Experimental Settings

Dataset. To evaluate the overall performance of our model, we use the WikiCatSum dataset proposed by Perez-Beltrachini et al. (2019), which contains three distinct domains (Company, Film and Animal) of Wikipedia. Each domain is split into train (90%), validation (5%) and test (5%) sets.

We build the dataset for training and evaluating the topic detector from the 2019-07-01 English Wikipedia full dump. For each record in the WikiCatSum dataset, we find the article with the same title in the Wikipedia dump, and pick all section label-content pairs {(l, p)} in that article. We remove all hyperlinks and graphics in the contents, split the contents into paragraphs with the spaCy library, and follow the steps in Section 4.3.1 to complete the dataset construction. Finally, we conduct an 8:1:1 split for train, validation and test.

Table 1 presents the detailed parameters of the used datasets.

Domain    #Examples   R1-r   R2-r   RL-r   #Topics   Train     Valid    Test
Company   62,545      .551   .217   .438   4         35,506    1,999    2,212
Film      59,973      .559   .243   .456   5         187,221   10,801   10,085
Animal    60,816      .541   .208   .455   4         51,009    2,897    2,876

Table 1: Details about the used datasets. The left half shows parameters of the WikiCatSum dataset: the number of examples and ROUGE 1, 2, L recalls. The right half shows parameters of the dataset for the topic detector: the number of topics and the number of topic-paragraph pairs in each split.

Evaluation Metrics. We evaluate the performance of our model with ROUGE scores (Lin, 2004), a common metric for comparing generated and reference summaries. Considering that we do not constrain the length of generated abstracts, we choose the ROUGE F1 score, which combines precision and recall to eliminate the tendency of favoring long or short results.
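As an example of how such scores can be computed, the snippet below uses the rouge-score package to report ROUGE-1/2/L F1. The paper does not name the exact ROUGE toolkit, so this particular implementation choice is an assumption.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def rouge_f1(reference: str, candidate: str) -> dict:
    """Return ROUGE-1/2/L F1 scores, the metric reported in Tables 2, 4 and 5."""
    scores = scorer.score(reference, candidate)
    return {name: s.fmeasure for name, s in scores.items()}

print(rouge_f1("the arctic fox is a small fox native to the arctic",
               "the arctic fox is a small fox of the arctic tundra"))
```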

Implementation Details. We use the open-source PyTorch and transformers libraries to implement our model. All models are trained on an NVIDIA GeForce RTX 2080.

In topic detection, we choose the top 20 frequent section labels in each domain and manually group them into different topics (refer to Appendix A for details). For training, we use the pretrained albert-base-v2 model in the transformers library, keep its default parameters and train the module for 4 epochs with a learning rate of 3e-5.

For abstract generation, we use a single-layer BiGRU network to encode the TTGs into hidden states of 512 dimensions. The first 400 tokens of the input paragraphs are retained and transformed into GloVe (Pennington et al., 2014) embeddings of 300 dimensions. The vocabulary size is 50,000, and out-of-vocabulary tokens are represented with the average embedding of their 10 adjacent tokens. This module is trained for 10 epochs; the learning rate is 1e-4 for the first epoch and 1e-5 for the rest.

Before evaluation, we remove sentences that have an overlap of over 50% with other sentences to reduce redundancy.

Baselines. We compare our proposed TWAG with the following strong baselines:

• TF-S2S (Liu et al., 2018) uses a Transformer decoder and compresses key-value pairs in self-attention with a convolutional layer.

• CV-S2D+T (Perez-Beltrachini et al., 2019) uses a convolutional encoder and a two-layer hierarchical decoder, and introduces LDA to model topical information.

• HierSumm (Liu and Lapata, 2019) utilizes the attention mechanism to model inter-paragraph relations and then enhances the document representation with graphs.

• BART (Lewis et al., 2020) is a pretrained sequence-to-sequence model that has achieved success on various sequence prediction tasks.

We fine-tune the pretrained BART-base model on our dataset, and set the beam size to 5 for all models using beam search at test time. The parameters we use for training and evaluation are identical to those in the corresponding papers.

5.2 Results and Analysis

Table 2 shows the ROUGE F1 scores of the different models. In all three domains, TWAG outperforms the other baselines. Our model surpasses the other models on ROUGE-1 score by a margin of about 10%, while still retaining an advantage on ROUGE-2 and ROUGE-L scores. In domain Company, our model boosts the ROUGE-L F1 score by about 30%; considering that the ROUGE-L score is computed upon the longest common subsequence, the highest ROUGE-L score indicates that abstracts generated by TWAG have the highest holistic quality.

             Company               Film                  Animal
Model        R1    R2    RL        R1    R2    RL        R1    R2    RL
TF-S2S       .197  .023  .125      .198  .065  .172      .252  .099  .210
CV-S2D+T     .275  .106  .214      .380  .212  .323      .427  .279  .379
HierSumm     .133  .028  .107      .246  .126  .185      .165  .069  .134
BART         .310  .116  .244      .375  .199  .325      .376  .226  .335
TWAG (ours)  .341  .119  .316      .408  .212  .343      .431  .244  .409

Table 2: ROUGE F1 scores of different models.

While CV-S2D+T and BART attain reasonable scores, TF-S2S and HierSumm do not reach the scores claimed in their papers. Noticing that the WikiCatSum dataset is a subset of WikiSum, which is the training dataset of these two models, we infer that TF-S2S and HierSumm require more training data to converge, and suffer from under-fitting due to the dataset scale. This phenomenon also shows that TWAG is data-efficient.

5.3 Ablation Study

Learning Rate of Topic Detector. We tried two learning rates when training the topic detector module. A learning rate of 1e-7 results in a precision of 0.922 in evaluation, while a learning rate of 3e-5 results in a precision of 0.778. However, choosing the former learning rate causes a drop of about 10% in all ROUGE scores, which is why we use the latter in our full model. We infer that human authors occasionally make mistakes, assigning paragraphs to section labels that belong to other topics. A topic detector with a low learning rate overfits these mistakes, harming the overall performance of our model.

Soft or Hard Topic. To further investigate the effectiveness of TWAG's soft topic mechanism, we compare the results of the soft and hard topic variants and report them in Table 4, from which we can see that the hard topic does quite poorly in this task.

           Hard Topic             Soft Topic
Domain     R1    R2    RL         R1    R2    RL
Company    .266  .074  .245       .341  .119  .316
Film       .355  .159  .333       .408  .212  .343
Animal     .407  .223  .387       .431  .244  .409

Table 4: ROUGE F1 scores of different topic selectors.

A possible reason is that some sentences in the standard abstract express more than one topic. Assigning one topic to each sentence results in semantic loss and thus harms the quality of the generated abstract, while the soft topic can better simulate the human writing style.

Number of Section Labels. The number of section labels n_t plays a key role in our model: a small n_t would not be informative enough to build topics, while a large one would induce noise. We can see from Figure 3 that the frequency of section labels is long-tailed, so retaining only a small portion is able to capture the major part of the information. Table 5 records the experiment results we conducted on domain Company. n_t = 20 reaches a peak on ROUGE 1, 2 and L scores, indicating that 20 is a reasonable number of section labels.

Figure 3: The frequency of section labels in the three domains. When ignoring section labels with extremely high or low frequency, the remaining section labels' frequency and rank generally form a straight line in log scale, which matches Zipf's law for long-tail distributions.

#Labels   R1     R2     RL
10        .337   .117   .312
20        .340   .118   .315
30        .336   .117   .311

Table 5: ROUGE F1 scores for different values of n_t.

5.4 Case Study

Table 3 shows the Wikipedia abstracts generated by the different models for the film Majina There. We can see that the gold abstract contains information about three topics: basic information (region, director, and producer), actors, and music.

Among the models, TF-S2S produces an abstract with a proper pattern but contains wrong information, and BART misses the musical information topic. CV-S2D+T, HierSumm, and our TWAG model each cover all three topics in the gold abstract; however, CV-S2D+T makes several factual errors, such as the release date and actors, and HierSumm suffers from redundancy. TWAG covers all three topics in the gold abstract and discovers extra facts, proving itself to be competent in generating comprehensive abstracts.

Gold Abstract: manjina there ( : hunki hunki ) is a 1980 indian kannada film, directed by nagesh and produced by k. r. narayana murthy. the film stars srinath, manjula, thoogudeepa srinivas and dinesh in lead roles. the film had musical score by upendra kumar.

TF-S2S: chalo ishq larain is a pakistani film directed by sajjad gul and was released across pakistan in april 2002 starring meera, ali haider and zara sheikh.

CV-S2D+T: hunki hunki ( kannada : hunki hunki ) is a 1967 indian kannada film, directed by b. mallesh and produced by b. s. ranga. the film stars , , narasimharaju and k. s. ashwath in lead roles. the film had musical score by chellapilla satyam. the film was a remake of hindi film hunki.

HierSumm: ( kannada : hunki hunki ) is a 1980 indian kannada film, directed by bangalore nagesh. the film stars srinath, manjulla, thoogudeepa shreenivaas. the film stars srinath, manjula, manjula, thoogudeepa shreenivaas in lead roles. the film had musical score by upendra kumar. the film is a remake of telugu movie aakashagiri rao. the movie was remade in telugu as manjina in 1986. the movie was remade in telugu as manjina there . . .

BART: manjina there is a 1980 kannada family drama film directed by bangalore nagesh starring srinath and manjula in the lead roles. it was released on 14 january 1980.

TWAG: manjina there is a 1980 kannada drama film directed by bangalore nagesh. the film stars srinath, , manjula and thoogudeepa srinivas in lead roles. the film had musical score by upendra kumar and the film opened to positive reviews in 1980. the film was a remake of tamil film hunki.

Table 3: Comparison of the Wikipedia abstracts generated by different models for the film Majina There. Non-English characters have been replaced with hunki for readability.

5.5 Human Evaluation

We follow the experimental setup of Perez-Beltrachini et al. (2019) and conduct a human evaluation consisting of two parts. A total of 45 examples (15 from each domain) are randomly selected from the test set for evaluation.

The first part is a question-answering (QA) scheme proposed in Clarke and Lapata (2010) in order to examine factoid information in summaries. We create 2-5 questions³ based on the gold summary which cover the appearing topics, and invite 3 participants to answer the questions taking automatically generated summaries as background information. The more questions a summary can answer, the better it is. To quantify the results, we assign a score of 1/0.5/0.1/0 to a correct answer, a partially correct answer, a wrong answer and a question that cannot be answered, respectively, and report the average score over all questions. Notice that we give a score of 0.1 even if the participants answer a question incorrectly, because a wrong answer indicates that the summary covers a certain topic, which is superior to missing the information. The results in Table 6 show that 1) summaries generated by TWAG allow participants to answer more questions and give the correct answer, and 2) TF-S2S and HierSumm perform poorly in domains Film and Animal, which is possibly a consequence of under-fitting on small datasets.

³Example questions are listed in Appendix C, and the whole evaluation set is included in our code repository.

             Company           Film              Animal
Model        Score   Non-0     Score   Non-0     Score   Non-0
TF-S2S       .075    .694      .000    .000      .000    .000
CV-S2D+T     .237    .660      .040    .143      .382    .576
HierSumm     .255    .896      .213    .327      .000    .000
BART         .591    .813      .452    .796      .342    .653
TWAG (ours)  .665    .903      .669    .918      .543    .868

Table 6: Human evaluation results in the QA scheme. Score represents the mean score and Non-0 represents the percentage of answered questions.

The second part is an evaluation of linguistic quality. We ask the participants to read the different generated summaries and score them from 3 perspectives on a scale of 1-5 (larger scores indicate higher quality): Completeness (does the summary contain sufficient information?), Fluency (is the summary fluent and grammatical?) and Succinctness (does the summary avoid redundant sentences?). Specifically, 3 participants are assigned to evaluate each model, and the average scores are taken as the final results. Table 7 presents the comparison results, from which we can see that the linguistic quality of the TWAG model outperforms the other baseline models, validating its effectiveness.

             Company             Film                Animal
Model        C     F     S       C     F     S       C     F     S
TF-S2S       2.69  2.71  2.67    1.93  2.71  2.84    2.22  2.96  2.76
CV-S2D+T     2.42  2.36  2.73    2.29  2.69  2.98    2.80  3.18  3.18
HierSumm     2.96  2.64  1.69    3.13  2.78  2.04    2.80  3.13  1.82
BART         2.64  2.82  3.00    2.87  3.02  3.24    2.78  3.11  3.00
TWAG (ours)  2.91  2.87  2.91    3.20  3.16  3.44    3.56  3.58  3.40

Table 7: Human evaluation results for linguistic quality scoring. C indicates Completeness, F indicates Fluency and S indicates Succinctness.

6 Conclusion

In this paper, we propose a novel topic-guided abstractive summarization model, TWAG, for generating Wikipedia abstracts. It exploits the section labels of Wikipedia, dividing the input documents into different topics to improve the quality of the generated abstract. This approach simulates the way humans recognize entities, and experimental results show that our model clearly outperforms existing state-of-the-art models that view Wikipedia abstracts as plain text. Our model also demonstrates high data efficiency. In the future, we will try to incorporate pretrained language models into the topic-aware abstract generator module, and apply the topic-aware model to other texts rich in topical information, such as sports match reports.

Acknowledgments

We thank the anonymous reviewers for their insightful comments. This work is supported by the National Key Research and Development Program of China (2017YFB1002101), NSFC Key Project (U1736204) and a grant from Huawei Inc.

Ethical Considerations

TWAG could be applied to applications like automatically writing new Wikipedia abstracts or other texts rich in topical information. It can also help human writers to examine whether they have missed information about certain important topics. The benefits of using our model include saving human writers' labor and making abstracts more comprehensive.

There are also important considerations when using our model. Input texts may violate copyrights when inadequately collected, and misleading texts may lead to factual mistakes in generated abstracts. To mitigate the risks, research on how to avoid copyright issues when collecting documents from the Internet would help.
References

Siddhartha Banerjee and Prasenjit Mitra. 2016. WikiWrite: Generating Wikipedia articles automatically. In Proceedings of the 25th International Joint Conference on Artificial Intelligence, pages 2740–2746.

Siddhartha Banerjee, Prasenjit Mitra, and Kazunari Sugiyama. 2015. Multi-document abstractive summarization using ILP based multi-sentence compression. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 1208–1214.

Fadi Biadsy, Julia Hirschberg, and Elena Filatova. 2008. An unsupervised approach to biography production using Wikipedia. In Proceedings of ACL-08: HLT, pages 807–815.

Lidong Bing, Piji Li, Yi Liao, Wai Lam, Weiwei Guo, and Rebecca J. Passonneau. 2015. Abstractive multi-document summarization via phrase selection and merging. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1587–1597.

Xiaoyan Cai and Wenjie Li. 2012. Mutually reinforced manifold-ranking based relevance propagation model for query-focused multi-document summarization. IEEE Transactions on Audio, Speech, and Language Processing, 20(5):1597–1607.

James Clarke and Mirella Lapata. 2010. Discourse constraints for document compression. Computational Linguistics, 36(3):411–441.

Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:457–479.

Alexander Richard Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir Radev. 2019. Multi-News: A large-scale multi-document summarization dataset and abstractive hierarchical model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1074–1084.

Katja Filippova and Michael Strube. 2008. Sentence fusion via dependency graph compression. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 177–185.

Sebastian Gehrmann, Yuntian Deng, and Alexander M. Rush. 2018. Bottom-up abstractive summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4098–4109.

Hiroaki Hayashi, Prashant Budania, Peng Wang, Chris Ackerson, Raj Neervannan, and Graham Neubig. 2021. WikiAsp: A dataset for multi-domain aspect-based summarization. Transactions of the Association for Computational Linguistics, 9:211–225.

Daniel Hewlett, Alexandre Lacoste, Llion Jones, Illia Polosukhin, Andrew Fandrianto, Jay Han, Matthew Kelcey, and David Berthelot. 2016. WikiReading: A novel large-scale language understanding task over Wikipedia. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1535–1545.

Luyang Huang, Lingfei Wu, and Lu Wang. 2020. Knowledge graph-augmented abstractive summarization with semantic-driven cloze reward. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5094–5107.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.

Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, et al. 2015. DBpedia: A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2):167–195.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880.

Wei Li, Xinyan Xiao, Jiachen Liu, Hua Wu, Haifeng Wang, and Junping Du. 2020. Leveraging graph to improve abstractive multi-document summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6232–6243.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.

Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. In International Conference on Learning Representations.

Yang Liu and Mirella Lapata. 2019. Hierarchical transformers for multi-document summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5070–5081.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404–411.

Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543.

Laura Perez-Beltrachini, Yang Liu, and Mirella Lapata. 2019. Generating summaries with topic templates and structured convolutional decoders. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5107–5116.

Dragomir Radev. 2000. A common theory of information fusion from multiple text sources step one: Cross-document structure. In 1st SIGdial Workshop on Discourse and Dialogue, pages 74–83.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.

Christina Sauper and Regina Barzilay. 2009. Automatically generating Wikipedia articles: A structure-aware approach. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 208–216.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1073–1083.

Xiaojun Wan. 2008. An exploration of document impact on graph-based multi-document summarization. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 755–762.

Xiaojun Wan and Jianguo Xiao. 2009. Graph-based multi-modality learning for topic-focused multi-document summarization. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 1586–1591.

Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu, Ayush Pareek, Krishnan Srinivasan, and Dragomir Radev. 2017. Graph-based neural multi-document summarization. In Proceedings of the 21st Conference on Computational Natural Language Learning, pages 452–462.

Yongjing Yin, Linfeng Song, Jinsong Su, Jiali Zeng, Chulun Zhou, and Jiebo Luo. 2019. Graph-based neural sentence ordering. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 5387–5393.

Jianmin Zhang, Jiwei Tan, and Xiaojun Wan. 2018. Towards a neural network approach to abstractive multi-document summarization. arXiv preprint arXiv:1804.09010.

A Topic Allocation

For each domain, we sort section labels by frequency and choose the top n_t = 20 frequent section labels, then manually allocate them to different topics. Section labels with little semantic information, like Reference and Notes, are discarded in the allocation to reduce noise. Table 8 shows how we allocate section labels to topics in the domains Company, Film and Animal.

An additional NOISE topic is added to each domain to detect website noise. We build training records for NOISE by finding noisy text in the training set of WikiCatSum with regular expressions. For example, we view all text containing "cookie" or "href", or text that appears to be a reference, as noise.

B Trivia about Baselines

We use BART-base as the baseline for comparison because BART-large performs poorly in our experiments. BART-large starts generating redundant results when using only 4% of the training data, and its training loss also decreases much more slowly than BART-base. We infer that BART-large may overfit the training data, and BART-base is therefore the more competent baseline.

C Human Evaluation Example

Table 9 shows an example of a gold summary, its corresponding question set and the system outputs. The full dataset we used for human evaluation can be found in our code repository.
Company
  History: History, Company history, Ownership
  Product: Products, Services, Destinations, Products and services, Technology
  Location: Fleet, Operations, Subsidiaries, Locations
  Reception: Awards, Controversies, Controversy, Criticism, Accidents and incidents, Reception

Film
  Cast: Cast, Casting
  Plot: Plot, Synopsis, Plot summary
  Production: Production, Filming, Development
  Reception: Reception, Critical reception, Critical response, Awards, Accolades, Awards and nominations
  Box office: Box office
  Distribution: Distribution

Animal
  Taxonomy: Taxonomy, Species, Subspecies, Classification
  Description: Description, Habitat, Ecology, Behaviour, Biology, Diet, Feeding, Breeding, Reproduction, Life cycle
  Conservation Status: Status, Conservation, Conservation status

Table 8: Detailed allocation of section labels. In domain Company, orange labels are the labels selected when n_t = 10, green labels are additional labels selected when n_t = 20, and blue labels are additional labels selected when n_t = 30.

Gold Abstract: l'avare is a 1980 french comedy film written and directed by louis de funès and jean girault, and starring de funès. the english title of the film is the miser. it is an adaptation of molière's famous comedy l'avare ou l'école du mensonge (the miser). de funès tried to draw out the unhappy side of the character. harpagon, unloved by humanity, is driven to an obsessive love of money.

Questions:
1. When and where was l'avare released?
2. Who stars in l'avare?
3. Who directed l'avare?
4. What is the English name of l'avare?

TF-S2S: der er et yndigt land is a 1983 danish drama film directed by morten arnfred. it was entered into the 33rd berlin international film festival, where it won an honourable mention.

CV-S2D+T: <unk> 's <unk> is a french comedy film from 1954, directed by jean girault, starring jean marais and louis de funès. it was screened in the un certain regard section at the 2015 cannes film festival.

HierSumm: ( hangul ; rr : l'am ) is a french drama film directed by louis de funès. it is based on a play by molière. it stars louis de funès. it was entered into the 36th berlin international film festival. the film was nominated for the golden globe award for best foreign language film. it was also nominated for the golden globe award for best foreign language film. . . .

BART: l'avare ( english : the miser ) is a 1980 french drama film directed by louis de funès and starring jean girault. it was based on an original screenplay co-written with julien françois.

TWAG: the miser ( french : l'avare ) is a 1980 french drama film directed by funès de funès. the film stars louis de funès, sanctioning cléante, and broach harpagon.

Table 9: Example of Gold summary, question set and system outputs for the QA evaluation study.