arXiv:2012.14919v2 [cs.CL] 2 Jun 2021

WIKITABLET: A Large-Scale Data-to-Text Dataset for Generating Wikipedia Article Sections

Mingda Chen, Sam Wiseman, Kevin Gimpel
Toyota Technological Institute at Chicago, Chicago, IL, 60637, USA
fmchen,swiseman,[email protected]

Abstract

Datasets for data-to-text generation typically focus either on multi-domain, single-sentence generation or on single-domain, long-form generation. In this work, we cast generating Wikipedia sections as a data-to-text generation task and create a large-scale dataset, WIKITABLET, that pairs Wikipedia sections with their corresponding tabular data and various metadata. WIKITABLET contains millions of instances, covering a broad range of topics, as well as a variety of flavors of generation tasks with different levels of flexibility. We benchmark several training and decoding strategies on WIKITABLET. Our qualitative analysis shows that the best approaches can generate fluent and high quality texts but they struggle with coherence and factuality, showing the potential for our dataset to inspire future work on long-form generation.¹

¹ Code, data, and pretrained models are available at https://github.com/mingdachen/WikiTableT

1 Introduction

Data-to-text generation (Kukich, 1983; McKeown, 1992) is the task of generating text based on structured data. Most existing data-to-text datasets focus on single-sentence generation, such as WIKIBIO (Lebret et al., 2016), LogicNLG (Chen et al., 2020), and ToTTo (Parikh et al., 2020). Other datasets are relatively small-scale and focus on long-form text generation, such as ROTOWIRE (Wiseman et al., 2017) and MLB (Puduppully et al., 2019). In this work, we cast generating Wikipedia sections as a data-to-text generation task and build a large-scale dataset targeting multi-sentence data-to-text generation with a variety of domains and data sources.

To this end, we create a dataset that we call WIKITABLET (“Wikipedia Tables to Text”) that pairs Wikipedia sections with their corresponding tabular data and various metadata. The data resources we consider are relevant either to entire Wikipedia articles, such as Wikipedia infoboxes and Wikidata tables, or to particular sections. Data from the latter category is built automatically from either naturally-occurring hyperlinks or from named entity recognizers. This data construction approach allows us to collect large quantities of instances while still ensuring the coverage of the information in the table. We also perform various types of filtering to ensure dataset quality.

WIKITABLET contains millions of instances covering a broad range of topics and a variety of flavors of generation with different levels of flexibility. Figure 1 shows two examples from WIKITABLET. The first instance has more flexibility as it involves generating a fictional character biography in a comic book, whereas the second is more similar to standard data-to-text generation tasks, where the input tables contain all of the necessary information for generating the text. While the open-ended instances in WIKITABLET are to some extent similar to story generation (Propp, 1968; McIntyre and Lapata, 2009; Fan et al., 2018), the fact that these instances are still constrained by the input tables enables different evaluation approaches and brings new challenges (i.e., being coherent and faithful to the input tables at the same time).

Because of the range of knowledge-backed generation instances in WIKITABLET, models trained on our dataset can be used in assistive writing technologies for a broad range of topics and types of knowledge. For example, technologies can aid students in essay writing by drawing from multiple kinds of factual sources. Moreover, WIKITABLET can be used as a pretraining dataset for other relatively small-scale data-to-text datasets (e.g., ROTOWIRE). A similar idea that uses data-to-text generation to create corpora for pretraining language models has shown promising results (Agarwal et al., 2021).

In experiments, we train several baseline models on WIKITABLET and empirically compare training and decoding strategies. We find that the best training strategies still rely on enforcing hard constraints to avoid overly repetitive texts. Human evaluations reveal that (1) humans are unable to differentiate the human written texts from the generations from our neural models; (2) while the annotations show that grammatical errors in the reference texts and the generations may prevent humans from fully understanding the texts, the best decoding strategy (i.e., beam search with n-gram blocking (Paulus et al., 2018)) does not have such a problem and shows the best performance on several aspects; (3) the degree of topical similarity between the generations and the reference texts depends on the open-endedness of the instances.

Our analysis shows that the generations are fluent and generally have high quality, but the models sometimes struggle to generate coherent texts for all the involved entities, suggesting future research directions. For example, when the instance has a high degree of flexibility, we find the models making mistakes about what a particular entity type is capable of. We also find errors in terms of the factuality of the generated text, both in terms of contradictions relative to the tables and commonsense violations.

2 Related Work

There have been efforts in creating data-to-text datasets from various resources, including sports summaries (Wiseman et al., 2017; Puduppully et al., 2019), weather forecasts (Liang et al., 2009), and commentaries (Chen and Mooney, 2008). Most of the recent datasets focus on generating single sentences given tables, such as WIKIBIO, ToTTo, LogicNLG, and WikiTableText (Bao et al., 2018), or other types of data formats, such as data triples (Vougiouklis et al., 2017; Gardent et al., 2017; Nan et al., 2021), abstract meaning representations (Flanigan et al., 2016), minimal recursion semantics (Hajdik et al., 2019), or a set of concepts (Lin et al., 2020). Other than single sentences, there have been efforts in generating groups of sentences describing humans and animals (Wang et al., 2018), and generating a post-modifier phrase for a target sentence given a sentence context (Kang et al., 2019). In this work, our focus is long-form text generation and we are interested in automatically creating a large-scale dataset containing multiple types of data-to-text instances. As shown in Table 1, WIKITABLET differs from these datasets in that it is larger in scale and contains multi-sentence texts. More details are in the next section.

Wikipedia has also been used to construct datasets for other text generation tasks, such as generating Wikipedia movie plots (Orbach and Goldberg, 2020; Rashkin et al., 2020) and short Wikipedia event summaries (Gholipour Ghalandari et al., 2020), and summarizing Wikipedia documents (Zopf, 2018; Liu* et al., 2018) or summaries of aspects of interests (Hayashi et al., 2020) from relevant documents.

As part of this work involves finding aligned tables and text, it is related to prior work on aligning Wikipedia texts to knowledge bases (Elsahar et al., 2018; Logan et al., 2019).

3 The WIKITABLET Dataset

The WIKITABLET dataset pairs Wikipedia sections² with their corresponding tabular data and various metadata; some of this data is relevant to entire Wikipedia articles (“article data”) or article structure (“title data”), while some is section-specific (“section data”). Each data table consists of a set of records, each of which is a tuple containing an attribute and a value.

The instances in WIKITABLET cover a range of flavors of language generation. Some have more flexibility, requiring models to generate coherent stories based on the entities and knowledge given in the tables. The first instance in Figure 1 is such an example. The text is from the Wikipedia article entitled “Wolfsbane (comics)” and resides within two nested sections: the higher-level section “Fictional character biography” and the lower-level section “Messiah Complex”. The task is challenging as models need to generate a coherent passage that can connect all the entities in the section data, and the story also needs to fit the background knowledge provided in the article data.

Other instances are more similar to standard data-to-text generation tasks, where the input tables contain all the necessary information for generating

² We define a Wikipedia section to be all text starting after a (sub)section heading and proceeding until the next (sub)section heading. We include Wikipedia sections at various nesting levels. For example, a top level section may start with a few paragraphs describing general information followed by two subsections with more specific information, in which case the example will be converted into three instances in our dataset.

Figure 1 (first example instance; truncated in this extract)

Section Data
Attribute                      Value
PERSON                         Reavers
GPE                            Muir Island
group of fictional characters  Purifiers (Marvel Comics)
DATE                           the 2007-2008

Article Data
Attribute                Value
birth name               Rahne Sinclair
instance of              superhero
member of                X-Men
from narrative universe  Marvel universe

Text: During the 2007–2008 "Messiah Complex" storyline, Rahne helps Rictor infiltrate the Purifiers; she fakes being shot by Rictor. She is also a member of the new X-Force. During a battle against Lady Deathstrike and the Reavers, Rahne learns that Father Craig was in league with the Purifiers, supposedly divulging enough information about her that the Purifiers can claim to "know her well." She travels with X-
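
As an illustration of the record structure described in Section 3 (each data table is a set of records, and each record is an attribute-value tuple), the first Figure 1 instance might be represented as in the following minimal sketch. The class name `Instance`, its field names, the title-data attribute names, and the helper `flatten_records` are all hypothetical choices made for this example; they are not the released data format or the input encoding used in the experiments. The attribute-value pairs are those recoverable from Figure 1.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A record is an (attribute, value) tuple, as described in Section 3.
Record = Tuple[str, str]


@dataclass
class Instance:
    """Hypothetical container for one instance (not the released format)."""
    title_data: List[Record] = field(default_factory=list)    # article structure
    article_data: List[Record] = field(default_factory=list)  # whole-article data
    section_data: List[Record] = field(default_factory=list)  # section-specific data


def flatten_records(records: List[Record]) -> str:
    """Join records into one string; an illustrative encoding only."""
    return " | ".join(f"{attribute}: {value}" for attribute, value in records)


# Values taken from the first example in Figure 1. The title-data attribute
# names ("article title", "section title") are assumptions for illustration.
example = Instance(
    title_data=[
        ("article title", "Wolfsbane (comics)"),
        ("section title", "Fictional character biography"),
        ("section title", "Messiah Complex"),
    ],
    article_data=[
        ("birth name", "Rahne Sinclair"),
        ("instance of", "superhero"),
        ("member of", "X-Men"),
        ("from narrative universe", "Marvel universe"),
    ],
    section_data=[
        ("PERSON", "Reavers"),
        ("GPE", "Muir Island"),
        ("group of fictional characters", "Purifiers (Marvel Comics)"),
        ("DATE", "the 2007-2008"),
    ],
)

if __name__ == "__main__":
    print(flatten_records(example.article_data))
    print(flatten_records(example.section_data))
```

Keeping the three tables as separate lists of tuples mirrors the distinction drawn in the text between article data, title data, and section data, and it allows repeated attributes (such as the two nested section titles) that a plain dictionary would not.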
