WikiGraphs: A Wikipedia Text - Knowledge Graph Paired Dataset

Luyu Wang* and Yujia Li* and Ozlem Aslan and Oriol Vinyals
*Equal contribution
DeepMind, London, UK
{luyuwang,yujiali,ozlema,vinyals}@google.com

Abstract

We present a new dataset of Wikipedia articles each paired with a knowledge graph, to facilitate research in conditional text generation, graph generation and graph representation learning. Existing graph-text paired datasets typically contain small graphs and short text (one or a few sentences), thus limiting the capabilities of the models that can be learned on the data. Our new dataset WikiGraphs is collected by pairing each Wikipedia article from the established WikiText-103 benchmark (Merity et al., 2016) with a subgraph from the Freebase knowledge graph (Bollacker et al., 2008). This makes it easy to benchmark against other state-of-the-art text generative models that are capable of generating long paragraphs of coherent text. Both the graphs and the text data are of significantly larger scale compared to prior graph-text paired datasets. We present baseline graph neural network and transformer model results on our dataset for 3 tasks: graph → text generation, graph → text retrieval and text → graph retrieval. We show that better conditioning on the graph provides gains in generation and retrieval quality but there is still large room for improvement.

[Figure 1: Illustration of a pair of Wikipedia article and the corresponding knowledge graph in our dataset. The graph side shows Freebase entities (e.g., ns/m.01vswwx) and relations (e.g., key/wikipedia.en, type.object.name, people.person.date_of_birth, music.composer.compositions) around Bono and The Edge; the text side shows the WikiText-103 article on "Where the Streets Have No Name".]

1 Introduction

Parallel datasets that pair data from different sources and modalities have enabled large amounts of research on cross modality learning. Paired image-caption datasets enable models to describe visual scenes in natural language (Lin et al., 2014; Vinyals et al., 2016), paired streams of speech and transcription data make it possible to train speech recognition systems (Garofolo et al., 1993; Panayotov et al., 2015) or text-to-speech synthesis models (Oord et al., 2016), and parallel corpora of text in different languages enable learned machine translation models (Barrault et al., 2020).

We present a new dataset of Wikipedia text articles each paired with a relevant knowledge graph (KG), which enables building models that can generate long text conditioned on a graph-structured overview of relevant topics, and also models that extract or generate graphs from a text description.[1]

[1] The data and the code to reproduce our baseline results are available at https://github.com/deepmind/deepmind-research/tree/master/wikigraphs

There have been many prior efforts to build datasets for learning graph → text generation models (Jin et al., 2020; Gardent et al., 2017; Lebret et al., 2016). However, existing graph-text paired datasets are mostly small scale: the graphs tend to have 10-20 or even fewer nodes, and the text typically contains only one or a few sentences. This is in significant contrast with state-of-the-art text generation models (Dai et al., 2019; Brown et al., 2020), which can already generate very fluent and long text that spans thousands of tokens over multiple paragraphs.

We attempt to bridge this gap, with the goal of advancing state-of-the-art graph → text generation models, graph representation learning models and also text-conditioned graph generative models. Each text document in our dataset is a full-length Wikipedia article, and we pair each of them with a KG that is significantly bigger than those in prior datasets of a similar nature and includes much richer information.

Hand labelling text articles with KGs is expensive and not scalable (Lebret et al., 2016), therefore we utilize an existing and established knowledge base, Freebase (Bollacker et al., 2008), and designed an automated process to extract a relevant subgraph from it for each Wikipedia article. To make the text generation results on our dataset directly comparable to the state-of-the-art, we chose the set of Wikipedia articles from the established language modeling benchmark WikiText-103 (Merity et al., 2016), which contains a subset of high-quality Wikipedia articles. This gives us a dataset of 23,522 graph-text pairs in total, covering 82.3% of WikiText-103 articles. On average each graph has 38.7 nodes and 48.3 edges, and each text article contains 3,533.8 tokens. In addition to structural information, our graphs also contain rich text information, with an average of 895.1 tokens in each graph. Furthermore, the automatic process we used to create this dataset can be extended to pair any Wikipedia document with Freebase, and can be scaled up to create over 3M graph-text pairs.

    Dataset    #examples   #triples   #tokens    #vocab
    WebNLG     13,036      2.54       15.26      1,484
    GenWiki    1.3M        1.95       21.46      476,341
    Ours       23,522      48.3       3,533.8    238,071

Table 1: Our dataset contains significantly larger graphs (average #triples per graph) and longer text (average #tokens per text) than previous KG-text datasets.

Out of the many exciting new tasks that this dataset enables, we present 3 possibilities: graph → text generation, graph → text retrieval, and text → graph retrieval. We benchmarked a few baseline models on these tasks. The models we considered are based on the recent Transformer-XL (Dai et al., 2019) model, which we adapted to condition the text generation on the KG in different ways. Our results show that better conditioning on the graph indeed improves the relevance of the generated text and the retrieval quality. However, there is still significant room for improvement on these tasks, which makes this an exciting dataset for research. Our data and code for baseline models will be made publicly available.
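As a purely illustrative framing of the two retrieval tasks (a minimal sketch, not the paper's baseline models; the encoder names gnn_encode and lm_encode are hypothetical), one can embed both modalities into a shared vector space and rank candidates by cosine similarity:

    import numpy as np

    def retrieve(query_vec, candidate_vecs):
        """Rank candidates by cosine similarity to the query embedding."""
        q = query_vec / np.linalg.norm(query_vec)
        c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
        return np.argsort(-(c @ q))  # indices of best matches first

    # graph → text retrieval: embed the query graph (e.g., with a GNN) and rank
    # article embeddings; text → graph retrieval simply swaps the two roles.
    # Hypothetical encoders, standing in for learned models:
    #   graph_vec = gnn_encode(graph)
    #   ranking = retrieve(graph_vec, np.stack([lm_encode(t) for t in texts]))

    # Tiny demo with random vectors standing in for learned embeddings:
    rng = np.random.default_rng(0)
    texts = rng.normal(size=(5, 8))               # 5 candidate article embeddings
    graph = texts[2] + 0.1 * rng.normal(size=8)   # a graph embedded near article 2
    print(retrieve(graph, texts))                 # article 2 should rank first

Under this framing, retrieval in both directions reduces to nearest-neighbor search in the shared embedding space.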
2 Related work

Graph-text paired data. There has been a lot of prior work on creating graph-text paired datasets. Example applications include generating text summaries conditioned on Abstract Meaning Representation graphs (Liu et al., 2018), generating the abstract of a scientific article given a KG and title (Koncel-Kedziorski et al., 2019) and generating text from RDF triples (Gardent et al., 2017; Jin et al., 2020). In the following we mostly review related work on KG-text paired datasets.

Annotating KGs or text to create paired datasets is expensive, as good quality annotation requires annotators that understand the content and structure of the text and the corresponding KG (Jin et al., 2020). Therefore previous KG-text paired datasets that rely on human annotation have limited scale. Among these, Gardent et al. (2017) crowdsourced human annotators to verbalize RDF triples taken from DBpedia (Auer et al., 2007) into a few sentences (WebNLG), which introduced annotation errors that were fixed through several updates over the years. Parikh et al. (2020) paired Wikipedia tables with one-sentence texts created by annotators revising Wikipedia text.

Another line of research focuses on eliminating the need for human annotation by automatically matching KG-text pairs or generating KGs from text using existing tools. Lebret et al. (2016) automatically matched Wikipedia infoboxes of biographies with their first sentence. Koncel-Kedziorski et al. (2019) utilized an earlier information extraction system that extracts entities, co-references and relations from a given text to build KGs. The GenWiki dataset (Jin et al., 2020) is automatically constructed by querying KGs in DBpedia with the titles of Wikipedia articles, followed by filtering and entity annotation.

We construct our WikiGraphs dataset by extracting a subgraph from Freebase (Bollacker et al., 2008) for each Wikipedia article following a scalable automatic process. Compared to previous work, our WikiGraphs dataset contains significantly larger graphs and longer text (Table 1).

Models for graph-text paired data. Recent state-of-the-art language models are based on the Transformer architecture (Vaswani et al., 2017), which uses the self-attention mechanism. The Transformer-XL (Dai et al., 2019) model further introduces a segment-level recurrence with a novel positional encoding, resulting in impressive performance on long sequences by capturing dependencies beyond a fixed-length window.

Graph neural networks (GNNs) (Battaglia et al., 2018; Gilmer et al., 2017) learn representations for graph structured data through a message passing process. This class of models naturally exploits the graph structure.
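To make the message passing idea concrete, the following is a deliberately minimal sketch: each node sums transformed messages from its in-neighbors and applies a nonlinear update. Real GNN variants (Battaglia et al., 2018; Gilmer et al., 2017) additionally model edge features, gating and global state; the weight shapes and tanh update here are illustrative assumptions.

    import numpy as np

    def message_passing_step(node_states, edges, w_msg, w_upd):
        """One round of message passing over a directed graph.

        node_states: [num_nodes, d] array of node features.
        edges: list of (src, dst) node index pairs.
        """
        messages = np.zeros_like(node_states)
        for src, dst in edges:
            messages[dst] += node_states[src] @ w_msg  # message sent along the edge
        return np.tanh(node_states @ w_upd + messages)  # aggregate and update

    # Toy usage: 3 nodes with 4-d features, two directed edges.
    rng = np.random.default_rng(0)
    h = rng.normal(size=(3, 4))
    w_msg, w_upd = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
    for _ in range(2):  # stacking steps propagates information further
        h = message_passing_step(h, [(0, 1), (1, 2)], w_msg, w_upd)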
The articles appear as plain text without any markup tags. As will be described in Section 3.2, we try to pair each article with a subgraph from Freebase, centered at the entity node that has a Wikipedia link (a rough illustration of this matching step is sketched after the table below).

                         Train    Valid   Test    All
    Num. pairs           23,431   48      43      23,522
    % of WikiText-103    82.3%    80.0%   71.7%   82.3%
    Nodes per graph      38.7     35.4    40.6    38.7
    Edges per graph      48.3     42.8    49.5    48.3
    Avg.
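To illustrate that pairing step, here is a rough, self-contained sketch under stated assumptions: Freebase is taken to be a list of (subject, relation, object) triples, the key/wikipedia.en relation from Figure 1 is assumed to carry the article link, and the one-hop neighborhood and all function names are illustrative choices, not the released extraction pipeline.

    from collections import defaultdict

    def build_entity_index(triples):
        """Map Wikipedia titles to Freebase entity ids via key/wikipedia.en edges."""
        return {obj: subj for subj, rel, obj in triples if rel == "key/wikipedia.en"}

    def extract_subgraph(triples, center, max_hops=1):
        """Collect all triples reachable within max_hops of the center entity."""
        out_edges = defaultdict(list)
        for subj, rel, obj in triples:
            out_edges[subj].append((subj, rel, obj))
        frontier, seen, subgraph = {center}, {center}, []
        for _ in range(max_hops):
            next_frontier = set()
            for node in frontier:
                for subj, rel, obj in out_edges[node]:
                    subgraph.append((subj, rel, obj))
                    if obj not in seen:
                        seen.add(obj)
                        next_frontier.add(obj)
            frontier = next_frontier
        return subgraph

    def pair_article(title, text, triples, title_to_entity):
        """Pair one Wikipedia article with its Freebase subgraph, if a match exists."""
        center = title_to_entity.get(title)
        if center is None:
            return None  # unmatched articles are simply dropped
        return {"title": title, "text": text, "graph": extract_subgraph(triples, center)}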