arXiv:2004.14813v2 [cs.CL] 26 Oct 2020

ENT-DESC: Entity Description Generation by Exploring Knowledge Graph

Liying Cheng∗1,2, Dekun Wu†3, Lidong Bing2, Yan Zhang†1, Zhanming Jie†1, Wei Lu1, Luo Si2
1 Singapore University of Technology and Design
2 DAMO Academy, Alibaba Group
3 York University, Canada
fliying.cheng, l.bing, [email protected], [email protected], fyan zhang, zhanming [email protected], [email protected]

∗ Liying Cheng is under the Joint Ph.D. Program between Alibaba and Singapore University of Technology and Design.
† Dekun Wu was a visiting student at SUTD. Yan Zhang and Zhanming Jie were interns at Alibaba.
¹ Our code and data are available at https://github.com/LiyingCheng95/EntityDescriptionGeneration.

Abstract

Previous works on knowledge-to-text generation take as input a few RDF triples or key-value pairs conveying the knowledge of some entities to generate a natural language description. Existing datasets, such as WIKIBIO, WebNLG, and E2E, basically have a good alignment between an input triple/pair set and its output text. However, in practice, the input knowledge could be more than enough, since the output description may only cover the most significant knowledge. In this paper, we introduce a large-scale and challenging dataset to facilitate the study of such a practical scenario in KG-to-text. Our dataset involves retrieving abundant knowledge of various types of main entities from a large knowledge graph (KG), which makes the current graph-to-sequence models severely suffer from the problems of information loss and parameter explosion while generating the descriptions. We address these challenges by proposing a multi-graph structure that is able to represent the original graph information more comprehensively. Furthermore, we also incorporate aggregation methods that learn to extract the rich graph information. Extensive experiments demonstrate the effectiveness of our model architecture.¹

[Figure 1: An example showing our proposed task. Input entities: Bruno Mars (main entity) with topic-related entities retro style, funk, rhythm and blues, hip hop music, ... The knowledge graph panel shows nodes including funk, hip hop, US and Uptown Funk, with edge labels including genre, country of origin, country of citizenship and performer. Output description: "Peter Gene Hernandez (born October 8, 1985), known professionally as Bruno Mars, is an American singer, songwriter, multi-instrumentalist, record producer, and dancer. He is known for his stage performances, retro showmanship and for performing in a wide range of musical styles, including R&B, funk, pop, soul, reggae, hip hop, and rock."]

1 Introduction

KG-to-text generation, automatically converting knowledge into comprehensive natural language, is an important task in natural language processing (NLP) and user interaction studies (Damljanovic et al., 2010). Specifically, the task takes as input some structured knowledge, such as resource description framework (RDF) triples of WebNLG (Gardent et al., 2017), key-value pairs of WIKIBIO (Lebret et al., 2016) and E2E (Novikova et al., 2017), to generate natural text describing the input knowledge. In essence, the task can be formulated as follows: given a main entity, its one-hop attributes/relations (e.g., WIKIBIO and E2E), and/or multi-hop relations (e.g., WebNLG), the goal is to generate a text description of the main entity describing its attributes and relations. Note that these existing datasets basically have a good alignment between an input knowledge set and its output text. Obtaining such data with good alignment could be a laborious and expensive annotation process. More importantly, in practice, the knowledge regarding the main entity could be more than enough, and the description may only cover the most significant knowledge. Thereby, the generation model should have such differentiation capability.

In this paper, we tackle an entity description generation task by exploring KG in order to work towards more practical problems. Specifically, the aim is to generate a description with one or more sentences for a main entity and a few topic-related entities, which is empowered by the knowledge from a KG for a more natural description. In order to facilitate the study, we introduce a new dataset, namely entity-to-description (ENT-DESC) extracted from Wikipedia and Wikidata, which contains over 110k instances. Each sample is a triplet, containing a set of entities, the explored knowledge from a KG, and the description. Figure 1 shows an example to generate the description of the main entity, i.e., Bruno Mars, given some relevant keywords, i.e., retro style, funk, etc., which are called topic-related entities of Bruno Mars. We intend to generate the short paragraph below to describe the main entity in compliance with the topic revealed by topic-related entities. For generating accurate descriptions, one challenge is to extract the underlying relations between the main entity and keywords, as well as the peripheral information of the main entity. In our dataset, we use such knowledge revealed in a KG, i.e., the upper right in Figure 1 with partially labeled triples. Therefore, to some extent, our dataset is a generalization of existing KG-to-text datasets. The knowledge, in the form of triples, regarding the main entity and topic entities is automatically extracted from a KG, and such knowledge could be more than enough and not necessarily useful for generating the output.

Our dataset is not only more practical but also more challenging due to lack of explicit alignment between the input and the output. Therefore, some knowledge is useful for generation, while others might be noise. In such a case that many different relations from the KG are involved, standard graph-to-sequence models suffer from the problem of low training speed and parameter explosion, as edges are encoded in the form of parameters. Previous work deals with this problem by transforming the original graphs into Levi graphs (Beck et al., 2018). However, Levi graph transformation only explicitly represents the relations between an original node and its neighbor edges, while the relations between two original nodes are learned implicitly through graph convolutional networks (GCN). Therefore, more GCN layers are required to capture such information (Marcheggiani and Perez-Beltrachini, 2018). As more GCN layers are being stacked, it suffers from information loss from KG (Abu-El-Haija et al., 2018). In order to address these limitations, we present a multi-graph convolutional networks (MGCN) architecture by introducing multi-graph transformation incorporated with an aggregation layer. Multi-graph transformation is able to represent the original graph information more accurately, while the aggregation layer learns to extract useful information from the KG. Extensive experiments are conducted on both our dataset and benchmark dataset (i.e., WebNLG). MGCN outperforms several strong baselines, which demonstrates the effectiveness of our techniques, especially when using fewer GCN layers.

Our main contributions include:

• We construct a large-scale dataset ENT-DESC for a more practical task of entity description generation by exploring KG. To the best of our knowledge, ENT-DESC is the largest dataset of KG-to-text generation.

• We propose a multi-graph structure transformation approach that explicitly expresses a more comprehensive and more accurate graph information, in order to overcome limitations associated with Levi graphs.

• Experiments and analysis on our new dataset show that our proposed MGCN model incorporated with aggregation methods outperforms strong baselines by effectively capturing and aggregating multi-graph information.

2 Related Work

Dataset and Task. There is an increasing number of new datasets and tasks being proposed in recent years as more attention has been paid to data-to-text generation. Gardent et al. (2017) introduced the WebNLG challenge, which aimed to generate text from a small set of RDF knowledge triples (no more than 7) that are well-aligned with the text. To avoid the high cost of preparing such well-aligned data, researchers also studied how to leverage automatically obtained partially-aligned data in which some portion of the output text cannot be generated from the input triples (Fu et al., 2020b). Koncel-Kedziorski et al. (2019) introduced AGENDA dataset, which aimed to generate paper abstract from a title and a small KG built by information extraction system on the abstracts and has at most 7 relations. In our work, we directly create a knowledge graph for the main entities and topic-related entities from Wikidata without looking at the relations in our output. Scale-wise, our dataset consists of 110k instances while AGENDA is 40k. Lebret et al. (2016) introduced WIKIBIO dataset that generates the first sentence of biographical articles from the key-value pairs extracted from the article's infobox. Novikova et al. (2017) introduced E2E dataset in the restaurant domain, which aimed to generate restaurant recommendations given 3 to 8 slot-value pairs. These two datasets were only for a single domain, while ours focuses on multiple domains of over 100 categories, including people, event, location, organization, etc. Another difference is that we intend to generate the first paragraph of each Wikipedia article from a more complicated KG, but not key-value pairs. Another popular task is AMR-to-text generation (Konstas et al., 2017). The structure of AMR graphs is rooted and denser, which is quite different from the KG-to-text task.

                           WebNLG   AGENDA   E2E    ENT-DESC
# instances                43K      41K      51K    110K
Input vocab                4.4K     54K      120    420K
Output vocab               7.8K     78K      5.2K   248K
# distinct entities        3.1K     297K     77     691K
# distinct relations       358      7        8      957
Avg. # triples per input   3.0      4.4      5.6    27.4
Avg. # words per output    23.7     141.3    20.3   31.0

Table 1: Dataset statistics of WebNLG, AGENDA and
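As a reading aid for the Levi graph transformation discussed in the introduction, the following is a minimal sketch of the general idea, not the authors' implementation. It assumes the common construction in which every triple contributes one extra node carrying the relation label, so that edges become unlabeled. The function names `to_levi_graph` and `hops` and the three example triples are hypothetical and chosen only for illustration; the triples are not copied from the dataset.

```python
from collections import deque


def to_levi_graph(triples):
    """Turn (head, relation, tail) triples into a Levi-style graph.

    Each triple adds one node labeled with its relation, so edge labels
    become node labels and the remaining edges carry no label.
    """
    nodes = []       # node labels; the list index is the node id
    entity_id = {}   # entity label -> node id (entities are shared)
    edges = []       # directed, unlabeled (source id, target id) pairs

    def entity(label):
        if label not in entity_id:
            entity_id[label] = len(nodes)
            nodes.append(label)
        return entity_id[label]

    for head, relation, tail in triples:
        h, t = entity(head), entity(tail)
        r = len(nodes)           # fresh relation node for this triple
        nodes.append(relation)
        edges.append((h, r))
        edges.append((r, t))
    return nodes, edges


def hops(edges, source, target):
    """Shortest path length between two node ids, ignoring direction."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # target not reachable from source


# Illustrative triples only (an assumption, not dataset content).
triples = [
    ("Bruno Mars", "genre", "funk"),
    ("Uptown Funk", "performer", "Bruno Mars"),
    ("Bruno Mars", "country of citizenship", "US"),
]
nodes, edges = to_levi_graph(triples)
```

In this sketch an entity is one hop from its relation node but two hops from the entity at the other end of the triple (for example, `hops(edges, nodes.index("Bruno Mars"), nodes.index("funk"))` is 2). This is consistent with the observation in the text that entity-to-entity relations are only captured implicitly and need more than one graph convolution layer to propagate.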
