Generating Text from Dbpedia Data

The WebNLG Challenge: Generating Text from DBPedia Data

Emilie Colin¹, Claire Gardent¹, Yassine M’rabet², Shashi Narayan³, Laura Perez-Beltrachini¹

¹ CNRS/LORIA and Université de Lorraine, Nancy, France
{emilie.colin,claire.gardent,laura.perez}@loria.fr
² National Library of Medicine, Bethesda, USA
yassine.m’[email protected]
³ School of Informatics, University of Edinburgh, UK
[email protected]

Proceedings of The 9th International Natural Language Generation conference, pages 163–167, Edinburgh, UK, September 5-8 2016. © 2016 Association for Computational Linguistics

1 Introduction

With the emergence of the linked data initiative and the rapid development of RDF (Resource Description Framework) datasets, several approaches have recently been proposed for generating text from RDF data (Sun and Mellish, 2006; Duma and Klein, 2013; Bontcheva and Wilks, 2004; Cimiano et al., 2013; Lebret et al., 2016). To support the evaluation and comparison of such systems, we propose a shared task on generating text from DBPedia data. The training data will consist of Data/Text pairs where the data is a set of triples extracted from DBPedia and the text is a verbalisation of these triples. In essence, the task consists in mapping data to text. Specific subtasks include sentence segmentation (how to chunk the input data into sentences), lexicalisation (of the DBPedia properties), aggregation (how to avoid repetitions) and surface realisation (how to build a syntactically correct and natural sounding text).

2 Context and Motivation

DBPedia is a multilingual knowledge base that was built from various kinds of structured information contained in Wikipedia (Mendes et al., 2012). This data is stored as RDF triples of the form (SUBJECT, PROPERTY, OBJECT) where the subject is a URI (Uniform Resource Identifier), the property is a binary relation and the object is either a URI or a literal value such as a string, a date or a number. The English version of the DBPedia knowledge base currently encompasses 6.2M entities, 739 classes, 1,099 properties with reference values and 1,596 properties with typed literal values.¹ There are several motivations for generating text from DBPedia.

First, the RDF language in which DBPedia is encoded is widely used within the Linked Data framework. Many large scale datasets are encoded in this language (e.g., MusicBrainz², FOAF³, LinkedGeoData⁴) and official institutions⁵ increasingly publish their data in this format. Being able to generate good quality text from RDF data would permit, e.g., making this data more accessible to lay users, enriching existing text with information drawn from knowledge bases such as DBPedia, or describing, comparing and relating entities present in these knowledge bases.

Second, RDF data, and in particular DBPedia, provide a framework that is both limited and arbitrarily extensible from a linguistic point of view. In the simplest case, the goal would be to verbalise a single triple. In that case, the task mainly consists in finding an appropriate “lexicalisation” for the property. The complexity of the generation task can be closely monitored however by increasing the number of input triples, using input with different shapes⁶, working with different semantic domains and/or enriching the RDF graphs with additional (e.g., discourse) information. We plan to produce a dataset which varies along at least some of these dimensions so as to provide a benchmark for generation that will test systems on inputs of varying complexity.

Third, there has been much work recently on applying deep learning (in particular, sequence to sequence) models to generation. The training data used by these approaches however often have limited variability. For instance, the data of Wen et al. (2015) is restricted to restaurant descriptions and that of Lebret et al. (2016) to WikiData frames. Typically the number of attributes (properties) considered by these approaches is very low (between 15 and 40) and the text to be produced has a stereotyped structure (restaurant description, biographic abstracts). By providing a more varied dataset, the WebNLG data-text corpus will permit investigating how such deep learning models perform on more varied and more linguistically complex data.

¹ http://wiki.dbpedia.org/dbpedia-dataset-version-2015-10
² https://musicbrainz.org/
³ http://www.foaf-project.org/
⁴ http://linkedgeodata.org/
⁵ See http://museum-api.pbworks.com for examples.
⁶ DBPedia data forms a graph. Different graph shapes induce different verbalisation structures.

3 Task Description

In essence, the task consists in mapping data to text. Specific subtasks include sentence segmentation (how to chunk the input data into sentences), lexicalisation (of the DBPedia properties), aggregation (how to avoid repetitions) and surface realisation (how to build a syntactically correct and natural sounding text). The following example illustrates this.

(1) a. Data: (JOHN E BLAHA BIRTHDATE 1942 08 26)
             (JOHN E BLAHA BIRTHPLACE SAN ANTONIO)
             (JOHN E BLAHA OCCUPATION FIGHTER PILOT)
    b. Text: John E Blaha, born in San Antonio on 1942-08-26, worked as a fighter pilot

Given the input shown in (1a), generating (1b) involves lexicalising the OCCUPATION property as the phrase "worked as", using PP coordination ("born in San Antonio on 1942-08-26") to avoid repeating the word "born" (aggregation) and verbalising the 3 triples by a single complex sentence including an apposition, a PP coordination and a transitive verb construction (sentence segmentation and surface realisation).

Relation to Previous Shared Tasks. Other NLG shared task evaluation challenges have been organised in the past. These have focused on different generation subtasks overlapping with the task we propose, but our task differs from them in various ways.

KBGen generation challenge. The recent KBGen (Banik et al., 2013) task focused on sentence generation from Knowledge Bases (KB). In particular, the task was organised around the AURA (Gunning et al., 2010) KB on the biological domain, which models n-ary relations. The input data selection process targets the extraction of KB fragments which could be verbalised as a single sentence. The content selection approach was semi-automatic, starting with the manual selection of a set of KB fragments. Then, using patterns derived from those fragments, a new set of candidate KB fragments was generated, which was finally manually revised. The verbalisation of the sentence sized KB fragments was generated by human subjects.

Although our task also concerns text generation from KBs, the definition of the task is different. Our proposal aims at the generation of text beyond sentences and thus involves an additional subtask, namely sentence segmentation. The tasks also differ in the KBs used: we propose using DBPedia, which facilitates changing the domain by focusing on different categories. Moreover, the sets of relations in the two KBs pose different challenges for generation: while the AURA KB contains n-ary relations, DBPedia contains relation names that are challenging for the lexicalisation subtask. A last difference with our task is the content selection method. Our method is completely automatic and thus permits the inexpensive generation of a large benchmark. Moreover, it can be used to select content ranging from a single triple to several triples and with different shapes.

The Surface Realisation Shared Task (SR'11). The major goal of the SR'11 task (Belz et al., 2011) was to provide a common ground for the comparison of surface realisers on the task of regenerating sentences in a treebank. Two different tracks are considered with different input representations. The 'shallow' input provides a dependency tree of the sentence to be generated and the 'deep' input provides a graph representation where syntactic dependencies have been replaced by semantic roles and some function words have been removed.

The focus of the SR'11 task was on the linguistic realisation subtask and the broad coverage of linguistic phenomena. The task we propose here starts from non-linguistic KB data and puts forward other NLG subtasks.

Generating Referring Expressions (GRE). The GRE shared tasks pioneered the proposed NLG challenges. The first shared task focused only on the selection of distinguishing attributes (Belz and Gatt, 2007) while subsequent tasks have considered the referring expression realisation subtask, proposing a complete referring expression generation task (Gatt et al., 2008; Gatt et al., 2009). These tasks aimed at the unique identification of the referent and brevity of the referring expression. Slightly different, the GREC challenges (Belz et al., 2008; Belz et al., 2009; Belz et al., 2010) propose the generation of referring expressions in a discourse context. The GREC tasks use a corpus created from Wikipedia abstracts on geographic entities and people, with two referring expression annotation schemes: reference type and word strings.

[...]

ILP program is used to extract from DBPedia subtrees that maximise bigram probability. In effect, the extracted DBPedia trees are coherent entity descriptions in that the property bigrams they contain often co-occur in the DBPedia graphs associated with entities of a given DBPedia category. The method can be parameterised to produce content units for different DBPedia categories, different DBPedia entities and various numbers of DBPedia triples. It is fully automatic and permits producing DBPedia graphs that are coherent, diverse and bear on different domains (e.g., Astronauts, Universities, Musical work).

Text. To associate the DBPedia trees extracted in the first phase with text, we will combine automatic techniques with crowdsourcing in two ways.

First, we will lexicalise DBPedia properties by using the lexicalisations contained in the Lemon English Lexicon for DBPedia⁷ (Walter et al., 2013; Walter et al., 2014a; Walter et al., 2014b) and [...]
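
As an illustration of the data-to-text mapping in example (1) of the Task Description, the following is a minimal Python sketch, not the method proposed for the challenge. The `Triple` type, the property names (`birthDate`, `birthPlace`, `occupation`) and the hand-written template are assumptions made for this example only; the date is written as it appears in the target text (1b).

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One (SUBJECT, PROPERTY, OBJECT) triple; a hypothetical helper type."""
    subject: str
    prop: str
    obj: str


# Input (1a), with illustrative property names.
triples = [
    Triple("John E Blaha", "birthDate", "1942-08-26"),
    Triple("John E Blaha", "birthPlace", "San Antonio"),
    Triple("John E Blaha", "occupation", "fighter pilot"),
]


def verbalise(triples: List[Triple]) -> str:
    """Verbalise birth and occupation triples about one subject as one sentence."""
    subject = triples[0].subject
    values = {t.prop: t.obj for t in triples if t.subject == subject}

    # Aggregation: birth place and birth date share a single "born".
    birth = []
    if "birthPlace" in values:
        birth.append(f"in {values['birthPlace']}")
    if "birthDate" in values:
        birth.append(f"on {values['birthDate']}")

    # Sentence segmentation: all triples go into one sentence here.
    sentence = subject
    if birth:
        sentence += ", born " + " ".join(birth) + ","
    # Lexicalisation: the occupation property is rendered as "worked as a".
    if "occupation" in values:
        sentence += f" worked as a {values['occupation']}"
    return sentence


print(verbalise(triples))
```

A fixed template like this covers only the three properties it names. The shared task targets systems that handle many properties, varying numbers of triples and different graph shapes, which is where lexicalisation, aggregation and sentence segmentation become non-trivial.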
