Category-Driven Content Selection


Rania Mohamed Sayed, Université de Lorraine, Nancy (France), [email protected]
Laura Perez-Beltrachini, CNRS/LORIA, Nancy (France), [email protected]
Claire Gardent, CNRS/LORIA, Nancy (France), [email protected]

Abstract

In this paper, we introduce a content selection method where the communicative goal is to describe entities of different categories (e.g., astronauts, universities or monuments). We argue that this method provides an interesting basis both for generating descriptions of entities and for semi-automatically constructing a benchmark on which to train, test and compare data-to-text generation systems.

1 Introduction

With the development of the Linked Open Data framework (LOD, http://lod-cloud.net/), a considerable amount of RDF(S) data is now available on the Web. While this data contains a wide range of interesting factual and encyclopedic knowledge, the RDF(S) format in which it is encoded makes it difficult to access by lay users. Natural Language Generation (NLG) would provide a natural means of addressing this shortcoming. It would permit, for instance, enriching existing texts with encyclopaedic information drawn from linked data sources such as DBPedia, or automatically creating a Wikipedia stub for an instance of an ontology from the associated linked data. Conversely, because of its well-defined syntax and semantics, the RDF(S) format in which linked data is encoded provides a natural ground on which to develop, test and compare NLG systems.

In this paper, we focus on content selection from RDF data where the communicative goal is to describe entities of various categories (e.g., astronauts or monuments). We introduce a content selection method which, given an entity, retrieves from DBPedia an RDF subgraph that encodes relevant and coherent knowledge about this entity. Our approach differs from previous work in that it leverages the categorial information provided by large scale knowledge bases about entities of a given type. Using n-gram models of the RDF(S) properties occurring in the RDF(S) graphs associated with entities of the same category, we select, for a given entity of category C, a subgraph with maximal n-gram probability, that is, a subgraph which contains properties that are true of that entity, that are typical of that category and that support the generation of a coherent text.

2 Method

Given an entity e of category C and its associated DBPedia entity graph G_e, our task is to select a (target) subgraph T_e of G_e such that:

- T_e is relevant: the DBPedia properties contained in T_e are commonly (directly or indirectly) associated with entities of type C;
- T_e maximises global coherence: DBPedia entries that often co-occur in type C are selected together;
- T_e supports local coherence: the set of DBPedia triples contained in T_e captures a sequence of entity-based transitions which supports the generation of locally coherent texts, i.e., texts such that the propositions they contain are related through shared entities (see the sketch after this list).
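To make these entity-based transitions concrete, here is a minimal Python sketch (ours, not part of the paper) of the triple representation and of the shared-entity test that underlies local coherence; the names Triple and shares_entity are illustrative only:

```python
from typing import NamedTuple

class Triple(NamedTuple):
    """One DBPedia RDF triple (subject, property, object)."""
    s: str
    p: str
    o: str

def shares_entity(t1: Triple, t2: Triple) -> bool:
    """Entity-based transition: two triples are linked if the object of
    one is the subject of the other, or if they share the same subject
    (the condition used for the bigram variables in Section 2.3)."""
    return t1.o == t2.s or t2.o == t1.s or t1.s == t2.s

# Example: two triples about the same subject are linked.
t1 = Triple("Elliot_See", "birthPlace", "Dallas")
t2 = Triple("Elliot_See", "almaMater", "University_of_Texas_at_Austin")
assert shares_entity(t1, t2)
```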

To provide a content selection process which implements these constraints, we proceed in three main steps.

First, we build n-gram models of properties for DBPedia categories. That is, we define the probability of 1-, 2- and 3-grams of DBPedia properties for a given category. Second, we extract from DBPedia entity graphs of depth four. Third, we use the n-gram models of DBPedia properties and Integer Linear Programming (ILP) to identify subtrees of entity graphs with maximal probability. Intuitively, we select subtrees of the entity graph which are relevant (the properties they contain are frequent for that category), which are locally coherent (the tree constraints ensure that the selected triples are related by entity sharing) and which are globally coherent (the use of bi- and tri-gram probabilities supports the selection of properties that frequently co-occur in the graphs of entities of that category).

2.1 Building n-gram models of DBPedia properties

To build the n-gram models, we extract from DBPedia the graphs associated with all entities of those categories up to depth 4. Table 1 shows some statistics for these graphs. We build the n-gram models using the SRILM toolkit. To experiment with various versions of n-gram information, we create, for each category, 1-, 2- and 3-grams of DBPedia properties.

Table 1: Category Graphs

Category     Nb. Entities   Nb. Triples   Nb. Properties
Astronaut    110            1664033       4167
Monument     500            818145        6521
University   500            969541        7441

2.2 Building Entity Graphs

For each of the three categories, we then extract from DBPedia the graphs associated with 5 entities, considering RDF triples up to depth two. Table 2 shows the statistics for each entity depending on the depth of the graph. A sketch of the extraction and counting steps is given after Table 2.

Table 2: Entity Graphs

Category     Entity   Depth 1   Depth 2
Astronaut    e1       14        24
             e2       21        32
             e3       16        28
             e4       12        24
             e5       15        22
Monument     e1       13        18
             e2       20        21
             e3       7         14
             e4       6         14
             e5       4         11
University   e1       6         20
             e2       13        21
             e3       6         10
             e4       9         16
             e5       27        34
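The paper relies on the SRILM toolkit for the actual language models. Purely as a sketch of the data flow in Sections 2.1 and 2.2, the following code retrieves a depth-1 entity graph from the public DBPedia SPARQL endpoint (via the SPARQLWrapper library, our choice) and derives unigram property counts; crawling to depth 2 or 4 and building the 2- and 3-gram models are left implicit, and all function names are ours:

```python
from collections import Counter
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://dbpedia.org/sparql"

def entity_triples(entity_uri):
    """Return the depth-1 triples (s, p, o) whose subject is entity_uri."""
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(f"SELECT ?p ?o WHERE {{ <{entity_uri}> ?p ?o }}")
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return [(entity_uri, r["p"]["value"], r["o"]["value"]) for r in rows]

def property_unigram_counts(entity_uris):
    """count(p, C): how often each property occurs across the graphs of a
    category's entities. Dividing by the total yields P(p) of Section 2.3."""
    counts = Counter()
    for uri in entity_uris:
        counts.update(p for _, p, _ in entity_triples(uri))
    return counts

# Hypothetical usage for one Astronaut entity:
astronauts = ["http://dbpedia.org/resource/Elliot_See"]
counts = property_unigram_counts(astronauts)
total = sum(counts.values())
P = {p: c / total for p, c in counts.items()}  # unigram model P(p)
```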
2.3 Selecting DBPedia Subgraphs

To retrieve subtrees of DBPedia subgraphs which are maximally coherent, we use the following ILP model.

Representing tuples. Given an entity graph G_e for the DBPedia entity e of category C (e.g., Astronaut), for each triple t = (s, p, o) in G_e, we introduce a binary variable x^p_{s,o} such that:

x_t = x^p_{s,o} = \begin{cases} 1 & \text{if the tuple is preserved} \\ 0 & \text{otherwise} \end{cases}

Because we use 2- and 3-grams to capture global coherence (properties that often co-occur together), we also have variables for bigrams and trigrams of tuples. For bigrams, these variables capture triples which share an entity (either the object of one is the subject of the other, or they share the same subject). So for each bigram of triples t_1 = (s_1, p_1, o_1) and t_2 = (s_2, p_2, o_2) in G_e such that o_1 = s_2, o_2 = s_1 or s_1 = s_2, we introduce a binary variable y_{t_1,t_2} such that:

y_{t_1,t_2} = \begin{cases} 1 & \text{if the pair of triples is preserved} \\ 0 & \text{otherwise} \end{cases}

Similarly, there is a trigram binary variable z_{t_1,t_2,t_3} for each connected set of triples t_1, t_2, t_3 in G_e such that:

z_{t_1,t_2,t_3} = \begin{cases} 1 & \text{if the trigram of triples is preserved} \\ 0 & \text{otherwise} \end{cases}

Maximising Relevance and Coherence. To maximise relevance and coherence, we seek to find a subtree of the input graph G_e which maximises the following objective function:

S(X) = \sum_{x} x_t \cdot P(p) + \sum_{y} y_{t_i,t_j} \cdot B(t_i, t_j) + \sum_{z} z_{t_i,t_j,t_k} \cdot T(t_i, t_j, t_k)    (1)

where P(p), the unigram probability of p in entities of category C, is defined as follows. Let T_C be the set of triples occurring in the entity graphs (depth 2) of all DBPedia entities of category C, let P_C be the set of properties occurring in T_C, and let count(p, C) be the number of times p occurs in T_C; then:

P(p) = \frac{count(p, C)}{\sum_i count(p_i, C)}

Similarly, B(t_i, t_j) and T(t_i, t_j, t_k) are the 2- and 3-gram probabilities P(t_2 | t_1) and P(t_3 | t_1 t_2).

Consistency Constraints. We ensure consistency between the unary and the binary variables so that if a bigram is selected then so are the corresponding triples:

\forall i, j: \quad y_{i,j} \le x_i
\forall i, j: \quad y_{i,j} \le x_j
\forall i, j: \quad y_{i,j} + (1 - x_i) + (1 - x_j) \ge 1

Ensuring Local Coherence (Tree Shape). Solutions are constrained to be trees by requiring that each object has at most one subject (Eq. 2) and that all tuples are connected (Eq. 3):

\forall o \in X, \quad \sum_{s,p} x^p_{s,o} \le 1    (2)

\forall o \in X, \quad \sum_{s,p} x^p_{s,o} - \frac{1}{|X|} \sum_{u,p} x^p_{o,u} \ge 0    (3)

where X is the set of nodes that occur in the solution (except the root node). This constraint makes sure that if o has a child then it also has a head. The first part of Eq. 3 counts the number of head properties of o. The second part counts the children of o, which could number more than one; it is therefore normalised by |X| to make it less than 1, so that the difference is greater than or equal to 0 only if o also has a head.

Restricting the Size of the Resulting Tree. Solutions are constrained to contain α tuples:

\sum x^p_{s,o} = \alpha    (4)
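To make the ILP concrete, here is a rough sketch using the open-source PuLP library (our choice; the paper does not name a solver). For brevity it keeps only the unigram and bigram terms of Eq. 1 and omits the connectivity constraint of Eq. 3; the inputs P, B and alpha are assumed to be precomputed as described above, and all names are ours:

```python
from itertools import combinations
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary

def select_subtree(triples, P, B, alpha):
    """triples: list of (s, p, o) tuples; P: property -> unigram probability;
    B: (triple, triple) -> bigram probability; alpha: target size (Eq. 4)."""
    prob = LpProblem("content_selection", LpMaximize)

    # One binary variable x_t per triple ("Representing tuples").
    x = {t: LpVariable(f"x_{i}", cat=LpBinary) for i, t in enumerate(triples)}

    # Bigram variables y_{t1,t2} for pairs of triples that share an entity.
    def shares(t1, t2):
        return t1[2] == t2[0] or t2[2] == t1[0] or t1[0] == t2[0]
    y = {(t1, t2): LpVariable(f"y_{i}_{j}", cat=LpBinary)
         for (i, t1), (j, t2) in combinations(enumerate(triples), 2)
         if shares(t1, t2)}

    # Objective (Eq. 1), without the trigram term for brevity.
    prob += (lpSum(x[t] * P.get(t[1], 0.0) for t in triples)
             + lpSum(y[pair] * B.get(pair, 0.0) for pair in y))

    # Consistency constraints between unary and binary variables.
    for (t1, t2), y_var in y.items():
        prob += y_var <= x[t1]
        prob += y_var <= x[t2]
        prob += y_var + (1 - x[t1]) + (1 - x[t2]) >= 1

    # Tree shape (Eq. 2): each object has at most one selected head.
    for o in {t[2] for t in triples}:
        prob += lpSum(x[t] for t in triples if t[2] == o) <= 1

    # Size of the solution (Eq. 4).
    prob += lpSum(x.values()) == alpha

    prob.solve()  # PuLP's bundled CBC solver suffices for small graphs
    return [t for t in triples if x[t].value() == 1]
```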
3 Discussion

Table 3 shows content selections which illustrate the main differences between four models: a baseline model with uniform n-gram probability versus a unigram, a bigram and a 3-gram model.

Table 3: Example content selections

Model      Selected Triples
Baseline   Elliot See birthDate "1927-07-23"
           Elliot See birthPlace Dallas
           Elliot See almaMater University of Texas at Austin
           Elliot See source "See's feelings about ..."
           Elliot See status "Deceased"
           Elliot See deathPlace St. Louis
1-Gram     Elliot See birthPlace Dallas
           Elliot See nationality United States
           Elliot See almaMater University of Texas at Austin
           Elliot See rank United States Navy Reserve
           Elliot See mission "None"
           Elliot See deathPlace St. Louis
2-Gram     Elliot See birthDate "1927-07-23"
           Elliot See birthPlace Dallas
           Elliot See nationality United States
           Elliot See almaMater University of Texas at Austin
           Elliot See status "Deceased"
           Elliot See deathPlace St. Louis
3-Gram     Elliot See birthDate "1927-07-23"
           Elliot See birthPlace Dallas
           Elliot See almaMater University of Texas at Austin
           Elliot See deathDate "1966-02-28"
           Elliot See status "Deceased"
           Elliot See deathPlace St. Louis

The baseline model tends to generate solutions with little cohesion between triples. Facts are enumerated which each range over distinct topics (e.g., birth date and place, place of study, status and death place). It may also include properties such as "source" which are generic rather than specific to the type of entity being described.

The 1-gram model is similar to the baseline in that it often generates solutions which are simple enumerations of facts belonging to various topics (birth place, nationality, place of study, rank in the army, space mission, death place). Contrary to the baseline solutions, however, each selected fact is strongly characteristic of the entity type.

The 2- and 3-gram models tend to yield more coherent solutions in that they often contain sets of topically related properties (e.g., birth date and birth place; death date and death place).

4 Conclusion

We have presented a method for content selection from DBPedia data which supports the selection of semantically varied content units of different sizes. While the approach yields good results, one shortcoming is that most of the selected subtrees are trees of depth 1 and that, moreover, trees of depth 2 have limited coherence. For instance, the 1-gram model generates the solution shown in Table 4, where the triples about England are not particularly relevant to the description of the Dead Man's Plack monument. More generally, bi- and 3-grams mostly seem to trigger the selection of 2- and 3-grams that are directly related to the target entity rather than chains of triples. We are currently investigating whether the use of interpolated models could help resolve this issue.

Table 4: Output of Depth 2

Dead Man's Plack   location           England
England            capital            London
England            establishedEvent   Acts of Union 1707
England            religion           Church of England
Dead Man's Plack   dedicatedTo        Athelwald
Dead Man's Plack   monumentName       "Dead Man's Plack"
Dead Man's Plack   material           Rock

Another important point we are currently investigating concerns the creation of a benchmark for Natural Language Generation. Most existing work on data-to-text generation relies on a parallel or comparable data-to-text corpus. To generate from the frames produced by a dialog system, DeVault et al. (2008) describe an approach in which a probabilistic Tree Adjoining Grammar is induced from a training set aligning frames and sentences and used to generate with a beam search that uses weighted features learned from the training data to rank alternative expansions at each step. More recently, data-to-text generators (Angeli et al., 2010; Chen and Mooney, 2008; Wong and Mooney, 2007; Konstas and Lapata, 2012b; Konstas and Lapata, 2012a) were trained and developed on data-to-text corpora from various domains including the air travel domain (Dahl et al., 1994), weather forecasts (Liang et al., 2009; Belz, 2008) and sportscasting (Chen and Mooney, 2008).

Creating such data-to-text corpora is, however, difficult, time consuming and not generic. Contrary to parsing, where resources such as the Penn Treebank succeeded in boosting research, natural language generation still suffers from a lack of a common reference on which to train and evaluate generators. Using crowdsourcing and the content selection method presented here, we plan to construct a large benchmark on which data-to-text generators can be trained and tested.

5 Acknowledgments

We thank the French National Research Agency for funding the research presented in this paper in the context of the WebNLG project.

References

Gabor Angeli, Percy Liang, and Dan Klein. 2010. A simple domain-independent probabilistic approach to generation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 502–512. Association for Computational Linguistics.

Anja Belz. 2008. Automatic generation of weather forecast texts using comprehensive probabilistic generation-space models. Natural Language Engineering, 14(4):431–455.

David L. Chen and Raymond J. Mooney. 2008. Learning to sportscast: a test of grounded language acquisition. In Proceedings of the 25th International Conference on Machine Learning, pages 128–135. ACM.

Deborah A. Dahl, Madeleine Bates, Michael Brown, William Fisher, Kate Hunicke-Smith, David Pallett, Christine Pao, Alexander Rudnicky, and Elizabeth Shriberg. 1994. Expanding the scope of the ATIS task: The ATIS-3 corpus. In Proceedings of the Workshop on Human Language Technology, pages 43–48. Association for Computational Linguistics.

David DeVault, David Traum, and Ron Artstein. 2008. Making grammar-based generation easier to deploy in dialogue systems. In Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue, pages 198–207. Association for Computational Linguistics.

Ioannis Konstas and Mirella Lapata. 2012a. Concept-to-text generation via discriminative reranking. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 369–378. Association for Computational Linguistics.

Ioannis Konstas and Mirella Lapata. 2012b. Unsupervised concept-to-text generation with hypergraphs. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 752–761. Association for Computational Linguistics.

Percy Liang, Michael I. Jordan, and Dan Klein. 2009. Learning semantic correspondences with less supervision. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, pages 91–99. Association for Computational Linguistics.

Yuk Wah Wong and Raymond J. Mooney. 2007. Generation by inverting a semantic parser that uses statistical machine translation. In HLT-NAACL, pages 172–179.
