Category-Driven Content Selection


Rania Mohamed Sayed, Université de Lorraine, Nancy (France), [email protected]
Laura Perez-Beltrachini, CNRS/LORIA, Nancy (France), [email protected]
Claire Gardent, CNRS/LORIA, Nancy (France), [email protected]

Abstract

In this paper, we introduce a content selection method where the communicative goal is to describe entities of different categories (e.g., astronauts, universities or monuments). We argue that this method provides an interesting basis both for generating descriptions of entities and for semi-automatically constructing a benchmark on which to train, test and compare data-to-text generation systems.

1 Introduction

With the development of the Linked Open Data framework (LOD, http://lod-cloud.net/), a considerable amount of RDF(S) data is now available on the Web. While this data contains a wide range of interesting factual and encyclopedic knowledge, the RDF(S) format in which it is encoded makes it difficult to access by lay users. Natural Language Generation (NLG) would provide a natural means of addressing this shortcoming. It would permit, for instance, enriching existing texts with encyclopaedic information drawn from linked data sources such as DBPedia, or automatically creating a Wikipedia stub for an instance of an ontology from the associated linked data. Conversely, because of its well-defined syntax and semantics, the RDF(S) format in which linked data is encoded provides a natural ground on which to develop, test and compare NLG systems.

In this paper, we focus on content selection from RDF data where the communicative goal is to describe entities of various categories (e.g., astronauts or monuments). We introduce a content selection method which, given an entity, retrieves from DBPedia an RDF subgraph that encodes relevant and coherent knowledge about this entity. Our approach differs from previous work in that it leverages the categorial information provided by large scale knowledge bases about entities of a given type. Using n-gram models of the RDF(S) properties occurring in the RDF(S) graphs associated with entities of the same category, we select, for a given entity of category C, a subgraph with maximal n-gram probability, that is, a subgraph which contains properties that are true of that entity, that are typical of that category and that support the generation of a coherent text.

2 Method

Given an entity e of category C and its associated DBPedia entity graph G_e, our task is to select a (target) subgraph T_e of G_e such that:

- T_e is relevant: the DBPedia properties contained in T_e are commonly (directly or indirectly) associated with entities of type C;
- T_e maximises global coherence: DBPedia entries that often co-occur in type C are selected together;
- T_e supports local coherence: the set of DBPedia triples contained in T_e captures a sequence of entity-based transitions which supports the generation of locally coherent texts, i.e., texts such that the propositions they contain are related through shared entities (see the sketch after this list).
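To make these entity-based transitions concrete, here is a minimal Python sketch (ours, not part of the paper) of the triple representation and of the shared-entity test that underlies local coherence; the names Triple and shares_entity are illustrative only:

```python
from typing import NamedTuple

class Triple(NamedTuple):
    """One DBPedia RDF triple (subject, property, object)."""
    s: str
    p: str
    o: str

def shares_entity(t1: Triple, t2: Triple) -> bool:
    """Entity-based transition: two triples are linked if the object of
    one is the subject of the other, or if they share the same subject
    (the condition used for the bigram variables in Section 2.3)."""
    return t1.o == t2.s or t2.o == t1.s or t1.s == t2.s

# Example: two triples about the same subject are linked.
t1 = Triple("Elliot_See", "birthPlace", "Dallas")
t2 = Triple("Elliot_See", "almaMater", "University_of_Texas_at_Austin")
assert shares_entity(t1, t2)
```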

To provide a content selection process which implements these constraints, we proceed in three main steps.

First, we build n-gram models of properties for DBPedia categories. That is, we define the probability of 1-, 2- and 3-grams of DBPedia properties for a given category. Second, we extract from DBPedia entity graphs of depth four. Third, we use the n-gram models of DBPedia properties and Integer Linear Programming (ILP) to identify subtrees of entity graphs with maximal probability. Intuitively, we select subtrees of the entity graph which are relevant (the properties they contain are frequent for that category), which are locally coherent (the tree constraints ensure that the selected triples are related by entity sharing) and which are globally coherent (the use of bi- and tri-gram probabilities supports the selection of properties that frequently co-occur in the graphs of entities of that category).

2.1 Building n-gram models of DBPedia properties

To build the n-gram models, we extract from DBPedia the graphs associated with all entities of those categories up to depth 4. Table 1 shows some statistics for these graphs. We build the n-gram models using the SRILM toolkit. To experiment with various versions of n-gram information, we create, for each category, 1-, 2- and 3-grams of DBPedia properties.

Table 1: Category Graphs

Category     Nb. Entities   Nb. Triples   Nb. Properties
Astronaut    110            1664033       4167
Monument     500            818145        6521
University   500            969541        7441

2.2 Building Entity Graphs

For each of the three categories, we then extract from DBPedia the graphs associated with 5 entities, considering RDF triples up to depth two. Table 2 shows the statistics for each entity depending on the depth of the graph. A sketch of the extraction and counting steps is given after Table 2.

Table 2: Entity Graphs

Category     Entity   Depth 1   Depth 2
Astronaut    e1       14        24
             e2       21        32
             e3       16        28
             e4       12        24
             e5       15        22
Monument     e1       13        18
             e2       20        21
             e3       7         14
             e4       6         14
             e5       4         11
University   e1       6         20
             e2       13        21
             e3       6         10
             e4       9         16
             e5       27        34
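The paper relies on the SRILM toolkit for the actual language models. Purely as a sketch of the data flow in Sections 2.1 and 2.2, the following code retrieves a depth-1 entity graph from the public DBPedia SPARQL endpoint (via the SPARQLWrapper library, our choice) and derives unigram property counts; crawling to depth 2 or 4 and building the 2- and 3-gram models are left implicit, and all function names are ours:

```python
from collections import Counter
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://dbpedia.org/sparql"

def entity_triples(entity_uri):
    """Return the depth-1 triples (s, p, o) whose subject is entity_uri."""
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(f"SELECT ?p ?o WHERE {{ <{entity_uri}> ?p ?o }}")
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return [(entity_uri, r["p"]["value"], r["o"]["value"]) for r in rows]

def property_unigram_counts(entity_uris):
    """count(p, C): how often each property occurs across the graphs of a
    category's entities. Dividing by the total yields P(p) of Section 2.3."""
    counts = Counter()
    for uri in entity_uris:
        counts.update(p for _, p, _ in entity_triples(uri))
    return counts

# Hypothetical usage for one Astronaut entity:
astronauts = ["http://dbpedia.org/resource/Elliot_See"]
counts = property_unigram_counts(astronauts)
total = sum(counts.values())
P = {p: c / total for p, c in counts.items()}  # unigram model P(p)
```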
2.3 Selecting DBPedia Subgraphs

To retrieve subtrees of DBPedia subgraphs which are maximally coherent, we use the following ILP model.

Representing tuples. Given an entity graph G_e for the DBPedia entity e of category C (e.g., Astronaut), for each triple t = (s, p, o) in G_e, we introduce a binary variable x^p_{s,o} such that:

x_t = x^p_{s,o} = \begin{cases} 1 & \text{if the tuple is preserved} \\ 0 & \text{otherwise} \end{cases}

Because we use 2- and 3-grams to capture global coherence (properties that often co-occur together), we also have variables for bigrams and trigrams of tuples. For bigrams, these variables capture triples which share an entity (either the object of one is the subject of the other, or they share the same subject). So for each bigram of triples t_1 = (s_1, p_1, o_1) and t_2 = (s_2, p_2, o_2) in G_e such that o_1 = s_2, o_2 = s_1 or s_1 = s_2, we introduce a binary variable y_{t_1,t_2} such that:

y_{t_1,t_2} = \begin{cases} 1 & \text{if the pair of triples is preserved} \\ 0 & \text{otherwise} \end{cases}

Similarly, there is a trigram binary variable z_{t_1,t_2,t_3} for each connected set of triples t_1, t_2, t_3 in G_e such that:

z_{t_1,t_2,t_3} = \begin{cases} 1 & \text{if the trigram of triples is preserved} \\ 0 & \text{otherwise} \end{cases}

Maximising Relevance and Coherence. To maximise relevance and coherence, we seek to find a subtree of the input graph G_e which maximises the following objective function:

S(X) = \sum_{x} x_t \cdot P(p) + \sum_{y} y_{t_i,t_j} \cdot B(t_i, t_j) + \sum_{z} z_{t_i,t_j,t_k} \cdot T(t_i, t_j, t_k)    (1)

where P(p), the unigram probability of p in entities of category C, is defined as follows. Let T_C be the set of triples occurring in the entity graphs (depth 2) of all DBPedia entities of category C, let P_C be the set of properties occurring in T_C, and let count(p, C) be the number of times p occurs in T_C; then:

P(p) = \frac{count(p, C)}{\sum_i count(p_i, C)}

Similarly, B(t_i, t_j) and T(t_i, t_j, t_k) are the 2- and 3-gram probabilities P(t_2 | t_1) and P(t_3 | t_1 t_2).

Consistency Constraints. We ensure consistency between the unary and the binary variables so that if a bigram is selected then so are the corresponding triples:

\forall i, j: \quad y_{i,j} \le x_i
\forall i, j: \quad y_{i,j} \le x_j
\forall i, j: \quad y_{i,j} + (1 - x_i) + (1 - x_j) \ge 1

Ensuring Local Coherence (Tree Shape). Solutions are constrained to be trees by requiring that each object has at most one subject (Eq. 2) and that all tuples are connected (Eq. 3):

\forall o \in X, \quad \sum_{s,p} x^p_{s,o} \le 1    (2)

\forall o \in X, \quad \sum_{s,p} x^p_{s,o} - \frac{1}{|X|} \sum_{u,p} x^p_{o,u} \ge 0    (3)

where X is the set of nodes that occur in the solution (except the root node). This constraint makes sure that if o has a child then it also has a head. The first part of Eq. 3 counts the number of head properties of o. The second part counts the children of o, which could number more than one; it is therefore normalised by |X| to make it less than 1, so that the difference is greater than or equal to 0 only if o also has a head.

Restricting the Size of the Resulting Tree. Solutions are constrained to contain α tuples:

\sum x^p_{s,o} = \alpha    (4)
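To make the ILP concrete, here is a rough sketch using the open-source PuLP library (our choice; the paper does not name a solver). For brevity it keeps only the unigram and bigram terms of Eq. 1 and omits the connectivity constraint of Eq. 3; the inputs P, B and alpha are assumed to be precomputed as described above, and all names are ours:

```python
from itertools import combinations
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary

def select_subtree(triples, P, B, alpha):
    """triples: list of (s, p, o) tuples; P: property -> unigram probability;
    B: (triple, triple) -> bigram probability; alpha: target size (Eq. 4)."""
    prob = LpProblem("content_selection", LpMaximize)

    # One binary variable x_t per triple ("Representing tuples").
    x = {t: LpVariable(f"x_{i}", cat=LpBinary) for i, t in enumerate(triples)}

    # Bigram variables y_{t1,t2} for pairs of triples that share an entity.
    def shares(t1, t2):
        return t1[2] == t2[0] or t2[2] == t1[0] or t1[0] == t2[0]
    y = {(t1, t2): LpVariable(f"y_{i}_{j}", cat=LpBinary)
         for (i, t1), (j, t2) in combinations(enumerate(triples), 2)
         if shares(t1, t2)}

    # Objective (Eq. 1), without the trigram term for brevity.
    prob += (lpSum(x[t] * P.get(t[1], 0.0) for t in triples)
             + lpSum(y[pair] * B.get(pair, 0.0) for pair in y))

    # Consistency constraints between unary and binary variables.
    for (t1, t2), y_var in y.items():
        prob += y_var <= x[t1]
        prob += y_var <= x[t2]
        prob += y_var + (1 - x[t1]) + (1 - x[t2]) >= 1

    # Tree shape (Eq. 2): each object has at most one selected head.
    for o in {t[2] for t in triples}:
        prob += lpSum(x[t] for t in triples if t[2] == o) <= 1

    # Size of the solution (Eq. 4).
    prob += lpSum(x.values()) == alpha

    prob.solve()  # PuLP's bundled CBC solver suffices for small graphs
    return [t for t in triples if x[t].value() == 1]
```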
3 Discussion

Table 3 shows content selections which illustrate the main differences between four models: a baseline model with uniform n-gram probability versus a unigram, a bigram and a 3-gram model.

Table 3: Example content selections

Model      Selected Triples
Baseline   Elliot See birthDate "1927-07-23"
           Elliot See birthPlace Dallas
           Elliot See almaMater University of Texas at Austin
           Elliot See source "See's feelings about ..."
           Elliot See status "Deceased"
           Elliot See deathPlace St. Louis
1-Gram     Elliot See birthPlace Dallas
           Elliot See nationality United States
           Elliot See almaMater University of Texas at Austin
           Elliot See rank United States Navy Reserve
           Elliot See mission "None"
           Elliot See deathPlace St. Louis
2-Gram     Elliot See birthDate "1927-07-23"
           Elliot See birthPlace Dallas
           Elliot See nationality United States
           Elliot See almaMater University of Texas at Austin
           Elliot See status "Deceased"
           Elliot See deathPlace St. Louis
3-Gram     Elliot See birthDate "1927-07-23"
           Elliot See birthPlace Dallas
           Elliot See almaMater University of Texas at Austin
           Elliot See deathDate "1966-02-28"
           Elliot See status "Deceased"
           Elliot See deathPlace St. Louis

The baseline model tends to generate solutions with little cohesion between triples. Facts are enumerated which each range over distinct topics (e.g., birth date and place, place of study, status and death place). It may also include properties such as "source" which are generic rather than specific to the type of entity being described.

The 1-gram model is similar to the baseline in that it often generates solutions which are simple enumerations of facts belonging to various topics (birth place, nationality, place of study, rank in the army, space mission, death place). Contrary to the baseline solutions, however, each selected fact is strongly characteristic of the entity type.

The 2- and 3-gram models tend to yield more coherent solutions in that they often contain sets of topically related properties (e.g., birth date and birth place; death date and death place).

4 Conclusion

We have presented a method for content selection from DBPedia data which supports the selection of semantically varied content units of different sizes. While the approach yields good results, one shortcoming is that most of the selected subtrees are trees of depth 1 and that, moreover, trees of depth 2 have limited coherence. For instance, the 1-gram model generates the solution shown in Table 4, where the triples about England are not particularly relevant to the description of the Dead Man's Plack monument. More generally, bi- and 3-grams mostly seem to trigger the selection of 2- and 3-grams that are directly related to the target entity rather than chains of triples. We are currently investigating whether the use of interpolated models could help resolve this issue.

Table 4: Output of Depth 2

Dead Man's Plack   location           England
England            capital            London
England            establishedEvent   Acts of Union 1707
England            religion           Church of England
Dead Man's Plack   dedicatedTo        Athelwald
Dead Man's Plack   monumentName       "Dead Man's Plack"
Dead Man's Plack   material           Rock

Another important point we are currently investigating concerns the creation of a benchmark for Natural Language Generation. Most existing work on data-to-text generation relies on a parallel or comparable data-to-text corpus. To generate from the frames produced by a dialog system, DeVault et al. (2008) describe an approach in which a probabilistic Tree Adjoining Grammar is induced from a training set aligning frames and sentences and used to generate with a beam search that uses weighted features learned from the training data to rank alternative expansions at each step. More recently, data-to-text generators (Angeli et al., 2010; Chen and Mooney, 2008; Wong and Mooney, 2007; Konstas and Lapata, 2012b; Konstas and Lapata, 2012a) were trained and developed on data-to-text corpora from various domains including the air travel domain (Dahl et al., 1994), weather forecasts (Liang et al., 2009; Belz, 2008) and sportscasting (Chen and Mooney, 2008).

Creating such data-to-text corpora is, however, difficult, time consuming and not generic. Contrary to parsing, where resources such as the Penn Treebank succeeded in boosting research, natural language generation still suffers from a lack of a common reference on which to train and evaluate generators. Using crowdsourcing and the content selection method presented here, we plan to construct a large benchmark on which data-to-text generators can be trained and tested.

5 Acknowledgments

We thank the French National Research Agency for funding the research presented in this paper in the context of the WebNLG project.

References

Gabor Angeli, Percy Liang, and Dan Klein. 2010. A simple domain-independent probabilistic approach to generation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 502–512. Association for Computational Linguistics.

Anja Belz. 2008. Automatic generation of weather forecast texts using comprehensive probabilistic generation-space models. Natural Language Engineering, 14(4):431–455.

David L. Chen and Raymond J. Mooney. 2008. Learning to sportscast: a test of grounded language acquisition. In Proceedings of the 25th International Conference on Machine Learning, pages 128–135. ACM.

Deborah A. Dahl, Madeleine Bates, Michael Brown, William Fisher, Kate Hunicke-Smith, David Pallett, Christine Pao, Alexander Rudnicky, and Elizabeth Shriberg. 1994. Expanding the scope of the ATIS task: The ATIS-3 corpus. In Proceedings of the Workshop on Human Language Technology, pages 43–48. Association for Computational Linguistics.

David DeVault, David Traum, and Ron Artstein. 2008. Making grammar-based generation easier to deploy in dialogue systems. In Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue, pages 198–207. Association for Computational Linguistics.

Ioannis Konstas and Mirella Lapata. 2012a. Concept-to-text generation via discriminative reranking. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 369–378. Association for Computational Linguistics.

Ioannis Konstas and Mirella Lapata. 2012b. Unsupervised concept-to-text generation with hypergraphs. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 752–761. Association for Computational Linguistics.

Percy Liang, Michael I. Jordan, and Dan Klein. 2009. Learning semantic correspondences with less supervision. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, pages 91–99. Association for Computational Linguistics.

Yuk Wah Wong and Raymond J. Mooney. 2007. Generation by inverting a semantic parser that uses statistical machine translation. In HLT-NAACL, pages 172–179.
