Leveraging Frequent Query Substructures to Generate Formal Queries for Complex Question Answering

Jiwei Ding, Wei Hu∗, Qixin Xu, Yuzhong Qu∗
State Key Laboratory for Novel Software Technology, Nanjing University, China
{jwding, qxxu}[email protected], {whu, yzqu}@nju.edu.cn
∗ Corresponding authors

Abstract

Formal query generation aims to generate correct executable queries for question answering over knowledge bases (KBs), given entity and relation linking results. Current approaches build universal paraphrasing or ranking models for whole questions, which are likely to fail in generating queries for complex, long-tail questions. In this paper, we propose SubQG, a new query generation approach based on frequent query substructures, which helps rank the existing (but nonsignificant) query structures or build new query structures. Our experiments on two benchmark datasets show that our approach significantly outperforms the existing ones, especially for complex questions. Also, it achieves promising performance with limited training data and noisy entity/relation linking results.

1 Introduction

Knowledge-based question answering (KBQA) aims to answer natural language questions over knowledge bases (KBs) such as DBpedia and Freebase. Formal query generation is an important component in many KBQA systems (Bao et al., 2016; Cui et al., 2017; Luo et al., 2018), especially for answering complex questions. Given entity and relation linking results, formal query generation aims to generate correct executable queries, e.g., SPARQL queries, for the input natural language questions. An example question and its formal query are shown in Figure 1.

[Figure 1: An example of a complex question and its query. Question: "How many movies have the same director as The Shawshank Redemption?"; ?Var2 is the COUNT of ?Var1, ?Var1 ISA dbo:Film, and both ?Var1 and dbr:TSR have dbo:director ?Var3]

Generally speaking, formal query generation is expected to include, but not be limited to, the capabilities of (i) recognizing and paraphrasing different kinds of constraints, including triple-level constraints (e.g., "movies" corresponds to a typing constraint for the target variable) and higher-level constraints (e.g., subgraphs; for instance, "the same ... as" represents the complex structure shown in the middle of Figure 1); (ii) recognizing and paraphrasing aggregations (e.g., "how many" corresponds to COUNT); and (iii) organizing all of the above to generate an executable query (Singh et al., 2018; Zafar et al., 2018).

There are mainly two kinds of query generation approaches for complex questions. (i) Template-based approaches choose a pre-collected template for query generation (Cui et al., 2017; Abujabal et al., 2017). Such approaches rely heavily on the coverage of the templates and perform unstably when some complex templates have very few natural language questions as training data. (ii) Approaches based on semantic parsing and neural networks learn entire representations for questions with different query structures, using a neural network that follows the encode-and-compare framework (Luo et al., 2018; Zafar et al., 2018). They may suffer from the lack of training data, especially for long-tail questions with rarely appearing structures. Furthermore, neither kind of approach can handle questions with unseen query structures, since neither can generate new query structures.

To cope with the above limitations, we propose a new query generation approach based on the following observation: the query structure for a complex question may rarely appear, but it usually contains some substructures that frequently appear in other questions. For example, the query structure for the question in Figure 1 appears rarely; however, both "how many movies" and "the same ... as" are common expressions, which correspond to the two query substructures shown in dashed boxes in Figure 2.

To collect such frequently-appearing substructures, we automatically decompose the query structures in the training data. Instead of directly modeling the query structure for a given question as a whole, we employ multiple neural networks to predict the query substructures contained in the question, each of which delivers a part of the query intention. Then, we select an existing query structure for the input question using a combinational ranking function. Also, in some cases, no existing query structure is appropriate for the input question. To cope with this issue, we merge query substructures to build new query structures. The contributions of this paper are summarized below:

• We formalize the notion of query structures and define the substructure relationship between query structures.

• We propose a novel approach for formal query generation, which first leverages multiple neural networks to predict the query substructures contained in the given question, and then ranks existing query structures using a combinational function.

• We merge query substructures to build new query structures, which handles questions with unseen query structures.

• We perform extensive experiments on two KBQA datasets, and show that SubQG significantly outperforms the existing approaches. Furthermore, SubQG achieves promising performance with limited training data and noisy entity/relation linking results.

2 Preliminaries

An entity is typically denoted by a URI and described with a set of properties and values. A fact is an ⟨entity, property, value⟩ triple, where the value can be either a literal or another entity. A KB is a pair K = (E, F), where E denotes the set of entities and F denotes the set of facts.

A formal query (or simply query) is the structured representation of a natural language question executable on a given KB. Formally, a query is a pair Q = (V, T), where V denotes the set of vertices and T denotes the set of labeled edges. A vertex can be a variable, an entity or a literal, and the label of an edge can be either a built-in property or a user-defined one. For simplicity, the set of all edge labels of Q is denoted by Le(Q). In this paper, the built-in properties include COUNT, AVG, MAX, MIN, MAXATN, MINATN and ISA (rdf:type), where the former four are used to connect two variables. For example, ⟨?Var1, COUNT, ?Var2⟩ represents that ?Var2 is the counting result of ?Var1. MAXATN and MINATN take the meaning of ORDER BY in SPARQL (Bao et al., 2016); for instance, ⟨?Var1, MAXATN, 2⟩ means ORDER BY DESC(?Var1) LIMIT 1 OFFSET 1.

To classify queries with similar intentions and narrow the search space for query generation, we introduce the notion of query structures. A query structure is a set of structurally-equivalent queries. Let Qa = (Va, Ta) and Qb = (Vb, Tb) denote two queries. Qa is structurally-equivalent to Qb, denoted by Qa ≅ Qb, if and only if there exist two bijections f : Va → Vb and g : Le(Qa) → Le(Qb) such that:

(i) ∀v ∈ Va, v is a variable ⇔ f(v) is a variable;

(ii) ∀r ∈ Le(Qa), r is a user-defined property ⇔ g(r) is a user-defined property; if r is a built-in property, then g(r) = r;

(iii) ∀v ∀r ∀v′, ⟨v, r, v′⟩ ∈ Ta ⇔ ⟨f(v), g(r), f(v′)⟩ ∈ Tb.

The query structure for Qa is denoted by Sa = [Qa], which contains all the queries structurally-equivalent to Qa. For graphical illustration, we represent a query structure by a representative query among the structurally-equivalent ones and replace entities and literals with different kinds of placeholders. An example of a query and its query structure is shown in the upper half of Figure 2.

[Figure 2: Illustration of (a) a query, (b) its query structure, and (c) query substructures, for the question "How many movies were directed by the graduate of Burbank High School?"]

For many simple questions, two query structures, i.e., ({?Var1, Ent1}, {⟨?Var1, Prop1, Ent1⟩}) and ({?Var1, Ent1}, {⟨Ent1, Prop1, ?Var1⟩}), are sufficient. A brute-force sketch of the structural equivalence test is given below.
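To make the definition concrete, the following is a minimal Python sketch of the structural equivalence test (our illustration, not the authors' code). A query is represented as a set of ⟨subject, label, object⟩ triples, and the sketch searches for bijections f and g satisfying conditions (i)-(iii); it is exponential in the number of vertices, which is acceptable for the small queries considered here.

```python
from itertools import permutations

BUILTINS = {"COUNT", "AVG", "MAX", "MIN", "MAXATN", "MINATN", "ISA"}

def is_var(vertex):
    # The paper writes variables with a leading '?', e.g. '?Var1'.
    return str(vertex).startswith("?")

def structurally_equivalent(qa, qb):
    """Brute-force test of Qa ~= Qb via conditions (i)-(iii).

    qa, qb: sets of (subject, label, object) triples; the vertex sets
    are implicit. Exponential in |V|, fine for queries of a few triples.
    """
    va = sorted({x for (s, _, o) in qa for x in (s, o)})
    vb = sorted({x for (s, _, o) in qb for x in (s, o)})
    la = sorted({r for (_, r, _) in qa if r not in BUILTINS})
    lb = sorted({r for (_, r, _) in qb if r not in BUILTINS})
    if (len(va), len(la), len(qa)) != (len(vb), len(lb), len(qb)):
        return False
    for vperm in permutations(vb):
        f = dict(zip(va, vperm))                # candidate bijection f
        if any(is_var(v) != is_var(f[v]) for v in va):
            continue                            # violates condition (i)
        for lperm in permutations(lb):
            g = dict(zip(la, lperm))            # bijection g; built-ins
            image = {(f[s], g.get(r, r), f[o])  # map to themselves (ii)
                     for (s, r, o) in qa}
            if image == set(qb):                # condition (iii)
                return True
    return False
```

For the two one-triple structures above, structurally_equivalent({("?Var1", "Prop1", "Ent1")}, {("Ent1", "Prop1", "?Var1")}) returns False: no variable-preserving bijection aligns them, since edge direction matters, which is why both structures are needed.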

However, for complex questions, a diversity of query structures exists, and some of them share a set of frequently-appearing substructures, each of which delivers a part of the query intention. We define query substructures as follows.

Let Sa = [Qa] and Sb = [Qb] denote two query structures. Sa is a query substructure of Sb, denoted by Sa ⪯ Sb, if and only if Qb has a subgraph Qc such that Qa ≅ Qc. Furthermore, if Sa = [Qa] ⪯ Sb = [Qb], we say that Qb has Sa, and that Sa is contained in Qb.

For example, although the query structures for the two questions in Figures 1 and 2 are different, they share the query substructure ({?Var1, ?Var2, Class1}, {⟨?Var1, COUNT, ?Var2⟩, ⟨?Var1, ISA, Class1⟩}), which corresponds to the phrase "how many movies". Note that a query substructure can itself be the query structure of another question.

The goal of this paper is to leverage a set of frequent query (sub-)structures to generate formal queries for answering complex questions.

3 The Proposed Approach

In this section, we present our approach, SubQG, for query generation. We first introduce the framework and its general steps with a running example (Section 3.1), and then describe the important steps in detail in the following subsections.

3.1 Framework

Figure 3 depicts the framework of SubQG, which comprises an offline training process and an online query generation process.

[Figure 3: Framework of the proposed approach: (a) offline training, (b) online query generation]

Offline. The offline process takes as input a set of training data in the form of ⟨question, query⟩ pairs, and mainly contains three steps:

1. Collect query structures. For the questions in the training data, we first discover the structurally-equivalent queries, and then extract the set of all query structures, denoted by TS.

2. Collect frequent query substructures. We decompose each query structure Si = (Vi, Ti) ∈ TS to obtain the set of all its query substructures. Let Tj be a non-empty subset of Ti, and let VTj be the set of vertices used in Tj. By the definition above, Sj = (VTj, Tj) is a query substructure of Si, so we can generate all query substructures of Si from the subsets of Ti. Disconnected query substructures are ignored, since they express discontinuous meanings and should be split into smaller query substructures. If more than γ queries in the training data have the substructure Sj, we consider Sj a frequent query substructure. The set of all frequent query substructures is denoted by FS∗. A sketch of this decomposition is given below.
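The decomposition in step 2 can be read as the following sketch (ours, not the authors' code). For brevity, it counts syntactically identical substructures, whereas the paper counts them up to structural equivalence, e.g., by grouping with structurally_equivalent above.

```python
from collections import Counter
from itertools import combinations

def is_connected(triples):
    """Is the (undirected) query graph of these triples connected?"""
    verts = {x for (s, _, o) in triples for x in (s, o)}
    seen, stack = set(), [next(iter(verts))]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack += [o for (s, _, o) in triples if s == v]
            stack += [s for (s, _, o) in triples if o == v]
    return seen == verts

def substructures(triples):
    """One substructure per non-empty subset of the triples,
    discarding disconnected subsets (as the paper prescribes)."""
    for k in range(1, len(triples) + 1):
        for subset in combinations(sorted(triples), k):
            if is_connected(subset):
                yield frozenset(subset)

def frequent_substructures(train_structures, gamma=30):
    """Substructures contained in more than gamma training queries;
    gamma = 30 is the threshold reported in Section 4.1."""
    counts = Counter()
    for s in train_structures:
        counts.update(set(substructures(s)))
    return [sub for sub, c in counts.items() if c > gamma]
```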

3. Train query substructure predictors. We train a neural network for each query substructure S*_i ∈ FS∗ to predict the probability that Q^y has S*_i (i.e., S*_i ⪯ [Q^y]) for an input question y, where Q^y denotes the formal query for y. Details of this step are described in Section 3.2.

Online. The online query generation process takes as input a natural language question y, and mainly contains four steps:

1. Predict query substructures. We first predict the probability that S*_i ⪯ [Q^y] for each S*_i ∈ FS∗, using the query substructure predictors trained in the offline step. An example question and the four query substructures with the highest prediction probabilities are shown in the top of Figure 4.

[Figure 4: An example for online query generation, showing the predicted substructures, the ranked and merged query structures, and the grounding and validation results for "How many movies have the same director as The Shawshank Redemption?"]

2. Rank existing query structures. To find an appropriate query structure for the input question, we rank the existing query structures (Si ∈ TS) using a scoring function; see Section 3.3.

3. Merge query substructures. Considering that the target query structure [Q^y] may not appear in TS (i.e., no query in the training data is structurally-equivalent to Q^y), we design a method (described in Section 3.4) to merge question-contained query substructures into new query structures. The merged results are ranked using the same function as the existing query structures. Several query structures (including the merged results and the existing query structures) for the example question are shown in the middle of Figure 4.

4. Grounding and validation. We leverage the query structure ranking result, along with the entity/relation linking results from existing black-box systems (Dubey et al., 2018), to generate an executable formal query for the input question. For each query structure, we try all possible combinations of the linking results in descending order of the overall linking score, and perform validation, including a grammar check, a domain/range check and an empty-query check. The first non-empty query passing all validations is taken as the output of SubQG. The grounding and validation results for the example question are shown in the bottom of Figure 4. A sketch of this grounding loop is given below.
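The following is one concrete reading of the grounding step (a sketch under our own naming, not the authors' code). The schema check and query execution are left as callbacks, e.g., backed by a SPARQL endpoint, and combining per-mention linking scores by product is one simple choice that the paper does not pin down.

```python
from itertools import product
from math import prod

def instantiate(structure, binding):
    """Fill placeholders (Ent1, Prop1, Class1, ...) with linked URIs."""
    return [tuple(binding.get(x, x) for x in t) for t in structure]

def ground(structure, linking, execute, schema_ok):
    """Return the first non-empty, valid grounding of a query structure.

    linking:   dict placeholder -> list of (uri, score) candidates.
    execute:   runs the instantiated query on the KB, returns result rows.
    schema_ok: grammar and domain/range validation callback.
    """
    slots = sorted(linking)
    combos = product(*(linking[s] for s in slots))
    for combo in sorted(combos, key=lambda c: -prod(s for (_, s) in c)):
        binding = dict(zip(slots, (uri for (uri, _) in combo)))
        query = instantiate(structure, binding)
        if not schema_ok(query):   # e.g., dbo:Film is not the range
            continue               # of dbo:director, as in Figure 4
        rows = execute(query)
        if rows:                   # empty-query check
            return query, rows
    return None, []
```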

3.2 Query Substructure Prediction

In this step, we employ an attention-based BiLSTM network (Raffel and Ellis, 2015) to predict Pr[S*_i | y] for each frequent query substructure S*_i ∈ FS∗, where Pr[S*_i | y] represents the probability of S*_i ⪯ [Q^y]. There are mainly three reasons why we use one predictor per query substructure instead of a multi-tag predictor for all query substructures: (i) a query substructure usually expresses only part of the meaning of the input question, and different query substructures may focus on different words or phrases, so each predictor should have its own attention matrix; (ii) a multi-tag predictor may have lower accuracy, since each tag has unbalanced training data; (iii) a single pre-trained query substructure predictor from one dataset can be directly reused on another without adjusting the network structure, whereas a multi-tag predictor needs its output layer resized and retrained whenever the set of frequent query substructures changes.

The structure of the network is shown in Figure 5.

[Figure 5: Attention-based BiLSTM network, with 100-dimensional embeddings, 200-dimensional BiLSTM states h_1..h_T, attention weights α_1..α_T, and output Pr[S*_i | y]]

Before the input question is fed into the network, we replace all entity mentions with ⟨Entity⟩ using EARL (Dubey et al., 2018), to enhance the generalization ability. Given the question sequence {w_1, ..., w_T}, we first use a word embedding matrix to convert the original sequence into word vectors {e_1, ..., e_T}, followed by a BiLSTM network that generates a context-sensitive representation {h_1, ..., h_T} for each word, where

\[ h_t = [\overrightarrow{\mathrm{LSTM}}(e_t, h_{t-1});\ \overleftarrow{\mathrm{LSTM}}(e_t, h_{t+1})]. \tag{1} \]

Then, the attention mechanism takes each h_t as input and calculates a weight α_t for it, formulated as follows:

\[ \alpha_t = \frac{e^{\mathrm{Att}(h_t)}}{\sum_{k=1}^{T} e^{\mathrm{Att}(h_k)}}, \tag{2} \]

\[ \mathrm{Att}(h_t) = v_{\mathrm{att}}^{\top} \tanh(W_{\mathrm{att}} h_t + b_{\mathrm{att}}), \tag{3} \]

where W_att ∈ R^{|h_t|×|h_t|}, b_att ∈ R^{|h_t|} and v_att ∈ R^{|h_t|}. Next, we obtain the representation of the whole question, q^c, as the weighted sum of the h_t:

\[ q^c = \sum_{t=1}^{T} \alpha_t h_t. \tag{4} \]

The output of the network is a probability

\[ \Pr[S_i^* \mid y] = \sigma(v_{\mathrm{out}}^{\top} q^c + b_{\mathrm{out}}), \tag{5} \]

where v_out ∈ R^{|q^c|} and b_out ∈ R. The loss function minimized during training is the binary cross-entropy

\[ \mathrm{Loss}(S_i^*) = -\sum_{\substack{(y, Q^y) \in \mathrm{Train} \\ \text{s.t. } S_i^* \preceq [Q^y]}} \log \Pr[S_i^* \mid y] \; - \sum_{\substack{(y, Q^y) \in \mathrm{Train} \\ \text{s.t. } S_i^* \not\preceq [Q^y]}} \log (1 - \Pr[S_i^* \mid y]), \tag{6} \]

where Train denotes the set of training data. A sketch of this network is given below.
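A minimal PyTorch sketch of one predictor, matching Eqs. (1)-(5) and the dimensions in Figure 5 (our illustration; the class and parameter names are ours). Training one instance per frequent query substructure with nn.BCELoss corresponds to Eq. (6).

```python
import torch
import torch.nn as nn

class SubstructurePredictor(nn.Module):
    """Binary predictor for one frequent query substructure S*_i."""

    def __init__(self, vocab_size, emb_dim=100, hidden=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # Eq. (1): h_t concatenates forward and backward LSTM states,
        # so d = 2 * hidden = 200, as in Figure 5.
        self.lstm = nn.LSTM(emb_dim, hidden,
                            bidirectional=True, batch_first=True)
        d = 2 * hidden
        self.W_att = nn.Linear(d, d)              # W_att, b_att in Eq. (3)
        self.v_att = nn.Linear(d, 1, bias=False)  # v_att in Eq. (3)
        self.out = nn.Linear(d, 1)                # v_out, b_out in Eq. (5)

    def forward(self, tokens):                    # tokens: (batch, T) ids
        h, _ = self.lstm(self.emb(tokens))        # (batch, T, d)
        att = self.v_att(torch.tanh(self.W_att(h)))       # Att(h_t), Eq. (3)
        alpha = torch.softmax(att, dim=1)         # Eq. (2), over time steps
        q_c = (alpha * h).sum(dim=1)              # Eq. (4): weighted sum
        return torch.sigmoid(self.out(q_c)).squeeze(-1)   # Eq. (5)
```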

Algorithm 1: Query substructure merging

  Input: question y, frequent query substructures FS∗
  1  FS+ := {S*_i ∈ FS∗ | Pr[S*_i | y] > 0.5};
  2  M(0) := {S*_i ∈ FS∗ | Score[S*_i | y] > θ};
  3  for i = 1 to K do            // K is the maximum number of iterations
  4      M(i) := ∅;
  5      forall S*_j ∈ FS+, S_l ∈ M(i−1) do
  6          M(i) := M(i) ∪ Merge(S*_j, S_l);
  7      M(i) := {S_l ∈ M(i) | Score[S_l | y] > θ};
  8  return ∪_{i=0}^{K} M(i);

3.3 Query Structure Ranking

In this step, we use a combinational function to score each query structure in the training data for the input question. Since the prediction results for the query substructures are independent, the score for a query structure Si is measured by the joint probability

\[ \mathrm{Score}(S_i \mid y) = \prod_{\substack{S_j^* \in FS^* \\ \text{s.t. } S_j^* \preceq S_i}} \Pr[S_j^* \mid y] \;\times \prod_{\substack{S_j^* \in FS^* \\ \text{s.t. } S_j^* \not\preceq S_i}} (1 - \Pr[S_j^* \mid y]). \tag{7} \]

Assume that Q^y ∈ Si. Then for every S*_j ⪯ Si we have S*_j ⪯ [Q^y], so Pr[S*_j | y] should be 1 in the ideal condition; conversely, for every S*_j ⋠ Si, Pr[S*_j | y] should be 0. Thus, in the ideal condition, Score(Si | y) = 1 and Score(Sk | y) = 0 for every Sk ≠ Si.

3.4 Query Substructure Merging

We propose a method, shown in Algorithm 1, to merge question-contained query substructures into new query structures. In the initialization step, it selects some query substructures with high scores as candidates, since a query substructure may directly be the appropriate query structure for the input question. In each iteration, the method merges each question-contained substructure with the existing candidates, and the merged results with high scores are used as candidates in the next iteration. The final output is the union of the results from at most K iterations.

When merging different query substructures, we allow them to share some vertices of the same kind (variable, entity, etc.) or edge labels, except the variables that represent aggregation results. Thus, the merged result of two query substructures is a set of query structures instead of a single one. Also, the following restrictions are used to filter the merged results: (i) the merged results should be connected; (ii) the merged results have ≤ τ triples; (iii) the merged results have ≤ δ aggregations. An example of merging two query substructures that share a variable is shown in Figure 6; a sketch of the scoring and merging procedure is given below.

[Figure 6: Merge results for two query substructures that share the same variable]
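The sketch below (ours) implements Eq. (7) and the loop of Algorithm 1. Structures are frozensets of triple patterns, so the subset test stands in for the substructure relation ⪯, and the graph-overlay Merge with the filters of Section 3.4 is left as a callback.

```python
def score(structure, probs):
    """Eq. (7): joint probability over all frequent substructures.
    probs: dict substructure -> Pr[S* | y] from the predictors."""
    p = 1.0
    for sub, pr in probs.items():
        p *= pr if sub <= structure else (1.0 - pr)  # subset as 'contained'
    return p

def merge_substructures(probs, merge, theta=0.3, K=2):
    """Algorithm 1. merge(a, b) should return the set of structures
    obtained by overlaying a and b on shared vertices/labels, already
    filtered by the Section 3.4 restrictions (connected, <= tau
    triples, <= delta aggregations)."""
    fs_plus = {s for s, pr in probs.items() if pr > 0.5}   # line 1
    m = {s for s in probs if score(s, probs) > theta}      # line 2
    result = set(m)
    for _ in range(K):                                     # lines 3-7
        merged = set()
        for a in fs_plus:
            for b in m:
                merged |= merge(a, b)
        m = {s for s in merged if score(s, probs) > theta}
        result |= m
    return result                                          # line 8
```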

4 Experiments and Results

In this section, we introduce the query generation datasets and the state-of-the-art systems that we compare against. We first show the end-to-end results of the query generation task, and then perform a detailed analysis to show the effectiveness of each module. Question sets, source code and experimental results are available online.¹

4.1 Experimental Setup

Datasets. We employed the same datasets as Singh et al. (2018) and Zafar et al. (2018): (i) the large-scale complex question answering dataset (LC-QuAD) (Trivedi et al., 2017), containing 3,253 questions with non-empty results on DBpedia (2016-04), and (ii) the fifth edition of the question answering over linked data (QALD-5) dataset (Unger et al., 2015), containing 311 questions with non-empty results on DBpedia (2015-10). Both datasets are widely used in KBQA studies (Zou et al., 2014; Dubey et al., 2018) and have become benchmarks for some annual KBQA competitions.²³ We did not employ the WebQuestions (Berant et al., 2013) dataset, since approximately 85% of its questions are simple. Also, we did not employ the ComplexQuestions (Bao et al., 2016) and ComplexWebQuestions (Talmor and Berant, 2018) datasets, since the existing works on these datasets do not report formal query generation results, and it is difficult to separate the formal query generation component from the end-to-end KBQA systems in these works.

¹ http://ws.nju.edu.cn/SubQG/
² http://lc-quad.sda.tech
³ http://qald.aksw.org/index.php?q=5

2618 Table 1: Datasets and implementation details Table 2: Average F1-scores of query generation LC-QuAD QALD-5 LC-QuAD QALD-5 Sina (Shekarpour et al., 2015) 0.24 † 0.39 † No. of questions (complex) 3,253 (2,249) 311 (192) NLIWOD 4 0.48 † 0.49 † No. of query structures 35 52 SQG (Zafar et al., 2018) 0.75 † - No. of freq. substructures 37 10 CompQA (Luo et al., 2018) 0.772±0.014 0.511±0.043 Avg. training time 1,102s 272s SubQG (our approach) 0.846±0.016 0.624±0.030 † indicates results taken from Singh et al.(2018) and SQG. Avg. prediction time 0.291s 0.122s Avg. query generation time 0.356s 0.197s Table 3: Average F1-scores for complex questions LC-QuAD QALD-5 Implementation details All the experiments CompQA 0.673±0.009 0.260±0.082 were carried out on a machine with an Intel Xeon SubQG 0.779±0.017 0.392±0.156 E3-1225 3.2GHz processor, 32 GB of RAM, and coverage are limited by the rules and templates. an NVIDIA GTX1080Ti GPU. For the embed- Although both CompQA and SQG have a strong ding layer, we used random embedding. For each ability of generating candidate queries, they per- dataset, we performed 5-fold cross-validation with form not quite well in query ranking. According the train set (70%), development set (10%), and to our observation, the main reason is that these test set (20%). The threshold γ for frequent query approaches tried to learn entire representations substructures is set to 30, the maximum iteration for questions with different query structures (from number K for merging is set to 2, θ in Algorithm1 simple to complex) using a single network, thus, is set to 0.3, the maximum triple number τ for they may suffer from the lack of training data, merged results is set to 5, and the maximum aggre- especially for the questions with rarely appeared gation number δ is set to 2. Other detailed statis- structures. As a contrast, our approach leveraged tics are shown in Table1. multiple networks to learn predictors for different 4.2 End-to-End Results query substructures, and ranked query structures We compared SubQG with several existing ap- using combinational function, which gained a bet- proaches. SINA (Shekarpour et al., 2015) and ter performance. NLIWOD conduct query generation by prede- The results on QALD-5 dataset is not as high as fined rules and existing templates. SQG (Zafar the result on LC-QuAD. This is because QALD-5 et al., 2018) firstly generates candidate queries contains 11% of very difficult questions, requir- by finding valid walks containing all of enti- ing complex filtering conditions such as REGEX ties and properties mentioned in questions, and and numerical comparison. These questions are then ranks them based on Tree-LSTM similarity. currently beyond our approach’s ability. Also, the CompQA (Luo et al., 2018) is a KBQA system size of training data is significant smaller. which achieved state-of-the-art performance on 4.3 Detailed Analysis WebQuesions and ComplexQuestions over Free- base. We re-implemented its query generation 4.3.1 Ablation Study component for DBpedia, which generates candi- We compared the following settings of SubQG: date queries by staged query generation, and ranks Rank w/o substructures. We replaced the them using an encode-and-compare network. query substructure prediction and query structure The average F1-scores for the end-to-end query ranking module, by choosing an existing query generation task are reported in Table2. 
All structure in the training data for the input question, these results are based on the gold standard en- using a BiLSTM multiple classification network. tity/relation linking result as input. Our ap- Rank w/ substructures We removed the merg- proach SubQG outperformed all the comparative ing module described in Section 3.4. This setting approaches on both datasets. Furthermore, as the assumes that the appropriate query structure for an results shown in Table3, it gained a more sig- input question exists in the training data. nificant improvement on complex questions com- Merge query substructures This setting ig- pared with CompQA. nored existing query structures in the training data, Both SINA and NLIWOD did not employ a and only considered the merged results of query query ranking mechanism, i.e., their accuracy and substructures. 4https://github.com/dice-group/NLIWOD As the results shown in Table4, the full ver-

As shown in Table 4, the full version of SubQG achieved the best results on both datasets. Rank w/o substructures gained comparatively low performance, especially where training data are inadequate (on QALD-5). Compared with Rank w/ substructures, SubQG gained a further improvement, which indicates that the merging method successfully handles questions with unseen query structures.

Table 4: Average F1-scores for different settings

                             LC-QuAD      QALD-5
  SubQG                      0.846±0.016  0.624±0.030
  Rank w/o substructures     0.756±0.012  0.383±0.024
  Rank w/ substructures      0.841±0.014  0.614±0.036
  Merge query substructures  0.679±0.020  0.454±0.055

Table 5 shows the accuracy of some alternative networks for query substructure prediction (Section 3.2). Removing the attention mechanism (replacing it with an unweighted average) decreased the accuracy by approximately 3%. Adding the part-of-speech tag sequence of the input question as an extra input gained no significant improvement. We also tried replacing the attention-based BiLSTM with the network of Yih et al. (2015), which encodes questions with a convolutional layer followed by a max-pooling layer; this approach did not perform well, since it cannot capture long-term dependencies.

Table 5: Accuracy of query substructure prediction

                              LC-QuAD      QALD-5
  BiLSTM w/ attention         0.929±0.002  0.816±0.010
  BiLSTM w/o attention        0.898±0.004  0.781±0.009
  BiLSTM w/ attention + POS   0.925±0.004  0.818±0.007
  CNN in Yih et al. (2015)    0.856±0.006  0.740±0.010

4.3.2 Results with Noisy Linking

We simulated a real KBQA environment by considering noisy entity/relation linking results. We first mixed the correct linking result for each mention with the top-5 candidates generated by EARL (Dubey et al., 2018), a joint entity/relation linking system with state-of-the-art performance on LC-QuAD; the result is shown in the second row of Table 6. Although the precision of the first output declined by 11.4%, in 85% of the cases we could still generate the correct answer in the top 5. This is because SubQG ranks query structures first and considers linking results only in the last step; many erroneous linking results can be filtered out by the empty-query check or the domain/range check.

We also tested the performance of our approach using only the EARL linking results. The performance dropped dramatically in comparison to the first two rows. The main reason is that, for 82.8% of the questions, EARL provided only partially correct results. If we consider just the remaining questions, our system again has 73.2% and 84.8% correctly-generated queries in the top-1 and top-5 output, respectively.

Table 6: Average Precision@k scores of query generation on LC-QuAD with noisy linking

                              Precision@1  Precision@5
  Gold standard               0.842±0.017  0.886±0.014
  Top-5 EARL + gold standard  0.728±0.011  0.850±0.009
  Top-5 EARL                  0.126±0.012  0.146±0.010

4.3.3 Results on Varied Sizes of Training Data

We tested the performance of SubQG with different sizes of training data. The results on the LC-QuAD dataset are shown in Figure 7. With more training data, our query substructure based approaches obtained stable improvements in both precision and recall. Although the merging module impaired the overall precision a little, it brought a larger improvement in recall, especially when training data were very scarce. Generally speaking, equipped with the merging module, our substructure-based query generation approach showed the best performance.

4.3.4 Error Analysis

We analyzed 100 randomly sampled questions for which SubQG did not return correct answers. The major causes of errors are summarized as follows:

Query structure errors (71%) occurred for multiple reasons. First, 21% of the error cases had entity mentions that were not correctly detected before query substructure prediction, which strongly influenced the prediction results. Second, in 39% of the cases, some of the substructure predictors gave wrong predictions, which led to wrong structure ranking results. Finally, in the remaining 11% of the cases, the correct query structure did not appear in the training data and could not be generated by merging substructures.

Grounding errors (29%) occurred when SubQG generated wrong queries with correct query structures. For example, for the question "Was Kevin Rudd the prime minister of Julia Gillard", SubQG cannot distinguish ⟨JG, primeMinister, KR⟩ from ⟨KR, primeMinister, JG⟩, since both triples exist in DBpedia. We believe that extra training data are required to fix this problem.

2620 0.9 0.9 0.9 0.813 0.842 0.846 0.839 0.853 0.822 0.8 0.849 0.746 0.829 0.764 0.793 0.790 0.8 0.841 0.8 0.771 0.831 0.685 0.719 0.814 0.7 0.714 0.746 0.783 0.7 0.751 0.756 0.757 0.703 0.766 Recall 0.6 0.717 F1-Score Precision 0.630 0.7 0.732 0.580 0.662 0.6 0.655 0.5 0.681 0.486 0.6 0.640 0.4 0.5 0.552 20% 40% 60% 80% 20% 40% 60% 80% 20% 40% 60% 80% Proportion of training data Proportion of training data Proportion of training data Figure 7: Precision, recall and F1-score with varied proportions of training data

5 Related Work

Alongside entity and relation linking, existing KBQA systems often leverage formal query generation for complex question answering (Bao et al., 2016; Trivedi et al., 2017). Based on our investigation, query generation approaches can be roughly divided into two kinds: template-based and semantic parsing-based.

Template-based approaches transform the input question into a formal query by employing pre-collected query templates. Cui et al. (2017) collect different natural language expressions for the same query intention from question-answer pairs. Singh et al. (2018) re-implement and evaluate the query generation module of NLIWOD, which selects an existing template by some simple features such as the number of entities and relations in the input question. Recently, several query decomposition methods have been studied to enlarge the coverage of the templates. Abujabal et al. (2017) present a KBQA system named QUINT, which collects query templates for specific dependency structures from question-answer pairs; furthermore, it rewrites the dependency parsing results of questions with conjunctions, and then performs sub-question answering and answer stitching. Zheng et al. (2018) decompose questions by using a huge number of triple-level templates extracted by distant supervision. Compared with these approaches, our approach predicts all kinds of query substructures (usually of 1 to 4 triples) contained in the question, making full use of the training data. Also, our merging method can handle questions with unseen query structures, giving it larger coverage and more stable performance.

Semantic parsing-based approaches translate questions into formal queries using bottom-up parsing (Berant et al., 2013) or staged query graph generation (Yih et al., 2015). gAnswer (Zou et al., 2014; Hu et al., 2018) builds a semantic query graph for question analysis and utilizes subgraph matching for disambiguation. Recent studies combine parsing-based approaches with neural networks to enhance structure disambiguation. Bao et al. (2016), Luo et al. (2018) and Zafar et al. (2018) build query graphs by staged query generation, and follow an encode-and-compare framework to rank candidate queries with neural networks. These approaches try to learn entire representations for questions with different query structures using a single network; thus, they may suffer from the lack of training data, especially for questions with rarely appearing structures. By contrast, our approach utilizes multiple networks to learn predictors for different query substructures, which gains stable performance with limited training data. Also, our approach does not require manually-written rules and performs stably with noisy linking results.

6 Conclusion

In this paper, we introduced SubQG, a formal query generation approach based on frequent query substructures. SubQG first utilizes multiple neural networks to predict the query substructures contained in a question, and then ranks existing query structures using a combinational function. Moreover, SubQG merges query substructures to build new query structures for questions without appropriate query structures in the training data. Our experiments showed that SubQG achieves superior results over existing approaches, especially for complex questions.

In future work, we plan to add support for other complex questions whose queries require UNION, GROUP BY or numerical comparison. We are also interested in mining natural language expressions for each query substructure, which may help current parsing approaches.

7 Acknowledgments

This work is supported by the National Natural Science Foundation of China (Nos. 61772264 and 61872172). We would like to thank Yao Zhao for his help in preparing the evaluation.

References

Abdalghani Abujabal, Mohamed Yahya, Mirek Riedewald, and Gerhard Weikum. 2017. Automated template generation for question answering over knowledge graphs. In Proceedings of the 26th International Conference on World Wide Web, WWW 2017, pages 1191–1200.

Jun-Wei Bao, Nan Duan, Zhao Yan, Ming Zhou, and Tiejun Zhao. 2016. Constraint-based question answering with knowledge graph. In Proceedings of the 26th International Conference on Computational Linguistics, COLING 2016, pages 2503–2514.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, pages 1533–1544.

Wanyun Cui, Yanghua Xiao, Haixun Wang, Yangqiu Song, Seung-won Hwang, and Wei Wang. 2017. KBQA: Learning question answering over QA corpora and knowledge bases. Proceedings of the VLDB Endowment, 10(5):565–576.

Mohnish Dubey, Debayan Banerjee, Debanjan Chaudhuri, and Jens Lehmann. 2018. EARL: Joint entity and relation linking for question answering over knowledge graphs. In Proceedings of the 17th International Semantic Web Conference, ISWC 2018, Part I, pages 108–126.

Sen Hu, Lei Zou, Jeffrey Xu Yu, Haixun Wang, and Dongyan Zhao. 2018. Answering natural language questions by subgraph matching over knowledge graphs. IEEE Transactions on Knowledge and Data Engineering, 30(5):824–837.

Kangqi Luo, Fengli Lin, Xusheng Luo, and Kenny Q. Zhu. 2018. Knowledge base question answering via encoding of complex query graphs. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018, pages 2185–2194.

Colin Raffel and Daniel P. W. Ellis. 2015. Feed-forward networks with attention can solve some long-term memory problems. CoRR, abs/1512.08756.

Saeedeh Shekarpour, Edgard Marx, Axel-Cyrille Ngonga Ngomo, and Sören Auer. 2015. SINA: Semantic interpretation of user queries for question answering on interlinked data. Journal of Web Semantics, 30:39–51.

Kuldeep Singh, Arun Sethupat Radhakrishna, Andreas Both, Saeedeh Shekarpour, Ioanna Lytra, Ricardo Usbeck, Akhilesh Vyas, Akmal Khikmatullaev, Dharmen Punjani, Christoph Lange, Maria-Esther Vidal, Jens Lehmann, and Sören Auer. 2018. Why reinvent the wheel: Let's build question answering systems together. In Proceedings of the 27th International Conference on World Wide Web, TheWebConf 2018, pages 1247–1256.

Alon Talmor and Jonathan Berant. 2018. The Web as a knowledge-base for answering complex questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, pages 641–651.

Priyansh Trivedi, Gaurav Maheshwari, Mohnish Dubey, and Jens Lehmann. 2017. LC-QuAD: A corpus for complex question answering over knowledge graphs. In Proceedings of the 16th International Semantic Web Conference, ISWC 2017, Part II, pages 210–218.

Christina Unger, Corina Forascu, Vanessa López, Axel-Cyrille Ngonga Ngomo, Elena Cabrio, Philipp Cimiano, and Sebastian Walter. 2015. Question answering over linked data (QALD-5). In Working Notes of the Conference and Labs of the Evaluation Forum, CLEF 2015.

Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, ACL 2015, pages 1321–1331.

Hamid Zafar, Giulio Napolitano, and Jens Lehmann. 2018. Formal query generation for question answering over knowledge bases. In Proceedings of the 15th Extended Semantic Web Conference, ESWC 2018, pages 714–728.

Weiguo Zheng, Jeffrey Xu Yu, Lei Zou, and Hong Cheng. 2018. Question answering over knowledge graphs: Question understanding via template decomposition. Proceedings of the VLDB Endowment, 11(11):1373–1386.

Lei Zou, Ruizhe Huang, Haixun Wang, Jeffrey Xu Yu, Wenqiang He, and Dongyan Zhao. 2014. Natural language question answering over RDF: A graph data driven approach. In Proceedings of the International Conference on Management of Data, SIGMOD 2014, pages 313–324.
