Leveraging Frequent Query Substructures to Generate Formal Queries for Complex Question Answering

Jiwei Ding, Wei Hu∗, Qixin Xu, Yuzhong Qu∗
State Key Laboratory for Novel Software Technology, Nanjing University, China
{jwding, qxxu}[email protected], {whu, yzqu}@nju.edu.cn
∗ Corresponding authors

Abstract

Formal query generation aims to generate correct executable queries for question answering over knowledge bases (KBs), given entity and relation linking results. Current approaches build universal paraphrasing or ranking models for whole questions, which are likely to fail in generating queries for complex, long-tail questions. In this paper, we propose SubQG, a new query generation approach based on frequent query substructures, which helps rank the existing (but nonsignificant) query structures or build new query structures. Our experiments on two benchmark datasets show that our approach significantly outperforms the existing ones, especially for complex questions. Also, it achieves promising performance with limited training data and noisy entity/relation linking results.

1 Introduction

Knowledge-based question answering (KBQA) aims to answer natural language questions over knowledge bases (KBs) such as DBpedia and Freebase. Formal query generation is an important component in many KBQA systems (Bao et al., 2016; Cui et al., 2017; Luo et al., 2018), especially for answering complex questions. Given entity and relation linking results, formal query generation aims to generate correct executable queries, e.g., SPARQL queries, for the input natural language questions. An example question and its formal query are shown in Figure 1.

[Figure 1: An example of a complex question and its query. Question: "How many movies have the same director as The Shawshank Redemption?"; ?Var2 is the COUNT of ?Var1, ?Var1 ISA dbo:Film, and both ?Var1 and dbr:TSR have dbo:director ?Var3]

Generally speaking, formal query generation is expected to include, but not be limited to, the capabilities of (i) recognizing and paraphrasing different kinds of constraints, including triple-level constraints (e.g., "movies" corresponds to a typing constraint for the target variable) and higher-level constraints (e.g., subgraphs; for instance, "the same ... as" represents the complex structure shown in the middle of Figure 1); (ii) recognizing and paraphrasing aggregations (e.g., "how many" corresponds to COUNT); and (iii) organizing all of the above to generate an executable query (Singh et al., 2018; Zafar et al., 2018).

There are mainly two kinds of query generation approaches for complex questions. (i) Template-based approaches choose a pre-collected template for query generation (Cui et al., 2017; Abujabal et al., 2017). Such approaches rely heavily on the coverage of the templates and perform unstably when some complex templates have very few natural language questions as training data. (ii) Approaches based on semantic parsing and neural networks learn entire representations for questions with different query structures, using a neural network that follows the encode-and-compare framework (Luo et al., 2018; Zafar et al., 2018). They may suffer from the lack of training data, especially for long-tail questions with rarely appearing structures. Furthermore, neither kind of approach can handle questions with unseen query structures, since neither can generate new query structures.

To cope with the above limitations, we propose a new query generation approach based on the following observation: the query structure for a complex question may rarely appear, but it usually contains some substructures that frequently appear in other questions. For example, the query structure for the question in Figure 1 appears rarely; however, both "how many movies" and "the same ... as" are common expressions, which correspond to the two query substructures shown in dashed boxes in Figure 2.

To collect such frequently-appearing substructures, we automatically decompose the query structures in the training data. Instead of directly modeling the query structure for a given question as a whole, we employ multiple neural networks to predict the query substructures contained in the question, each of which delivers a part of the query intention. Then, we select an existing query structure for the input question using a combinational ranking function. Also, in some cases, no existing query structure is appropriate for the input question. To cope with this issue, we merge query substructures to build new query structures. The contributions of this paper are summarized below:

• We formalize the notion of query structures and define the substructure relationship between query structures.

• We propose a novel approach for formal query generation, which first leverages multiple neural networks to predict the query substructures contained in the given question, and then ranks existing query structures using a combinational function.

• We merge query substructures to build new query structures, which handles questions with unseen query structures.

• We perform extensive experiments on two KBQA datasets, and show that SubQG significantly outperforms the existing approaches. Furthermore, SubQG achieves promising performance with limited training data and noisy entity/relation linking results.

2 Preliminaries

An entity is typically denoted by a URI and described with a set of properties and values. A fact is an ⟨entity, property, value⟩ triple, where the value can be either a literal or another entity. A KB is a pair K = (E, F), where E denotes the set of entities and F denotes the set of facts.

A formal query (or simply query) is the structured representation of a natural language question executable on a given KB. Formally, a query is a pair Q = (V, T), where V denotes the set of vertices and T denotes the set of labeled edges. A vertex can be a variable, an entity or a literal, and the label of an edge can be either a built-in property or a user-defined one. For simplicity, the set of all edge labels of Q is denoted by Le(Q). In this paper, the built-in properties include COUNT, AVG, MAX, MIN, MAXATN, MINATN and ISA (rdf:type), where the former four are used to connect two variables. For example, ⟨?Var1, COUNT, ?Var2⟩ represents that ?Var2 is the counting result of ?Var1. MAXATN and MINATN take the meaning of ORDER BY in SPARQL (Bao et al., 2016); for instance, ⟨?Var1, MAXATN, 2⟩ means ORDER BY DESC(?Var1) LIMIT 1 OFFSET 1.

To classify queries with similar intentions and narrow the search space for query generation, we introduce the notion of query structures. A query structure is a set of structurally-equivalent queries. Let Qa = (Va, Ta) and Qb = (Vb, Tb) denote two queries. Qa is structurally-equivalent to Qb, denoted by Qa ≅ Qb, if and only if there exist two bijections f : Va → Vb and g : Le(Qa) → Le(Qb) such that:

(i) ∀v ∈ Va, v is a variable ⇔ f(v) is a variable;

(ii) ∀r ∈ Le(Qa), r is a user-defined property ⇔ g(r) is a user-defined property; if r is a built-in property, then g(r) = r;

(iii) ∀v ∀r ∀v′, ⟨v, r, v′⟩ ∈ Ta ⇔ ⟨f(v), g(r), f(v′)⟩ ∈ Tb.

The query structure for Qa is denoted by Sa = [Qa], which contains all the queries structurally-equivalent to Qa. For graphical illustration, we represent a query structure by a representative query among the structurally-equivalent ones and replace entities and literals with different kinds of placeholders. An example of a query and its query structure is shown in the upper half of Figure 2.

[Figure 2: Illustration of (a) a query, (b) its query structure, and (c) query substructures, for the question "How many movies were directed by the graduate of Burbank High School?"]

For many simple questions, two query structures, i.e., ({?Var1, Ent1}, {⟨?Var1, Prop1, Ent1⟩}) and ({?Var1, Ent1}, {⟨Ent1, Prop1, ?Var1⟩}), are sufficient. A brute-force sketch of the structural equivalence test is given below.
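To make the definition concrete, the following is a minimal Python sketch of the structural equivalence test (our illustration, not the authors' code). A query is represented as a set of ⟨subject, label, object⟩ triples, and the sketch searches for bijections f and g satisfying conditions (i)-(iii); it is exponential in the number of vertices, which is acceptable for the small queries considered here.

```python
from itertools import permutations

BUILTINS = {"COUNT", "AVG", "MAX", "MIN", "MAXATN", "MINATN", "ISA"}

def is_var(vertex):
    # The paper writes variables with a leading '?', e.g. '?Var1'.
    return str(vertex).startswith("?")

def structurally_equivalent(qa, qb):
    """Brute-force test of Qa ~= Qb via conditions (i)-(iii).

    qa, qb: sets of (subject, label, object) triples; the vertex sets
    are implicit. Exponential in |V|, fine for queries of a few triples.
    """
    va = sorted({x for (s, _, o) in qa for x in (s, o)})
    vb = sorted({x for (s, _, o) in qb for x in (s, o)})
    la = sorted({r for (_, r, _) in qa if r not in BUILTINS})
    lb = sorted({r for (_, r, _) in qb if r not in BUILTINS})
    if (len(va), len(la), len(qa)) != (len(vb), len(lb), len(qb)):
        return False
    for vperm in permutations(vb):
        f = dict(zip(va, vperm))                # candidate bijection f
        if any(is_var(v) != is_var(f[v]) for v in va):
            continue                            # violates condition (i)
        for lperm in permutations(lb):
            g = dict(zip(la, lperm))            # bijection g; built-ins
            image = {(f[s], g.get(r, r), f[o])  # map to themselves (ii)
                     for (s, r, o) in qa}
            if image == set(qb):                # condition (iii)
                return True
    return False
```

For the two one-triple structures above, structurally_equivalent({("?Var1", "Prop1", "Ent1")}, {("Ent1", "Prop1", "?Var1")}) returns False: no variable-preserving bijection aligns them, since edge direction matters, which is why both structures are needed.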

However, for complex questions, a diversity of query structures exists, and some of them share a set of frequently-appearing substructures, each of which delivers a part of the query intention. We define query substructures as follows.

Let Sa = [Qa] and Sb = [Qb] denote two query structures. Sa is a query substructure of Sb, denoted by Sa ⪯ Sb, if and only if Qb has a subgraph Qc such that Qa ≅ Qc. Furthermore, if Sa = [Qa] ⪯ Sb = [Qb], we say that Qb has Sa, and that Sa is contained in Qb.

For example, although the query structures for the two questions in Figures 1 and 2 are different, they share the query substructure ({?Var1, ?Var2, Class1}, {⟨?Var1, COUNT, ?Var2⟩, ⟨?Var1, ISA, Class1⟩}), which corresponds to the phrase "how many movies". Note that a query substructure can itself be the query structure of another question.

The goal of this paper is to leverage a set of frequent query (sub-)structures to generate formal queries for answering complex questions.

3 The Proposed Approach

In this section, we present our approach, SubQG, for query generation. We first introduce the framework and its general steps with a running example (Section 3.1), and then describe the important steps in detail in the following subsections.

3.1 Framework

Figure 3 depicts the framework of SubQG, which comprises an offline training process and an online query generation process.

[Figure 3: Framework of the proposed approach: (a) offline training, (b) online query generation]

Offline. The offline process takes as input a set of training data in the form of ⟨question, query⟩ pairs, and mainly contains three steps:

1. Collect query structures. For the questions in the training data, we first discover the structurally-equivalent queries, and then extract the set of all query structures, denoted by TS.

2. Collect frequent query substructures. We decompose each query structure Si = (Vi, Ti) ∈ TS to obtain the set of all its query substructures. Let Tj be a non-empty subset of Ti, and let VTj be the set of vertices used in Tj. By the definition above, Sj = (VTj, Tj) is a query substructure of Si, so we can generate all query substructures of Si from the subsets of Ti. Disconnected query substructures are ignored, since they express discontinuous meanings and should be split into smaller query substructures. If more than γ queries in the training data have the substructure Sj, we consider Sj a frequent query substructure. The set of all frequent query substructures is denoted by FS∗. A sketch of this decomposition is given below.
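The decomposition in step 2 can be read as the following sketch (ours, not the authors' code). For brevity, it counts syntactically identical substructures, whereas the paper counts them up to structural equivalence, e.g., by grouping with structurally_equivalent above.

```python
from collections import Counter
from itertools import combinations

def is_connected(triples):
    """Is the (undirected) query graph of these triples connected?"""
    verts = {x for (s, _, o) in triples for x in (s, o)}
    seen, stack = set(), [next(iter(verts))]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack += [o for (s, _, o) in triples if s == v]
            stack += [s for (s, _, o) in triples if o == v]
    return seen == verts

def substructures(triples):
    """One substructure per non-empty subset of the triples,
    discarding disconnected subsets (as the paper prescribes)."""
    for k in range(1, len(triples) + 1):
        for subset in combinations(sorted(triples), k):
            if is_connected(subset):
                yield frozenset(subset)

def frequent_substructures(train_structures, gamma=30):
    """Substructures contained in more than gamma training queries;
    gamma = 30 is the threshold reported in Section 4.1."""
    counts = Counter()
    for s in train_structures:
        counts.update(set(substructures(s)))
    return [sub for sub, c in counts.items() if c > gamma]
```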

3. Train query substructure predictors. We train a neural network for each query substructure S*_i ∈ FS∗ to predict the probability that Q^y has S*_i (i.e., S*_i ⪯ [Q^y]) for an input question y, where Q^y denotes the formal query for y. Details of this step are described in Section 3.2.

Online. The online query generation process takes as input a natural language question y, and mainly contains four steps:

1. Predict query substructures. We first predict the probability that S*_i ⪯ [Q^y] for each S*_i ∈ FS∗, using the query substructure predictors trained in the offline step. An example question and the four query substructures with the highest prediction probabilities are shown in the top of Figure 4.

[Figure 4: An example for online query generation, showing the predicted substructures, the ranked and merged query structures, and the grounding and validation results for "How many movies have the same director as The Shawshank Redemption?"]

2. Rank existing query structures. To find an appropriate query structure for the input question, we rank the existing query structures (Si ∈ TS) using a scoring function; see Section 3.3.

3. Merge query substructures. Considering that the target query structure [Q^y] may not appear in TS (i.e., no query in the training data is structurally-equivalent to Q^y), we design a method (described in Section 3.4) to merge question-contained query substructures into new query structures. The merged results are ranked using the same function as the existing query structures. Several query structures (including the merged results and the existing query structures) for the example question are shown in the middle of Figure 4.

4. Grounding and validation. We leverage the query structure ranking result, along with the entity/relation linking results from existing black-box systems (Dubey et al., 2018), to generate an executable formal query for the input question. For each query structure, we try all possible combinations of the linking results in descending order of the overall linking score, and perform validation, including a grammar check, a domain/range check and an empty-query check. The first non-empty query passing all validations is taken as the output of SubQG. The grounding and validation results for the example question are shown in the bottom of Figure 4. A sketch of this grounding loop is given below.
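The following is one concrete reading of the grounding step (a sketch under our own naming, not the authors' code). The schema check and query execution are left as callbacks, e.g., backed by a SPARQL endpoint, and combining per-mention linking scores by product is one simple choice that the paper does not pin down.

```python
from itertools import product
from math import prod

def instantiate(structure, binding):
    """Fill placeholders (Ent1, Prop1, Class1, ...) with linked URIs."""
    return [tuple(binding.get(x, x) for x in t) for t in structure]

def ground(structure, linking, execute, schema_ok):
    """Return the first non-empty, valid grounding of a query structure.

    linking:   dict placeholder -> list of (uri, score) candidates.
    execute:   runs the instantiated query on the KB, returns result rows.
    schema_ok: grammar and domain/range validation callback.
    """
    slots = sorted(linking)
    combos = product(*(linking[s] for s in slots))
    for combo in sorted(combos, key=lambda c: -prod(s for (_, s) in c)):
        binding = dict(zip(slots, (uri for (uri, _) in combo)))
        query = instantiate(structure, binding)
        if not schema_ok(query):   # e.g., dbo:Film is not the range
            continue               # of dbo:director, as in Figure 4
        rows = execute(query)
        if rows:                   # empty-query check
            return query, rows
    return None, []
```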

3.2 Query Substructure Prediction

In this step, we employ an attention-based BiLSTM network (Raffel and Ellis, 2015) to predict Pr[S*_i | y] for each frequent query substructure S*_i ∈ FS∗, where Pr[S*_i | y] represents the probability of S*_i ⪯ [Q^y]. There are mainly three reasons why we use one predictor per query substructure instead of a multi-tag predictor for all query substructures: (i) a query substructure usually expresses only part of the meaning of the input question, and different query substructures may focus on different words or phrases, so each predictor should have its own attention matrix; (ii) a multi-tag predictor may have lower accuracy, since each tag has unbalanced training data; (iii) a single pre-trained query substructure predictor from one dataset can be directly reused on another without adjusting the network structure, whereas a multi-tag predictor needs its output layer resized and retrained whenever the set of frequent query substructures changes.

The structure of the network is shown in Figure 5.

[Figure 5: Attention-based BiLSTM network, with 100-dimensional embeddings, 200-dimensional BiLSTM states h_1..h_T, attention weights α_1..α_T, and output Pr[S*_i | y]]

Before the input question is fed into the network, we replace all entity mentions with ⟨Entity⟩ using EARL (Dubey et al., 2018), to enhance the generalization ability. Given the question sequence {w_1, ..., w_T}, we first use a word embedding matrix to convert the original sequence into word vectors {e_1, ..., e_T}, followed by a BiLSTM network that generates a context-sensitive representation {h_1, ..., h_T} for each word, where

\[ h_t = [\overrightarrow{\mathrm{LSTM}}(e_t, h_{t-1});\ \overleftarrow{\mathrm{LSTM}}(e_t, h_{t+1})]. \tag{1} \]

Then, the attention mechanism takes each h_t as input and calculates a weight α_t for it, formulated as follows:

\[ \alpha_t = \frac{e^{\mathrm{Att}(h_t)}}{\sum_{k=1}^{T} e^{\mathrm{Att}(h_k)}}, \tag{2} \]

\[ \mathrm{Att}(h_t) = v_{\mathrm{att}}^{\top} \tanh(W_{\mathrm{att}} h_t + b_{\mathrm{att}}), \tag{3} \]

where W_att ∈ R^{|h_t|×|h_t|}, b_att ∈ R^{|h_t|} and v_att ∈ R^{|h_t|}. Next, we obtain the representation of the whole question, q^c, as the weighted sum of the h_t:

\[ q^c = \sum_{t=1}^{T} \alpha_t h_t. \tag{4} \]

The output of the network is a probability

\[ \Pr[S_i^* \mid y] = \sigma(v_{\mathrm{out}}^{\top} q^c + b_{\mathrm{out}}), \tag{5} \]

where v_out ∈ R^{|q^c|} and b_out ∈ R. The loss function minimized during training is the binary cross-entropy

\[ \mathrm{Loss}(S_i^*) = -\sum_{\substack{(y, Q^y) \in \mathrm{Train} \\ \text{s.t. } S_i^* \preceq [Q^y]}} \log \Pr[S_i^* \mid y] \; - \sum_{\substack{(y, Q^y) \in \mathrm{Train} \\ \text{s.t. } S_i^* \not\preceq [Q^y]}} \log (1 - \Pr[S_i^* \mid y]), \tag{6} \]

where Train denotes the set of training data. A sketch of this network is given below.
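A minimal PyTorch sketch of one predictor, matching Eqs. (1)-(5) and the dimensions in Figure 5 (our illustration; the class and parameter names are ours). Training one instance per frequent query substructure with nn.BCELoss corresponds to Eq. (6).

```python
import torch
import torch.nn as nn

class SubstructurePredictor(nn.Module):
    """Binary predictor for one frequent query substructure S*_i."""

    def __init__(self, vocab_size, emb_dim=100, hidden=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # Eq. (1): h_t concatenates forward and backward LSTM states,
        # so d = 2 * hidden = 200, as in Figure 5.
        self.lstm = nn.LSTM(emb_dim, hidden,
                            bidirectional=True, batch_first=True)
        d = 2 * hidden
        self.W_att = nn.Linear(d, d)              # W_att, b_att in Eq. (3)
        self.v_att = nn.Linear(d, 1, bias=False)  # v_att in Eq. (3)
        self.out = nn.Linear(d, 1)                # v_out, b_out in Eq. (5)

    def forward(self, tokens):                    # tokens: (batch, T) ids
        h, _ = self.lstm(self.emb(tokens))        # (batch, T, d)
        att = self.v_att(torch.tanh(self.W_att(h)))       # Att(h_t), Eq. (3)
        alpha = torch.softmax(att, dim=1)         # Eq. (2), over time steps
        q_c = (alpha * h).sum(dim=1)              # Eq. (4): weighted sum
        return torch.sigmoid(self.out(q_c)).squeeze(-1)   # Eq. (5)
```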

Algorithm 1: Query substructure merging

  Input: question y, frequent query substructures FS∗
  1  FS+ := {S*_i ∈ FS∗ | Pr[S*_i | y] > 0.5};
  2  M(0) := {S*_i ∈ FS∗ | Score[S*_i | y] > θ};
  3  for i = 1 to K do            // K is the maximum number of iterations
  4      M(i) := ∅;
  5      forall S*_j ∈ FS+, S_l ∈ M(i−1) do
  6          M(i) := M(i) ∪ Merge(S*_j, S_l);
  7      M(i) := {S_l ∈ M(i) | Score[S_l | y] > θ};
  8  return ∪_{i=0}^{K} M(i);

3.3 Query Structure Ranking

In this step, we use a combinational function to score each query structure in the training data for the input question. Since the prediction results for the query substructures are independent, the score for a query structure Si is measured by the joint probability

\[ \mathrm{Score}(S_i \mid y) = \prod_{\substack{S_j^* \in FS^* \\ \text{s.t. } S_j^* \preceq S_i}} \Pr[S_j^* \mid y] \;\times \prod_{\substack{S_j^* \in FS^* \\ \text{s.t. } S_j^* \not\preceq S_i}} (1 - \Pr[S_j^* \mid y]). \tag{7} \]

Assume that Q^y ∈ Si. Then for every S*_j ⪯ Si we have S*_j ⪯ [Q^y], so Pr[S*_j | y] should be 1 in the ideal condition; conversely, for every S*_j ⋠ Si, Pr[S*_j | y] should be 0. Thus, in the ideal condition, Score(Si | y) = 1 and Score(Sk | y) = 0 for every Sk ≠ Si.

3.4 Query Substructure Merging

We propose a method, shown in Algorithm 1, to merge question-contained query substructures into new query structures. In the initialization step, it selects some query substructures with high scores as candidates, since a query substructure may directly be the appropriate query structure for the input question. In each iteration, the method merges each question-contained substructure with the existing candidates, and the merged results with high scores are used as candidates in the next iteration. The final output is the union of the results from at most K iterations.

When merging different query substructures, we allow them to share some vertices of the same kind (variable, entity, etc.) or edge labels, except the variables that represent aggregation results. Thus, the merged result of two query substructures is a set of query structures instead of a single one. Also, the following restrictions are used to filter the merged results: (i) the merged results should be connected; (ii) the merged results have ≤ τ triples; (iii) the merged results have ≤ δ aggregations. An example of merging two query substructures that share a variable is shown in Figure 6; a sketch of the scoring and merging procedure is given below.

[Figure 6: Merge results for two query substructures that share the same variable]
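The sketch below (ours) implements Eq. (7) and the loop of Algorithm 1. Structures are frozensets of triple patterns, so the subset test stands in for the substructure relation ⪯, and the graph-overlay Merge with the filters of Section 3.4 is left as a callback.

```python
def score(structure, probs):
    """Eq. (7): joint probability over all frequent substructures.
    probs: dict substructure -> Pr[S* | y] from the predictors."""
    p = 1.0
    for sub, pr in probs.items():
        p *= pr if sub <= structure else (1.0 - pr)  # subset as 'contained'
    return p

def merge_substructures(probs, merge, theta=0.3, K=2):
    """Algorithm 1. merge(a, b) should return the set of structures
    obtained by overlaying a and b on shared vertices/labels, already
    filtered by the Section 3.4 restrictions (connected, <= tau
    triples, <= delta aggregations)."""
    fs_plus = {s for s, pr in probs.items() if pr > 0.5}   # line 1
    m = {s for s in probs if score(s, probs) > theta}      # line 2
    result = set(m)
    for _ in range(K):                                     # lines 3-7
        merged = set()
        for a in fs_plus:
            for b in m:
                merged |= merge(a, b)
        m = {s for s in merged if score(s, probs) > theta}
        result |= m
    return result                                          # line 8
```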

4 Experiments and Results

In this section, we introduce the query generation datasets and the state-of-the-art systems that we compare against. We first show the end-to-end results of the query generation task, and then perform a detailed analysis to show the effectiveness of each module. Question sets, source code and experimental results are available online.¹

4.1 Experimental Setup

Datasets. We employed the same datasets as Singh et al. (2018) and Zafar et al. (2018): (i) the large-scale complex question answering dataset (LC-QuAD) (Trivedi et al., 2017), containing 3,253 questions with non-empty results on DBpedia (2016-04), and (ii) the fifth edition of the question answering over linked data (QALD-5) dataset (Unger et al., 2015), containing 311 questions with non-empty results on DBpedia (2015-10). Both datasets are widely used in KBQA studies (Zou et al., 2014; Dubey et al., 2018) and have become benchmarks for some annual KBQA competitions.²³ We did not employ the WebQuestions (Berant et al., 2013) dataset, since approximately 85% of its questions are simple. Also, we did not employ the ComplexQuestions (Bao et al., 2016) and ComplexWebQuestions (Talmor and Berant, 2018) datasets, since the existing works on these datasets do not report formal query generation results, and it is difficult to separate the formal query generation component from the end-to-end KBQA systems in these works.

¹ http://ws.nju.edu.cn/SubQG/
² http://lc-quad.sda.tech
³ http://qald.aksw.org/index.php?q=5

2618 Table 1: Datasets and implementation details Table 2: Average F1-scores of query generation LC-QuAD QALD-5 LC-QuAD QALD-5 Sina (Shekarpour et al., 2015) 0.24 † 0.39 † No. of questions (complex) 3,253 (2,249) 311 (192) NLIWOD 4 0.48 † 0.49 † No. of query structures 35 52 SQG (Zafar et al., 2018) 0.75 † - No. of freq. substructures 37 10 CompQA (Luo et al., 2018) 0.772±0.014 0.511±0.043 Avg. training time 1,102s 272s SubQG (our approach) 0.846±0.016 0.624±0.030 † indicates results taken from Singh et al.(2018) and SQG. Avg. prediction time 0.291s 0.122s Avg. query generation time 0.356s 0.197s Table 3: Average F1-scores for complex questions LC-QuAD QALD-5 Implementation details All the experiments CompQA 0.673±0.009 0.260±0.082 were carried out on a machine with an Intel Xeon SubQG 0.779±0.017 0.392±0.156 E3-1225 3.2GHz processor, 32 GB of RAM, and coverage are limited by the rules and templates. an NVIDIA GTX1080Ti GPU. For the embed- Although both CompQA and SQG have a strong ding layer, we used random embedding. For each ability of generating candidate queries, they per- dataset, we performed 5-fold cross-validation with form not quite well in query ranking. According the train set (70%), development set (10%), and to our observation, the main reason is that these test set (20%). The threshold γ for frequent query approaches tried to learn entire representations substructures is set to 30, the maximum iteration for questions with different query structures (from number K for merging is set to 2, θ in Algorithm1 simple to complex) using a single network, thus, is set to 0.3, the maximum triple number τ for they may suffer from the lack of training data, merged results is set to 5, and the maximum aggre- especially for the questions with rarely appeared gation number δ is set to 2. Other detailed statis- structures. As a contrast, our approach leveraged tics are shown in Table1. multiple networks to learn predictors for different 4.2 End-to-End Results query substructures, and ranked query structures We compared SubQG with several existing ap- using combinational function, which gained a bet- proaches. SINA (Shekarpour et al., 2015) and ter performance. NLIWOD conduct query generation by prede- The results on QALD-5 dataset is not as high as fined rules and existing templates. SQG (Zafar the result on LC-QuAD. This is because QALD-5 et al., 2018) firstly generates candidate queries contains 11% of very difficult questions, requir- by finding valid walks containing all of enti- ing complex filtering conditions such as REGEX ties and properties mentioned in questions, and and numerical comparison. These questions are then ranks them based on Tree-LSTM similarity. currently beyond our approach’s ability. Also, the CompQA (Luo et al., 2018) is a KBQA system size of training data is significant smaller. which achieved state-of-the-art performance on 4.3 Detailed Analysis WebQuesions and ComplexQuestions over Free- base. We re-implemented its query generation 4.3.1 Ablation Study component for DBpedia, which generates candi- We compared the following settings of SubQG: date queries by staged query generation, and ranks Rank w/o substructures. We replaced the them using an encode-and-compare network. query substructure prediction and query structure The average F1-scores for the end-to-end query ranking module, by choosing an existing query generation task are reported in Table2. 
All structure in the training data for the input question, these results are based on the gold standard en- using a BiLSTM multiple classification network. tity/relation linking result as input. Our ap- Rank w/ substructures We removed the merg- proach SubQG outperformed all the comparative ing module described in Section 3.4. This setting approaches on both datasets. Furthermore, as the assumes that the appropriate query structure for an results shown in Table3, it gained a more sig- input question exists in the training data. nificant improvement on complex questions com- Merge query substructures This setting ig- pared with CompQA. nored existing query structures in the training data, Both SINA and NLIWOD did not employ a and only considered the merged results of query query ranking mechanism, i.e., their accuracy and substructures. 4https://github.com/dice-group/NLIWOD As the results shown in Table4, the full ver-

As shown in Table 4, the full version of SubQG achieved the best results on both datasets. Rank w/o substructures gained comparatively low performance, especially where training data are inadequate (on QALD-5). Compared with Rank w/ substructures, SubQG gained a further improvement, which indicates that the merging method successfully handles questions with unseen query structures.

Table 4: Average F1-scores for different settings

                             LC-QuAD      QALD-5
  SubQG                      0.846±0.016  0.624±0.030
  Rank w/o substructures     0.756±0.012  0.383±0.024
  Rank w/ substructures      0.841±0.014  0.614±0.036
  Merge query substructures  0.679±0.020  0.454±0.055

Table 5 shows the accuracy of some alternative networks for query substructure prediction (Section 3.2). Removing the attention mechanism (replacing it with an unweighted average) decreased the accuracy by approximately 3%. Adding the part-of-speech tag sequence of the input question as an extra input gained no significant improvement. We also tried replacing the attention-based BiLSTM with the network of Yih et al. (2015), which encodes questions with a convolutional layer followed by a max-pooling layer; this approach did not perform well, since it cannot capture long-term dependencies.

Table 5: Accuracy of query substructure prediction

                              LC-QuAD      QALD-5
  BiLSTM w/ attention         0.929±0.002  0.816±0.010
  BiLSTM w/o attention        0.898±0.004  0.781±0.009
  BiLSTM w/ attention + POS   0.925±0.004  0.818±0.007
  CNN in Yih et al. (2015)    0.856±0.006  0.740±0.010

4.3.2 Results with Noisy Linking

We simulated a real KBQA environment by considering noisy entity/relation linking results. We first mixed the correct linking result for each mention with the top-5 candidates generated by EARL (Dubey et al., 2018), a joint entity/relation linking system with state-of-the-art performance on LC-QuAD; the result is shown in the second row of Table 6. Although the precision of the first output declined by 11.4%, in 85% of the cases we could still generate the correct answer in the top 5. This is because SubQG ranks query structures first and considers linking results only in the last step; many erroneous linking results can be filtered out by the empty-query check or the domain/range check.

We also tested the performance of our approach using only the EARL linking results. The performance dropped dramatically in comparison to the first two rows. The main reason is that, for 82.8% of the questions, EARL provided only partially correct results. If we consider just the remaining questions, our system again has 73.2% and 84.8% correctly-generated queries in the top-1 and top-5 output, respectively.

Table 6: Average Precision@k scores of query generation on LC-QuAD with noisy linking

                              Precision@1  Precision@5
  Gold standard               0.842±0.017  0.886±0.014
  Top-5 EARL + gold standard  0.728±0.011  0.850±0.009
  Top-5 EARL                  0.126±0.012  0.146±0.010

4.3.3 Results on Varied Sizes of Training Data

We tested the performance of SubQG with different sizes of training data. The results on the LC-QuAD dataset are shown in Figure 7. With more training data, our query substructure based approaches obtained stable improvements in both precision and recall. Although the merging module impaired the overall precision a little, it brought a larger improvement in recall, especially when training data were very scarce. Generally speaking, equipped with the merging module, our substructure-based query generation approach showed the best performance.

4.3.4 Error Analysis

We analyzed 100 randomly sampled questions for which SubQG did not return correct answers. The major causes of errors are summarized as follows:

Query structure errors (71%) occurred for multiple reasons. First, 21% of the error cases had entity mentions that were not correctly detected before query substructure prediction, which strongly influenced the prediction results. Second, in 39% of the cases, some of the substructure predictors gave wrong predictions, which led to wrong structure ranking results. Finally, in the remaining 11% of the cases, the correct query structure did not appear in the training data and could not be generated by merging substructures.

Grounding errors (29%) occurred when SubQG generated wrong queries with correct query structures. For example, for the question "Was Kevin Rudd the prime minister of Julia Gillard", SubQG cannot distinguish ⟨JG, primeMinister, KR⟩ from ⟨KR, primeMinister, JG⟩, since both triples exist in DBpedia. We believe that extra training data are required to fix this problem.

2620 0.9 0.9 0.9 0.813 0.842 0.846 0.839 0.853 0.822 0.8 0.849 0.746 0.829 0.764 0.793 0.790 0.8 0.841 0.8 0.771 0.831 0.685 0.719 0.814 0.7 0.714 0.746 0.783 0.7 0.751 0.756 0.757 0.703 0.766 Recall 0.6 0.717 F1-Score Precision 0.630 0.7 0.732 0.580 0.662 0.6 0.655 0.5 0.681 0.486 0.6 0.640 0.4 0.5 0.552 20% 40% 60% 80% 20% 40% 60% 80% 20% 40% 60% 80% Proportion of training data Proportion of training data Proportion of training data Figure 7: Precision, recall and F1-score with varied proportions of training data

5 Related Work

Alongside entity and relation linking, existing KBQA systems often leverage formal query generation for complex question answering (Bao et al., 2016; Trivedi et al., 2017). Based on our investigation, query generation approaches can be roughly divided into two kinds: template-based and semantic parsing-based.

Template-based approaches transform the input question into a formal query by employing pre-collected query templates. Cui et al. (2017) collect different natural language expressions for the same query intention from question-answer pairs. Singh et al. (2018) re-implement and evaluate the query generation module of NLIWOD, which selects an existing template by some simple features such as the number of entities and relations in the input question. Recently, several query decomposition methods have been studied to enlarge the coverage of the templates. Abujabal et al. (2017) present a KBQA system named QUINT, which collects query templates for specific dependency structures from question-answer pairs; furthermore, it rewrites the dependency parsing results of questions with conjunctions, and then performs sub-question answering and answer stitching. Zheng et al. (2018) decompose questions by using a huge number of triple-level templates extracted by distant supervision. Compared with these approaches, our approach predicts all kinds of query substructures (usually of 1 to 4 triples) contained in the question, making full use of the training data. Also, our merging method can handle questions with unseen query structures, giving it larger coverage and more stable performance.

Semantic parsing-based approaches translate questions into formal queries using bottom-up parsing (Berant et al., 2013) or staged query graph generation (Yih et al., 2015). gAnswer (Zou et al., 2014; Hu et al., 2018) builds a semantic query graph for question analysis and utilizes subgraph matching for disambiguation. Recent studies combine parsing-based approaches with neural networks to enhance structure disambiguation. Bao et al. (2016), Luo et al. (2018) and Zafar et al. (2018) build query graphs by staged query generation, and follow an encode-and-compare framework to rank candidate queries with neural networks. These approaches try to learn entire representations for questions with different query structures using a single network; thus, they may suffer from the lack of training data, especially for questions with rarely appearing structures. By contrast, our approach utilizes multiple networks to learn predictors for different query substructures, which gains stable performance with limited training data. Also, our approach does not require manually-written rules and performs stably with noisy linking results.

6 Conclusion

In this paper, we introduced SubQG, a formal query generation approach based on frequent query substructures. SubQG first utilizes multiple neural networks to predict the query substructures contained in a question, and then ranks existing query structures using a combinational function. Moreover, SubQG merges query substructures to build new query structures for questions without appropriate query structures in the training data. Our experiments showed that SubQG achieves superior results over existing approaches, especially for complex questions.

In future work, we plan to add support for other complex questions whose queries require UNION, GROUP BY or numerical comparison. We are also interested in mining natural language expressions for each query substructure, which may help current parsing approaches.

7 Acknowledgments

This work is supported by the National Natural Science Foundation of China (Nos. 61772264 and 61872172). We would like to thank Yao Zhao for his help in preparing the evaluation.

References

Abdalghani Abujabal, Mohamed Yahya, Mirek Riedewald, and Gerhard Weikum. 2017. Automated template generation for question answering over knowledge graphs. In Proceedings of the 26th International Conference on World Wide Web, WWW 2017, pages 1191–1200.

Jun-Wei Bao, Nan Duan, Zhao Yan, Ming Zhou, and Tiejun Zhao. 2016. Constraint-based question answering with knowledge graph. In Proceedings of the 26th International Conference on Computational Linguistics, COLING 2016, pages 2503–2514.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, pages 1533–1544.

Wanyun Cui, Yanghua Xiao, Haixun Wang, Yangqiu Song, Seung-won Hwang, and Wei Wang. 2017. KBQA: Learning question answering over QA corpora and knowledge bases. Proceedings of the VLDB Endowment, 10(5):565–576.

Mohnish Dubey, Debayan Banerjee, Debanjan Chaudhuri, and Jens Lehmann. 2018. EARL: Joint entity and relation linking for question answering over knowledge graphs. In Proceedings of the 17th International Semantic Web Conference, ISWC 2018, Part I, pages 108–126.

Sen Hu, Lei Zou, Jeffrey Xu Yu, Haixun Wang, and Dongyan Zhao. 2018. Answering natural language questions by subgraph matching over knowledge graphs. IEEE Transactions on Knowledge and Data Engineering, 30(5):824–837.

Kangqi Luo, Fengli Lin, Xusheng Luo, and Kenny Q. Zhu. 2018. Knowledge base question answering via encoding of complex query graphs. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018, pages 2185–2194.

Colin Raffel and Daniel P. W. Ellis. 2015. Feed-forward networks with attention can solve some long-term memory problems. CoRR, abs/1512.08756.

Saeedeh Shekarpour, Edgard Marx, Axel-Cyrille Ngonga Ngomo, and Sören Auer. 2015. SINA: Semantic interpretation of user queries for question answering on interlinked data. Journal of Web Semantics, 30:39–51.

Kuldeep Singh, Arun Sethupat Radhakrishna, Andreas Both, Saeedeh Shekarpour, Ioanna Lytra, Ricardo Usbeck, Akhilesh Vyas, Akmal Khikmatullaev, Dharmen Punjani, Christoph Lange, Maria-Esther Vidal, Jens Lehmann, and Sören Auer. 2018. Why reinvent the wheel: Let's build question answering systems together. In Proceedings of the 27th International Conference on World Wide Web, TheWebConf 2018, pages 1247–1256.

Alon Talmor and Jonathan Berant. 2018. The Web as a knowledge-base for answering complex questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, pages 641–651.

Priyansh Trivedi, Gaurav Maheshwari, Mohnish Dubey, and Jens Lehmann. 2017. LC-QuAD: A corpus for complex question answering over knowledge graphs. In Proceedings of the 16th International Semantic Web Conference, ISWC 2017, Part II, pages 210–218.

Christina Unger, Corina Forascu, Vanessa López, Axel-Cyrille Ngonga Ngomo, Elena Cabrio, Philipp Cimiano, and Sebastian Walter. 2015. Question answering over linked data (QALD-5). In Working Notes of the Conference and Labs of the Evaluation Forum, CLEF 2015.

Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, ACL 2015, pages 1321–1331.

Hamid Zafar, Giulio Napolitano, and Jens Lehmann. 2018. Formal query generation for question answering over knowledge bases. In Proceedings of the 15th Extended Semantic Web Conference, ESWC 2018, pages 714–728.

Weiguo Zheng, Jeffrey Xu Yu, Lei Zou, and Hong Cheng. 2018. Question answering over knowledge graphs: Question understanding via template decomposition. Proceedings of the VLDB Endowment, 11(11):1373–1386.

Lei Zou, Ruizhe Huang, Haixun Wang, Jeffrey Xu Yu, Wenqiang He, and Dongyan Zhao. 2014. Natural language question answering over RDF: A graph data driven approach. In Proceedings of the International Conference on Management of Data, SIGMOD 2014, pages 313–324.
