
Knowledge-Based Question Answering as Machine Translation

Junwei Bao†*, Nan Duan‡, Ming Zhou‡, Tiejun Zhao†
†Harbin Institute of Technology    ‡Microsoft Research
[email protected]    {nanduan, mingzhou}@microsoft.com    [email protected]

[*] This work was finished while the author was visiting Microsoft Research Asia.

Abstract

A typical knowledge-based question answering (KB-QA) system faces two challenges: one is to transform natural language questions into their meaning representations (MRs); the other is to retrieve answers from knowledge bases (KBs) using the generated MRs. Unlike previous methods, which treat these two tasks in a cascaded manner, we present a translation-based approach that solves them in one unified framework. We translate questions to answers based on CYK parsing. Answers, as translations of the span covered by each CYK cell, are obtained by a question translation method, which first generates formal triple queries as MRs for the span based on question patterns and relation expressions, and then retrieves answers from a given KB using the generated triple queries. A linear model is defined over derivations, and minimum error rate training is used to tune feature weights based on a set of question-answer pairs. Compared to a KB-QA system using a state-of-the-art semantic parser, our method achieves better results.

1 Introduction

Knowledge-based question answering (KB-QA) computes answers to natural language (NL) questions based on existing knowledge bases (KBs). Most previous systems tackle this task in a cascaded manner: first, the input question is transformed into its meaning representation (MR) by an independent semantic parser (Zettlemoyer and Collins, 2005; Mooney, 2007; Artzi and Zettlemoyer, 2011; Liang et al., 2011; Cai and Yates, 2013; Poon, 2013; Artzi et al., 2013; Kwiatkowski et al., 2013; Berant et al., 2013); then, the answers are retrieved from existing KBs using the generated MRs as queries.

Unlike existing KB-QA systems, which treat semantic parsing and answer retrieval as two cascaded tasks, this paper presents a unified framework that integrates semantic parsing directly into the question answering procedure. Borrowing ideas from machine translation (MT), we treat the QA task as a translation procedure. Like MT, CYK parsing is used to parse each input question, and the answers of the span covered by each CYK cell are considered the translations of that cell; unlike MT, which uses offline-generated translation tables to translate source phrases into target translations, a semantic parsing-based question translation method is used to translate each span into its answers on-the-fly, based on question patterns and relation expressions. The final answers can be obtained from the root cell. Derivations generated during such a translation procedure are modeled by a linear model, and minimum error rate training (MERT) (Och, 2003) is used to tune feature weights based on a set of question-answer pairs.

Figure 1 shows an example: the question director of movie starred by Tom Hanks is translated to one of its answers, Robert Zemeckis, by three main steps: (i) translate director of to director of; (ii) translate movie starred by Tom Hanks to one of its answers, Forrest Gump; (iii) translate director of Forrest Gump to a final answer, Robert Zemeckis. Note that the updated question covered by Cell[0, 6] is obtained by combining the answers to the question spans covered by Cell[0, 1] and Cell[2, 6].

The contributions of this work are two-fold: (1) We propose a translation-based KB-QA method that integrates semantic parsing and QA in one unified framework. The benefit of our method is that we do not need to explicitly generate complete semantic structures for input questions. Besides, answers generated during the translation procedure help significantly with search space pruning. (2) We propose a robust method to transform single-relation questions into formal triple queries as their MRs, which trades off between transformation accuracy and recall by using question patterns and relation expressions respectively.
Figure 1: Translation-based KB-QA example. The question director of movie starred by Tom Hanks (Cell[0, 6]) is translated in three steps: (i) director of ⟹ director of (Cell[0, 1]); (ii) movie starred by Tom Hanks ⟹ Forrest Gump (Cell[2, 6]); (iii) director of Forrest Gump ⟹ Robert Zemeckis (Cell[0, 6]).

2 Translation-Based KB-QA

2.1 Overview

Formally, given a knowledge base KB and an NL question Q, our KB-QA method generates a set of formal-triples/answer pairs {<D, A>} as derivations, which are scored and ranked by the distribution P(<D, A> | KB, Q) defined as follows:

P(\langle D, A \rangle \mid KB, Q) = \frac{\exp\{\sum_{i=1}^{M} \lambda_i \cdot h_i(\langle D, A \rangle, KB, Q)\}}{\sum_{\langle D', A' \rangle \in H(Q)} \exp\{\sum_{i=1}^{M} \lambda_i \cdot h_i(\langle D', A' \rangle, KB, Q)\}}

- KB denotes a knowledge base that stores a set of assertions. Each assertion t ∈ KB is in the form {e_sbj^ID, p, e_obj^ID}, where p denotes a predicate, and e_sbj^ID and e_obj^ID denote the subject and object entities of t, with unique IDs.[1][2]

- H(Q) denotes the search space {<D, A>}. D is composed of a set of ordered formal triples {t_1, ..., t_n}. Each triple t = {e_sbj, p, e_obj}_i^j ∈ D denotes an assertion in KB, where i and j denote the beginning and end indexes of the question span from which t is transformed. The order of the triples in D denotes the order of the translation steps from Q to A. E.g., <director of, Null, director of>_0^1, <Tom Hanks, Film.Actor.Film, Forrest Gump>_2^6 and <Forrest Gump, Film.Film.Director, Robert Zemeckis>_0^6 are three ordered formal triples corresponding to the three translation steps in Figure 1. We define the task of transforming question spans into formal triples as question translation. A denotes one final answer of Q.

- h_i(·) denotes the i-th feature function.

- λ_i denotes the feature weight of h_i(·).

[1] We use a large scale knowledge base in this paper, which contains 2.3B entities, 5.5K predicates, and 18B assertions. A 16-machine cluster is used to host and serve the whole data.
[2] Each KB entity has a unique ID. For the sake of convenience, we omit the ID information in the rest of the paper.

According to the above description, our KB-QA method can be decomposed into four tasks: (1) search space generation for H(Q); (2) question translation for transforming question spans into their corresponding formal triples; (3) feature design for h_i(·); and (4) feature weight tuning for {λ_i}. We present the details of these four tasks in the following subsections one by one.

2.2 Search Space Generation

We first present our translation-based KB-QA method in Algorithm 1, which is used to generate H(Q) for each input NL question Q.

Algorithm 1: Translation-based KB-QA
 1  for l = 1 to |Q| do
 2    for all i, j s.t. j - i = l do
 3      H(Q_i^j) = ∅;
 4      T = QTrans(Q_i^j, KB);
 5      foreach formal triple t ∈ T do
 6        create a new derivation d;
 7        d.A = t.e_obj;
 8        d.D = {t};
 9        update the model score of d;
10        insert d to H(Q_i^j);
11      end
12    end
13  end
14  for l = 1 to |Q| do
15    for all i, j s.t. j - i = l do
16      for all m s.t. i <= m < j do
17        for d_l ∈ H(Q_i^m) and d_r ∈ H(Q_{m+1}^j) do
18          Q_update = d_l.A + d_r.A;
19          T = QTrans(Q_update, KB);
20          foreach formal triple t ∈ T do
21            create a new derivation d;
22            d.A = t.e_obj;
23            d.D = d_l.D ∪ d_r.D ∪ {t};
24            update the model score of d;
25            insert d to H(Q_i^j);
26          end
27        end
28      end
29    end
30  end
31  return H(Q).

The first half (from Line 1 to Line 13) generates a formal triple set T for each unary span Q_i^j ∈ Q, using the question translation method QTrans(Q_i^j, KB) (Line 4), which takes Q_i^j as input. Each triple t ∈ T returned is in the form {e_sbj, p, e_obj}, where e_sbj's mention occurs in Q_i^j, p is a predicate that denotes the meaning expressed by the context of e_sbj in Q_i^j, and e_obj is an answer to Q_i^j based on e_sbj, p and KB. We describe the implementation details of QTrans(·) in Section 2.3.

The second half (from Line 14 to Line 31) first updates the content of each bigger span Q_i^j by concatenating the answers to any two consecutive smaller spans covered by Q_i^j (Line 18). Then, QTrans(·, KB) is called to generate triples for the updated span (Line 19). The above operations are equivalent to answering a simplified question, which is obtained by replacing the answerable spans in the original question with their corresponding answers. The search space H(Q) for the entire question Q is returned at last (Line 31).
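The chart-filling procedure of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: Triple, Derivation, q_trans (standing in for QTrans(·) plus the KB) and score_fn (standing in for the linear model of Section 2.1) are hypothetical names, and the k-best pruning reflects the beam approximation described later in Section 4.2.

```python
# Minimal sketch of a CYK-style chart filled with k-best derivations per span.
from dataclasses import dataclass, field

@dataclass
class Triple:
    sbj: str
    pred: str            # None plays the role of the paper's Null predicate
    obj: str
    score: float = 1.0

@dataclass
class Derivation:
    answer: str
    triples: list = field(default_factory=list)
    model_score: float = 0.0

def generate_search_space(words, q_trans, score_fn, beam=20):
    """Fill a chart H[(i, j)] over word spans with k-best derivations."""
    n = len(words)
    chart = {}
    # First pass: translate every original span directly (Lines 1-13).
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length - 1
            cell = []
            for t in q_trans(" ".join(words[i:j + 1])):
                d = Derivation(answer=t.obj, triples=[t])
                d.model_score = score_fn(d)
                cell.append(d)
            chart[(i, j)] = cell
    # Second pass: combine adjacent spans, re-translate the simplified
    # question, and keep only the k-best derivations per cell (Lines 14-31).
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length - 1
            for m in range(i, j):
                for dl in chart[(i, m)]:
                    for dr in chart[(m + 1, j)]:
                        updated = dl.answer + " " + dr.answer
                        for t in q_trans(updated):
                            d = Derivation(answer=t.obj,
                                           triples=dl.triples + dr.triples + [t])
                            d.model_score = score_fn(d)
                            chart[(i, j)].append(d)
            chart[(i, j)] = sorted(chart[(i, j)],
                                   key=lambda d: d.model_score,
                                   reverse=True)[:beam]
    return chart[(0, n - 1)]
```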
2.3 Question Translation

The purpose of question translation is to translate a span Q to a set of formal triples T. Each triple t ∈ T is in the form {e_sbj, p, e_obj}, where e_sbj's mention[3] occurs in Q, p is a predicate that denotes the meaning expressed by the context of e_sbj in Q, and e_obj is an answer to Q retrieved from the KB using a triple query q = {e_sbj, p, ?}. Note that if no predicate p or answer e_obj can be generated, {Q, Null, Q} will be returned as a special triple, which sets e_obj to be Q itself and p to be Null. This makes sure that un-answerable spans can be passed on to the higher-level operations.

[3] For simplicity, a cleaned entity dictionary dumped from the entire KB is used to detect entity mentions in Q.

Question translation assumes each span is a single-relation question (Fader et al., 2013). Such an assumption simplifies the effort of semantic parsing to the minimum question units, while leaving the capability of handling multiple-relation questions (Figure 1 gives one such example) to the outer CYK-parsing-based translation procedure. Two question translation methods are presented in the rest of this subsection, based on question patterns and relation expressions respectively.

2.3.1 Question Pattern-based Translation

A question pattern QP includes a pattern string QP_pattern, which is composed of words and a slot symbol [Slot], and a KB predicate QP_predicate, which denotes the meaning expressed by the context words in QP_pattern.

Algorithm 2: QP-based Question Translation
 1  T = ∅;
 2  foreach entity mention e_Q ∈ Q do
 3    Q_pattern = replace e_Q in Q with [Slot];
 4    foreach question pattern QP do
 5      if Q_pattern == QP_pattern then
 6        E = Disambiguate(e_Q, QP_predicate);
 7        foreach e ∈ E do
 8          create a new triple query q;
 9          q = {e, QP_predicate, ?};
10          {A_i} = AnswerRetrieve(q, KB);
11          foreach A ∈ {A_i} do
12            create a new formal triple t;
13            t = {q.e_sbj, q.p, A};
14            t.score = 1.0;
15            insert t to T;
16          end
17        end
18      end
19    end
20  end
21  return T.

Algorithm 2 shows how to generate formal triples for a span Q based on question patterns (QP-based question translation). For each entity mention e_Q ∈ Q, we replace it with [Slot] and obtain a pattern string Q_pattern (Line 3). If Q_pattern matches one QP_pattern, then we construct a triple query q (Line 9) using QP_predicate as its predicate and one of the KB entities returned by Disambiguate(e_Q, QP_predicate) as its subject entity (Line 6). Here, the objective of Disambiguate(e_Q, QP_predicate) is to output a set of disambiguated KB entities E: the name of each entity returned equals the input entity mention e_Q, and the entity occurs in some assertions where QP_predicate is the predicate. The underlying idea is to use the context (predicate) information to help entity disambiguation. The answers of q are returned by AnswerRetrieve(q, KB) based on q and KB (Line 10), each of which is used to construct a formal triple that is added to T for Q (Lines 11 to 16). Figure 2 gives an example.

Figure 2: QP-based question translation example. Q: who is the director of Forrest Gump; QP_pattern: who is the director of [Slot]; QP_predicate: Film.Film.Director.

Question patterns are collected as follows. First, 5W queries, which begin with What, Where, Who, When, or Which, are selected from a large-scale query log of a commercial search engine. Then, a cleaned entity dictionary is used to annotate each query by replacing all entity mentions it contains with the symbol [Slot]; only high-frequency query patterns that contain one [Slot] are kept. Lastly, annotators manually label the most frequent 50,000 query patterns with their corresponding predicates, and 4,764 question patterns with a single labeled predicate are obtained.
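A compact sketch of this QP-based translation is given below. It is an illustration only, assuming in-memory stand-ins for the pattern table and the KB; the names qp_translate, patterns, kb_assertions, and detect_entity_mentions are hypothetical, and the toy data mirror Figure 2.

```python
# Minimal sketch of QP-based question translation (in the spirit of Algorithm 2).
def qp_translate(question, patterns, kb_assertions, detect_entity_mentions):
    """patterns: dict mapping a pattern string containing [Slot] to a KB predicate.
    kb_assertions: iterable of (subject, predicate, object) strings."""
    triples = []
    for mention in detect_entity_mentions(question):
        q_pattern = question.replace(mention, "[Slot]", 1)
        predicate = patterns.get(q_pattern)
        if predicate is None:
            continue                      # no exact pattern match
        # Disambiguate(e_Q, QP_predicate): keep KB entities whose name equals
        # the mention and that occur with the pattern's predicate.
        entities = {s for (s, p, o) in kb_assertions
                    if s == mention and p == predicate}
        for e in entities:
            # Triple query q = {e, QP_predicate, ?}; retrieve answers from KB.
            answers = [o for (s, p, o) in kb_assertions
                       if s == e and p == predicate]
            for a in answers:
                triples.append((e, predicate, a, 1.0))   # score fixed to 1.0
    return triples

# Toy usage (illustrative data only):
kb = [("Forrest Gump", "Film.Film.Director", "Robert Zemeckis")]
pats = {"who is the director of [Slot]": "Film.Film.Director"}
print(qp_translate("who is the director of Forrest Gump", pats, kb,
                   lambda q: ["Forrest Gump"]))
```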
From the experiments (Table 3 in Section 4.3) we can see that question pattern-based question translation can achieve high end-to-end accuracy. But as human effort is needed in the mining procedure, this method cannot easily be extended to large scale. Besides, different users often type questions with the same meaning in different NL expressions. For example, although the question Forrest Gump was directed by which moviemaker means the same as the question Q in Figure 2, no question pattern can cover it. We need an alternative way to alleviate this coverage issue.

2.3.2 Relation Expression-based Translation

Aiming to alleviate the coverage issue of the QP-based method, an alternative relation expression (RE)-based method is proposed, which is used when the QP-based method fails.

We define RE_p as a relation expression set for a given KB predicate p ∈ KB. Each relation expression RE ∈ RE_p includes an expression string RE_expression, which must contain at least one content word, and a weight RE_weight, which denotes the confidence that RE_expression can represent p's meaning in NL. For example, is the director of is one relation expression string for the predicate Film.Film.Director, which means it is usually used to express this relation (predicate) in NL.

Algorithm 3: RE-based Question Translation
 1  T = ∅;
 2  foreach entity mention e_Q ∈ Q do
 3    foreach e ∈ KB s.t. e.name == e_Q do
 4      foreach predicate p ∈ KB related to e do
 5        score = Sim(e_Q, Q, RE_p);
 6        if score > 0 then
 7          create a new triple query q;
 8          q = {e, p, ?};
 9          {A_i} = AnswerRetrieve(q, KB);
10          foreach A ∈ {A_i} do
11            create a new formal triple t;
12            t = {q.e_sbj, q.p, A};
13            t.score = score;
14            insert t to T;
15          end
16        end
17      end
18    end
19  end
20  sort T based on the score of each t ∈ T;
21  return T.

Algorithm 3 shows how to generate triples for a question Q based on relation expressions. For each possible entity mention e_Q ∈ Q and each KB predicate p ∈ KB that is related to a KB entity e whose name equals e_Q, Sim(e_Q, Q, RE_p) is computed (Line 5) based on the similarity between the question context and RE_p, which measures how likely Q can be transformed into a triple query q = {e, p, ?}. If this score is larger than 0, which means there are overlaps between Q's context and RE_p, then q is used as the triple query of Q, and a set of formal triples is generated based on q and KB (Lines 7 to 15). The computation of Sim(e_Q, Q, RE_p) is defined as follows:

Sim(e_Q, Q, RE_p) = \sum_{n} \frac{1}{|Q| - n + 1} \cdot \Big\{ \sum_{\omega_n \in Q,\ \omega_n \cap e_Q = \emptyset} P(\omega_n \mid RE_p) \Big\}

where n is the n-gram order, which ranges from 1 to 5, \omega_n is an n-gram occurring in Q that does not overlap with e_Q and contains at least one content word, and P(\omega_n \mid RE_p) is the posterior probability, computed as:

P(\omega_n \mid RE_p) = \frac{Count(\omega_n, RE_p)}{\sum_{\omega'_n \in RE_p} Count(\omega'_n, RE_p)}

Count(\omega, RE_p) denotes the weighted number of times that \omega occurs in RE_p:

Count(\omega, RE_p) = \sum_{RE \in RE_p} \{ \#_{\omega}(RE) \cdot RE_{weight} \}

where \#_{\omega}(RE) denotes the number of times that \omega occurs in RE_expression, and RE_weight is decided by the relation expression extraction component.
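The three formulas above can be put together in a short script. The sketch below is a simplified reading of them: relation_expressions is a hypothetical list of (expression string, weight) pairs for one predicate, and the content-word test is approximated with a small stopword list rather than a real POS-based check.

```python
# Minimal sketch of Sim(e_Q, Q, RE_p) and its Count/posterior components.
STOPWORDS = {"is", "was", "the", "of", "by", "which", "a", "an", "to", "in"}

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def occurrences(tokens, sub):
    # Number of times the token sequence `sub` occurs inside `tokens`.
    return sum(1 for i in range(len(tokens) - len(sub) + 1)
               if tokens[i:i + len(sub)] == sub)

def count(ngram, relation_expressions):
    # Count(w, RE_p) = sum over RE of #_w(RE) * RE_weight
    sub = ngram.split()
    return sum(occurrences(expr.split(), sub) * weight
               for expr, weight in relation_expressions)

def posterior(ngram, relation_expressions, n):
    # P(w_n | RE_p): Count normalized over all order-n n-grams seen in RE_p.
    vocab = set()
    for expr, _ in relation_expressions:
        vocab.update(ngrams(expr.split(), n))
    denom = sum(count(other, relation_expressions) for other in vocab)
    return count(ngram, relation_expressions) / denom if denom > 0 else 0.0

def sim(mention, question, relation_expressions, max_order=5):
    # Length-normalized sum of posteriors over question n-grams that avoid the
    # entity mention and contain at least one content word.
    q_tokens = question.split()
    mention_tokens = set(mention.split())
    total = 0.0
    for n in range(1, min(max_order, len(q_tokens)) + 1):
        grams = [g for g in ngrams(q_tokens, n)
                 if not (set(g.split()) & mention_tokens)
                 and any(t not in STOPWORDS for t in g.split())]
        total += sum(posterior(g, relation_expressions, n) for g in grams) \
                 / (len(q_tokens) - n + 1)
    return total
```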
Figure 3 gives an example, where the n-grams marked with rectangles are the ones that occur in both Q's context and the relation expression set of the given predicate p = Film.Film.Director. Unlike the QP-based method, which requires a perfect match, the RE-based method allows fuzzy matching between Q and RE_p, and records this score (Line 13) in the generated triples, where it is later used as a feature.

Figure 3: RE-based question translation example. Q: Forrest Gump was directed by which moviemaker; RE_Film.Film.Director includes: is directed by; was directed and written by; is the moviemaker of; was famous as the director of; ...

Relation expressions are mined as follows. Given the set of KB assertions with an identical predicate p, we first extract all sentences from English Wiki pages[4] that contain at least one pair of entities occurring in one assertion. Then, we extract the shortest path between the paired entities in the dependency tree of each sentence as an RE candidate for the given predicate. The intuition is that any sentence containing such an entity pair is likely to express the predicate of that assertion in some way. Last, all extracted relation expressions are filtered by heuristic rules (the frequency must be larger than 4, the length must be shorter than 10), and then weighted by the pattern scoring methods proposed in (Gerber and Ngomo, 2011; Gerber and Ngomo, 2012). For each predicate, we only keep the relation expressions whose pattern scores are larger than a pre-defined threshold. Figure 4 gives one relation expression extraction example. The statistics and overall quality of the relation expressions are listed in Section 4.1.

[4] http://en.wikipedia.org/wiki/Wikipedia:Database_download

Figure 4: RE extraction example for the predicate p = Film.Film.Director. Paired entities of the KB predicate: {Forrest Gump, Robert Zemeckis}, {Titanic, James Cameron}, {The Dark Knight Rises, Christopher Nolan}. Passages retrieved from Wiki pages: Robert Zemeckis is the director of Forrest Gump; James Cameron is the moviemaker of Titanic; The Dark Knight Rises is directed by Christopher Nolan. Weighted relation expressions: is the director of ||| 0.25; is the moviemaker of ||| 0.23; is directed by ||| 0.20.
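The mining step can be sketched as follows, with a deliberate simplification: instead of the shortest dependency path used in the paper, the surface tokens between the two entity mentions serve as the candidate expression, and the weight is a plain relative frequency rather than the pattern scores of Gerber and Ngomo. The function and argument names are hypothetical.

```python
# Rough sketch of relation-expression mining under simplifying assumptions.
from collections import Counter

def mine_relation_expressions(entity_pairs, sentences, min_freq=5, max_len=9):
    """entity_pairs: (subject, object) pairs of one KB predicate.
    sentences: iterable of raw sentences (e.g., from Wikipedia pages)."""
    candidates = Counter()
    for sent in sentences:
        for sbj, obj in entity_pairs:
            i, j = sent.find(sbj), sent.find(obj)
            if i < 0 or j < 0:
                continue
            # Crude surrogate for the shortest dependency path: the words
            # between the two entity mentions.
            middle = sent[i + len(sbj):j] if i < j else sent[j + len(obj):i]
            expr = " ".join(middle.split())
            if 0 < len(expr.split()) < max_len:
                candidates[expr] += 1
    # Frequency filter (frequency > 4), then weight by relative frequency.
    kept = {e: c for e, c in candidates.items() if c >= min_freq}
    total = sum(kept.values()) or 1
    return sorted(((e, c / total) for e, c in kept.items()),
                  key=lambda x: -x[1])
```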

2.3.3 Question Decomposition

Sometimes a question provides multiple constraints on its answers. movie starred by Tom Hanks in 1994 is one such question: all the films given as answers should satisfy the following two constraints: (1) starred by Tom Hanks; and (2) released in 1994. It is easy to see that such questions cannot be translated into single triples.

We propose a dependency tree-based method to handle such multiple-constraint questions by (i) decomposing the original question into a set of sub-questions using syntax-based patterns, and (ii) intersecting the answers of all sub-questions as the final answers of the original question (see the sketch below). Note that question decomposition only operates on the original question and on question spans covered by complete dependency subtrees. Four syntax-based patterns (Figure 5) are used for question decomposition. If a question matches any one of these patterns, then sub-questions are generated by collecting the paths between n_0 and each n_i (i > 0) in the pattern, where each n denotes a complete subtree with a noun, number, or question word as its root node, and the symbol * above prep* denotes that this preposition can be skipped in matching. For the question mentioned at the beginning, the two generated sub-questions are movie starred by Tom Hanks and movie starred in 1994, as its dependency form matches pattern (a). Similar ideas are used in IBM Watson (Kalyanpur et al., 2012) as well.

Figure 5: Four syntax-based patterns (a)-(d) for question decomposition, each built from a verb node, noun/number/question-word subtrees n_0, n_1, ..., n_k, and (skippable) prepositions prep*.
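The decompose-then-intersect idea can be sketched as below. This is a minimal illustration: decompose stands in for the dependency-pattern matcher of Figure 5 (not implemented here), answer_single_relation stands in for single-relation question translation, and the toy data are illustrative only.

```python
# Minimal sketch of answering a multiple-constraint question by intersecting
# the answers of its sub-questions.
def answer_multi_constraint(question, decompose, answer_single_relation):
    """Fall back to single-relation answering if no pattern applies."""
    sub_questions = decompose(question)
    if not sub_questions:
        return answer_single_relation(question)
    answers = None
    for sub in sub_questions:
        sub_answers = set(answer_single_relation(sub))
        answers = sub_answers if answers is None else answers & sub_answers
    return answers

# Toy usage with hand-written stubs for the example in the text:
subs = {"movie starred by Tom Hanks in 1994":
        ["movie starred by Tom Hanks", "movie starred in 1994"]}
facts = {"movie starred by Tom Hanks": {"Forrest Gump", "Apollo 13"},
         "movie starred in 1994": {"Forrest Gump", "Pulp Fiction"}}
print(answer_multi_constraint("movie starred by Tom Hanks in 1994",
                              lambda q: subs.get(q, []),
                              lambda q: facts.get(q, set())))
# -> {'Forrest Gump'}
```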

As dependency parsing is not perfect, we also generate single triples for such questions without considering the constraints, and add them to the search space for competition. The feature h_syntax_constraint(·) is used to boost triples that are converted from sub-questions generated by question decomposition: the more constraints an answer satisfies, the better. Obviously, the patterns currently used cannot cover all cases, only the most common ones. We leave a more general pattern mining method for future work.

2.4 Feature Design

The objective of our KB-QA system is to seek the derivation <D̂, Â> that maximizes the probability P(<D, A> | KB, Q) described in Section 2.1:

\langle \hat{D}, \hat{A} \rangle = \arg\max_{\langle D, A \rangle \in H(Q)} P(\langle D, A \rangle \mid KB, Q) = \arg\max_{\langle D, A \rangle \in H(Q)} \sum_{i=1}^{M} \lambda_i \cdot h_i(\langle D, A \rangle, KB, Q)

We now introduce the feature set {h_i(·)} used in the above linear model:

- h_question_word(·), which counts the number of original question words occurring in A. It penalizes partially answered questions.

- h_span(·), which counts the number of spans in Q that are converted to formal triples. It controls the granularity of the spans used in question translation.

- h_syntax_subtree(·), which counts the number of spans in Q that are (1) converted to formal triples whose predicates are not Null, and (2) covered by complete dependency subtrees at the same time. The underlying intuition is that dependency subtrees of Q should be treated as units for question translation.

- h_syntax_constraint(·), which counts the number of triples in D that are converted from sub-questions generated by the question decomposition component.

- h_triple(·), which counts the number of triples in D whose predicates are not Null.

- h_tripleweight(·), which sums the scores of all triples {t_i} in D as \sum_{t_i \in D} t_i.score.

- h_QPcount(·), which counts the number of triples in D that are generated by the QP-based question translation method.

- h_REcount(·), which counts the number of triples in D that are generated by the RE-based question translation method.

- h_staticrank_sbj(·), which sums the static rank scores of all subject entities in D's triple set as \sum_{t_i \in D} t_i.e_sbj.static_rank.

- h_staticrank_obj(·), which sums the static rank scores of all object entities in D's triple set as \sum_{t_i \in D} t_i.e_obj.static_rank.

- h_confidence_obj(·), which sums the confidence scores of all object entities in D's triple set as \sum_{t_i \in D} t_i.e_obj.confidence.

For each assertion {e_sbj, p, e_obj} stored in the KB, e_sbj.static_rank and e_obj.static_rank denote the static rank scores of e_sbj and e_obj respectively,[5] and e_obj.confidence_rank represents the probability p(e_obj | e_sbj, p). These three scores are used as features to rank the answers generated in the QA procedure.

[5] The static rank score of an entity represents a general indicator of the overall quality of that entity.

2.5 Feature Weight Tuning

Given a set of question-answer pairs {Q_i, A_i^ref} as the development (dev) set, we use the minimum error rate training (MERT) (Och, 2003) algorithm to tune the feature weights λ_1^M of our proposed model. The training criterion is to seek the feature weights that minimize the accumulated errors of the top-1 answers of the questions in the dev set:

\hat{\lambda}_1^M = \arg\min_{\lambda_1^M} \sum_{i=1}^{N} Err(A_i^{ref}, \hat{A}_i; \lambda_1^M)

N is the number of questions in the dev set, A_i^ref is the set of correct answers (references) of the i-th question in the dev set, \hat{A}_i is the top-1 answer candidate of the i-th question in the dev set based on the feature weights \lambda_1^M, and Err(·) is the error function, defined as:

Err(A_i^{ref}, \hat{A}_i; \lambda_1^M) = 1 - \delta(A_i^{ref}, \hat{A}_i)

where \delta(A_i^{ref}, \hat{A}_i) is an indicator function which equals 1 when \hat{A}_i is included in the reference set A_i^{ref}, and 0 otherwise.
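The quantity that MERT minimizes is just the accumulated top-1 error over the dev set. The sketch below illustrates that error computation only; the actual line-search of Och (2003) is not shown, and decode_top1 is a hypothetical function returning the top-1 answer of a question under a given weight vector.

```python
# Minimal sketch of the accumulated error minimized in Section 2.5.
def accumulated_error(dev_set, lam, decode_top1):
    """dev_set: list of (question, reference_answer_set) pairs.
    Err(A_ref, A_hat) = 1 - delta(A_ref, A_hat), where delta is 1 iff the
    top-1 answer is contained in the reference set."""
    errors = 0
    for question, reference_answers in dev_set:
        top1 = decode_top1(question, lam)
        errors += 0 if top1 in reference_answers else 1
    return errors

# MERT then searches for the weight vector `lam` minimizing this count.
```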
3 Comparison with Previous Work

Our work intersects with two research directions: semantic parsing and question answering.

Some previous works on semantic parsing (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Wong and Mooney, 2006; Zettlemoyer and Collins, 2007; Wong and Mooney, 2007; Kwiatkowski et al., 2010; Kwiatkowski et al., 2011) require manually annotated logical forms as supervision, and it is hard to extend the resulting parsers from limited domains, such as GEO, JOBS and ATIS, to open domains. Recent works (Clarke and Lapata, 2010; Liang et al., 2013) have alleviated such issues by using question-answer pairs as weak supervision, but still with the shortcoming of using limited lexical triggers to link NL phrases to predicates. Poon (2013) has proposed an unsupervised method that adopts grounded learning to leverage the database for indirect supervision, but the transformation from NL questions to MRs heavily depends on dependency parsing results, and the KB used (ATIS) is limited as well. Kwiatkowski et al. (2013) use Wiktionary and a limited manual lexicon to map POS tags to a set of predefined CCG lexical categories, which aims to reduce the need for learning a lexicon from training data; but it still needs human effort to define the lexical categories, which usually cannot cover all semantic phenomena.

Berant et al. (2013) have not only enlarged the KB used to Freebase (Google, 2013), but also used a bigger lexicon trigger set, extracted by the open IE method (Lin et al., 2012), for linking NL phrases to predicates. In comparison, our method has further advantages: (1) question answering and semantic parsing are performed in a joint way under a unified framework; (2) a robust method is proposed to map NL questions to their formal triple queries, which trades off the mapping quality by using question patterns and relation expressions in a cascaded way; and (3) we use a domain-independent feature set, which allows us to use a relatively small number of question-answer pairs to tune the model parameters.

Fader et al. (2013) map questions to formal (triple) queries over a large scale, open-domain database of facts extracted from a raw corpus by ReVerb (Fader et al., 2011). Compared to their work, our method gains an improvement in two aspects: (1) instead of using facts extracted by the open IE method, we leverage a large scale, high-quality knowledge base; (2) we can handle multiple-relation questions, instead of single-relation queries only, based on our translation-based KB-QA framework.

Espana-Bonet and Comas (2012) have proposed an MT-based method for factoid QA, but MT in their work means translating questions into n-best translations, which are used for finding similar sentences in the document collection that probably contain answers. Echihabi and Marcu (2003) have developed a noisy-channel model for QA, which explains how a sentence containing an answer to a given question can be rewritten into that question through a sequence of stochastic operations. Compared to these two MT-motivated QA works, our method uses MT methodology to translate questions to answers directly.

4 Experiment

4.1 Data Sets

Following Berant et al. (2013), we use the same subset of WEBQUESTIONS (3,778 questions) as the development set (Dev) for weight tuning in MERT, and use the other part of WEBQUESTIONS (2,032 questions) as the test set (Test). Table 1 shows the statistics of this data set.

Data Set        # Questions    # Words
WEBQUESTIONS    5,810          6.7

Table 1: Statistics of the evaluation set. # Questions is the number of questions in the data set; # Words is the average word count of a question.

Table 2 shows the statistics of the question patterns and relation expressions used in our KB-QA system. As all question patterns are collected with human involvement, as discussed in Section 2.3.1, their quality is very high (98%). We also sample 1,000 instances from the whole relation expression set and manually label their quality. The accuracy is around 89%. These two resources can cover 566 head predicates in our KB.

                        # Entries    Accuracy
Question Patterns       4,764        98%
Relation Expressions    133,445      89%

Table 2: Statistics of question patterns and relation expressions.

4.2 KB-QA Systems

Since Berant et al. (2013) is one of the latest works that reports QA results based on a large scale, general domain knowledge base (Freebase), we consider their evaluation result on WEBQUESTIONS as our baseline.

Our KB-QA system generates the k-best derivations for each question span, where k is set to 20. The answers with the highest model scores are considered the best answers for evaluation. For evaluation, we follow Berant et al. (2013) to allow partial credit and score an answer using the F1 measure, comparing the predicted set of entities to the annotated set of entities.

One difference between the two systems is the KB used. Since Freebase is completely contained in our KB, we disallow all entities that are not included in Freebase. By doing so, our KB provides the same knowledge as Freebase does, which means we do not gain any extra advantage by using a larger KB. But we still allow ourselves to use the static rank scores and confidence scores of entities as features, as described in Section 2.4.
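For reference, the set-based F1 with partial credit used for scoring a single answer can be written as a few lines; this small sketch is an illustration of the standard measure rather than the baseline's exact evaluation script.

```python
# Minimal sketch of set-based F1 between predicted and annotated entity sets.
def answer_f1(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```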
4.3 Evaluation Results

We first show the overall evaluation results of our KB-QA system and compare them with the baseline's results on Dev and Test. Note that we do not re-implement the baseline system, but just list the evaluation numbers reported in their paper. The comparison results are listed in Table 3.

              Dev (Accuracy)    Test (Accuracy)
Baseline      32.9%             31.4%
Our Method    42.5% (+9.6%)     37.5% (+6.1%)

Table 3: Accuracy on the evaluation sets. Accuracy is defined as the number of correctly answered questions divided by the total number of questions.

Table 3 shows that our KB-QA method outperforms the baseline on both Dev and Test. We think the potential reasons for this improvement include:

- Different methods are used to map NL phrases to KB predicates. Berant et al. (2013) use a lexicon extracted from a subset of ReVerb triples (Lin et al., 2012), which is similar to the relation expression set used in question translation. But as our relation expressions are extracted by an in-house extractor, we can record their extraction-related statistics as extra information and use them as features to measure the mapping quality. Besides, as a portion of the entities in our KB are extracted from Wiki, we know the one-to-one correspondence between such entities and Wiki pages, and use this information in relation expression extraction for entity disambiguation. A lower disambiguation error rate results in better relation expressions.

- Question patterns are used to map NL context to KB predicates. The context can be either continuous or discontinuous phrases. Although the size of this set is limited, the patterns can actually cover head questions/queries[6] very well. The underlying intuition of using patterns is that such high-frequency questions/queries should and can be treated and solved in the QA task by involving human effort at a relatively small price but with very impressive accuracy.

[6] Head questions/queries mean the questions/queries with high frequency and clear patterns.

In order to figure out the impacts of question patterns and relation expressions, another experiment (Table 4) is designed to evaluate their independent influences, where QP_only and RE_only denote the results of KB-QA systems that only allow question patterns or relation expressions, respectively, in question translation.

Settings    Test (Accuracy)    Test (Precision)
QP_only     11.8%              97.5%
RE_only     32.5%              73.2%

Table 4: Impacts of question patterns and relation expressions. Precision is defined as the number of correctly answered questions divided by the number of questions with non-empty answers generated by our KB-QA system.

From Table 4 we can see that the accuracy of RE_only on Test (32.5%) is slightly better than the baseline's result (31.4%). We think this improvement comes from two aspects: (1) the quality of the relation expressions is better than the quality of the lexicon entries used in the baseline; and (2) we use the extraction-related statistics of relation expressions as features, which brings more information to measure the confidence of the mapping between NL phrases and KB predicates, and makes the model more flexible. Meanwhile, QP_only performs worse (11.8%) than RE_only, due to the coverage issue. But by comparing the precisions of these two settings, we find QP_only (97.5%) outperforms RE_only (73.2%) significantly, due to its high quality. This means that how to extract high-quality question patterns is worth studying for the question answering task.

As the performance of our KB-QA system relies heavily on the k-best beam approximation, we evaluate the impact of the beam size and list the comparison results in Figure 6. We can see that as we increase k incrementally, the accuracy increases at the same time. However, a larger k (e.g., 200) cannot bring significant improvements compared to a smaller one (e.g., 20), while it has a tremendous impact on system efficiency. So we choose k = 20 as the optimal value in the above experiments, which trades off between accuracy and efficiency.

Figure 6: Impacts of beam size on accuracy (Accuracy on Test for k = 5, 20, 50, 100, 200).

Actually, the size of our system's search space is much smaller than that of the semantic parser used in the baseline. This is due to the fact that, if the triple queries generated by the question translation component cannot derive any answer from the KB, we discard such triple queries directly during the QA procedure. We can also see that using a small k we achieve better results than the baseline, whose beam size is set to 200.

4.4 Error Analysis

4.4.1 Entity Detection

Since named entity recognizers trained on the Penn TreeBank usually perform poorly on web queries, we instead use a simple string-match method to detect entity mentions in the question, using a cleaned entity dictionary dumped from our KB. One problem with doing so is the entity detection issue. For example, in the question who was Esther's husband?, we cannot detect Esther as an entity, as it is just part of an entity name. We need an ad-hoc entity detection component to handle such issues, especially for a web scenario, where users often type entity names in partial or abbreviated forms.

4.4.2 Predicate Mapping

Some questions lack sufficient evidence for detecting predicates. where is Byron Nelson 2012? is an example. Since each relation expression must contain at least one content word, this question cannot match any relation expression: except for Byron Nelson and 2012, all the other words are non-content words.

Besides, ambiguous entries contained in the relation expression sets of different predicates can bring mapping errors as well. Take the question who did Steve Spurrier play pro football for? as an example: since the unigram play exists in the relation expression sets of both Film.Film.Actor and American Football.Player.Current Team, we made a wrong prediction, which led to wrong answers.

4.4.3 Specific Questions

Sometimes we cannot give exact answers to superlative questions like what is the first book Sherlock Holmes appeared in?. For this example, we can give all the book names in which Sherlock Holmes appeared, but we cannot rank them by publication date, as we cannot automatically learn from training data the alignment between the constraint word first in the question and the predicate Book.Written Work.Date Of First Publication. Although we have followed some work (Poon, 2013; Liang et al., 2013) in handling such special linguistic phenomena by defining specific operators, it is still hard to cover all unseen cases. We leave this to future work as an independent topic.

5 Conclusion and Future Work

This paper presents a translation-based KB-QA method that integrates semantic parsing and QA in one unified framework. Compared to a baseline system using an independent semantic parser with state-of-the-art performance, we achieve better results on a general domain evaluation set.

Several directions can be further explored in the future: (i) we plan to design a method that can extract question patterns automatically, using existing labeled question patterns and the KB as weak supervision; as discussed in the experiment section, how to mine high-quality question patterns is worth further study for the QA task; (ii) we plan to integrate an ad-hoc NER into our KB-QA system to alleviate the entity detection issue; (iii) in fact, our proposed QA framework can be generalized to other sources of intelligence besides knowledge bases. Any method that can generate answers to questions, such as a Web-based QA approach, can be integrated into this framework by using it in the question translation component.

References

Yoav Artzi and Luke S. Zettlemoyer. 2011. Bootstrapping semantic parsers from conversations. In EMNLP, pages 421-432.

Yoav Artzi, Nicholas FitzGerald, and Luke S. Zettlemoyer. 2013. Semantic parsing with combinatory categorial grammars. In ACL (Tutorial Abstracts), page 2.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In EMNLP, pages 1533-1544.

Qingqing Cai and Alexander Yates. 2013. Large-scale semantic parsing via schema matching and lexicon extension. In ACL, pages 423-433.

James Clarke and Mirella Lapata. 2010. Discourse constraints for document compression. Computational Linguistics, 36(3):411-441.

Abdessamad Echihabi and Daniel Marcu. 2003. A noisy-channel approach to question answering. In ACL.

Cristina Espana-Bonet and Pere R. Comas. 2012. Full machine translation for factoid question answering. In EACL, pages 20-29.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In EMNLP, pages 1535-1545.

Anthony Fader, Luke S. Zettlemoyer, and Oren Etzioni. 2013. Paraphrase-driven learning for open question answering. In ACL, pages 1608-1618.

Daniel Gerber and Axel-Cyrille Ngonga Ngomo. 2011. Bootstrapping the linked data web. In ISWC.

Daniel Gerber and Axel-Cyrille Ngonga Ngomo. 2012. Extracting multilingual natural-language patterns for RDF predicates. In ESWC.

Google. 2013. Freebase. http://www.freebase.com.

Aditya Kalyanpur, Siddharth Patwardhan, Branimir Boguraev, Adam Lally, and Jennifer Chu-Carroll. 2012. Fact-based question decomposition in DeepQA. IBM Journal of Research and Development, 56(3):13.

Tom Kwiatkowski, Luke S. Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2010. Inducing probabilistic CCG grammars from logical form with higher-order unification. In EMNLP, pages 1223-1233.

Tom Kwiatkowski, Luke S. Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2011. Lexical generalization in CCG grammar induction for semantic parsing. In EMNLP, pages 1512-1523.

Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke S. Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In EMNLP, pages 1545-1556.

Percy Liang, Michael I. Jordan, and Dan Klein. 2011. Learning dependency-based compositional semantics. In ACL, pages 590-599.

Percy Liang, Michael I. Jordan, and Dan Klein. 2013. Learning dependency-based compositional semantics. Computational Linguistics, 39(2):389-446.

Thomas Lin, Mausam, and Oren Etzioni. 2012. Entity linking at web scale. In AKBC-WEKEX, pages 84-88.

Raymond J. Mooney. 2007. Learning for semantic parsing. In CICLing, pages 311-324.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In ACL, pages 160-167.

Hoifung Poon. 2013. Grounded unsupervised semantic parsing. In ACL, pages 933-943.

Yuk Wah Wong and Raymond J. Mooney. 2006. Learning for semantic parsing with statistical machine translation. In HLT-NAACL.

Yuk Wah Wong and Raymond J. Mooney. 2007. Learning synchronous grammars for semantic parsing with lambda calculus. In ACL.

John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In AAAI/IAAI, Vol. 2, pages 1050-1055.

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI, pages 658-666.

Luke S. Zettlemoyer and Michael Collins. 2007. Online learning of relaxed CCG grammars for parsing to logical form. In EMNLP-CoNLL, pages 678-687.