
Knowledge-Based Question Answering as Machine Translation

Junwei Bao†∗, Nan Duan‡, Ming Zhou‡, Tiejun Zhao†
†Harbin Institute of Technology  ‡Microsoft Research
[email protected]  {nanduan, mingzhou}@microsoft.com  [email protected]

Abstract

A typical knowledge-based question answering (KB-QA) system faces two challenges: one is to transform natural language questions into their meaning representations (MRs); the other is to retrieve answers from knowledge bases (KBs) using generated MRs. Unlike previous methods which treat them in a cascaded manner, we present a translation-based approach to solve these two tasks in one unified framework. We translate questions to answers based on CYK parsing. Answers, as translations of the span covered by each CYK cell, are obtained by a question translation method, which first generates formal triple queries as MRs for the span based on question patterns and relation expressions, and then retrieves answers from a given KB based on the generated triple queries. A linear model is defined over derivations, and minimum error rate training is used to tune feature weights based on a set of question-answer pairs. Compared to a KB-QA system using a state-of-the-art semantic parser, our method achieves better results.

1 Introduction

Knowledge-based question answering (KB-QA) computes answers to natural language (NL) questions based on existing knowledge bases (KBs). Most previous systems tackle this task in a cascaded manner: first, the input question is transformed into its meaning representation (MR) by an independent semantic parser (Zettlemoyer and Collins, 2005; Mooney, 2007; Artzi and Zettlemoyer, 2011; Liang et al., 2011; Cai and Yates, 2013; Poon, 2013; Artzi et al., 2013; Kwiatkowski et al., 2013; Berant et al., 2013); then, the answers are retrieved from existing KBs using the generated MRs as queries.

Unlike existing KB-QA systems, which treat semantic parsing and answer retrieval as two cascaded tasks, this paper presents a unified framework that integrates semantic parsing into the question answering procedure directly. Borrowing ideas from machine translation (MT), we treat the QA task as a translation procedure. Like MT, CYK parsing is used to parse each input question, and the answers of the span covered by each CYK cell are considered the translations of that cell; unlike MT, which uses offline-generated translation tables to translate source phrases into target translations, a semantic parsing-based question translation method is used to translate each span into its answers on-the-fly, based on question patterns and relation expressions. The final answers can be obtained from the root cell. Derivations generated during such a translation procedure are modeled by a linear model, and minimum error rate training (MERT) (Och, 2003) is used to tune feature weights based on a set of question-answer pairs.

Figure 1 shows an example: the question "director of movie starred by Tom Hanks" is translated to one of its answers, Robert Zemeckis, by three main steps: (i) translate "director of" to "director of"; (ii) translate "movie starred by Tom Hanks" to one of its answers, Forrest Gump; (iii) translate "director of Forrest Gump" to a final answer, Robert Zemeckis. Note that the updated question covered by Cell[0, 6] is obtained by combining the answers to the question spans covered by Cell[0, 1] and Cell[2, 6].

The contributions of this work are two-fold: (1) We propose a translation-based KB-QA method that integrates semantic parsing and QA in one unified framework. The benefit of our method is that we do not need to explicitly generate complete semantic structures for input questions; besides, answers generated during the translation procedure help significantly with search space pruning. (2) We propose a robust method to transform single-relation questions into formal triple queries as their MRs, which trades off between transformation accuracy and recall by using question patterns and relation expressions, respectively.

∗ This work was finished while the author was visiting Microsoft Research Asia.


[Figure 1: Translation-based KB-QA example. The question "director of movie starred by Tom Hanks" is covered by Cell[0, 1], Cell[2, 6], and Cell[0, 6], whose spans are translated step by step into the final answer Robert Zemeckis.]

2 Translation-Based KB-QA

2.1 Overview

Formally, given a knowledge base KB and an NL question Q, our KB-QA method generates a set of formal-triples/answer pairs {⟨D, A⟩} as derivations, which are scored and ranked by the distribution P(⟨D, A⟩ | KB, Q) defined as follows:

    P(⟨D, A⟩ | KB, Q) = exp{ Σ_{i=1}^{M} λ_i · h_i(⟨D, A⟩, KB, Q) } / Σ_{⟨D', A'⟩ ∈ H(Q)} exp{ Σ_{i=1}^{M} λ_i · h_i(⟨D', A'⟩, KB, Q) }

• KB denotes a knowledge base [1] that stores a set of assertions. Each assertion t ∈ KB is in the form of {e_sbj^ID, p, e_obj^ID}, where p denotes a predicate, and e_sbj and e_obj denote the subject and object entities of t, with unique IDs [2].

• H(Q) denotes the search space {⟨D, A⟩}. D is composed of a set of ordered formal triples {t_1, ..., t_n}. Each triple t = {e_sbj, p, e_obj}_i^j ∈ D denotes an assertion in KB, where i and j denote the beginning and end indexes of the question span from which t is transformed. The order of triples in D denotes the order of translation steps from Q to A. E.g., ⟨director of, Null, director of⟩_0^1, ⟨Tom Hanks, Film.Actor.Film, Forrest Gump⟩_2^6, and ⟨Forrest Gump, Film.Film.Director, Robert Zemeckis⟩_0^6 are three ordered formal triples corresponding to the three translation steps in Figure 1. We define the task of transforming question spans into formal triples as question translation.

• A denotes one final answer of Q.

• h_i(·) denotes the i-th feature function.

• λ_i denotes the feature weight of h_i(·).

According to the above description, our KB-QA method can be decomposed into four tasks: (1) search space generation for H(Q); (2) question translation for transforming question spans into their corresponding formal triples; (3) feature design for h_i(·); and (4) feature weight tuning for {λ_i}. We present details of these four tasks in the following subsections one by one.

[1] We use a large scale knowledge base in this paper, which contains 2.3B entities, 5.5K predicates, and 18B assertions. A 16-machine cluster is used to host and serve the whole data.
[2] Each KB entity has a unique ID. For the sake of convenience, we omit the ID information in the rest of the paper.

2.2 Search Space Generation

We first present our translation-based KB-QA method in Algorithm 1, which is used to generate H(Q) for each input NL question Q.

Algorithm 1: Translation-based KB-QA
 1: for l = 1 to |Q| do
 2:   for all i, j s.t. j - i = l do
 3:     H(Q_i^j) = ∅
 4:     T = QTrans(Q_i^j, KB)
 5:     foreach formal triple t ∈ T do
 6:       create a new derivation d
 7:       d.A = t.e_obj
 8:       d.D = {t}
 9:       update the model score of d
10:       insert d into H(Q_i^j)
11:     end
12:   end
13: end
14: for l = 1 to |Q| do
15:   for all i, j s.t. j - i = l do
16:     for all m s.t. i ≤ m < j do
17:       for d_l ∈ H(Q_i^m) and d_r ∈ H(Q_{m+1}^j) do
18:         Q_update = d_l.A + d_r.A
19:         T = QTrans(Q_update, KB)
20:         foreach formal triple t ∈ T do
21:           create a new derivation d
22:           d.A = t.e_obj
23:           d.D = d_l.D ∪ d_r.D ∪ {t}
24:           update the model score of d
25:           insert d into H(Q_i^j)
26:         end
27:       end
28:     end
29:   end
30: end
31: return H(Q)
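To make the two passes of Algorithm 1 concrete, the following is a minimal Python sketch under stated assumptions: Derivation, q_trans, and score_fn are hypothetical names; q_trans(span) is assumed to behave like QTrans(·), returning (e_sbj, p, e_obj) triples for a span (and the span itself with a Null predicate when the span cannot be answered), and score_fn stands in for the linear model score.

```python
from dataclasses import dataclass, field

@dataclass
class Derivation:
    answer: str                                    # d.A: answer produced so far
    triples: list = field(default_factory=list)    # d.D: ordered formal triples
    score: float = 0.0                             # linear model score

def generate_search_space(question_tokens, q_trans, score_fn):
    """CYK-style search-space generation in the spirit of Algorithm 1."""
    n = len(question_tokens)
    chart = {}  # chart[(i, j)]: derivations for the span covering tokens i..j

    # First pass (Lines 1-13): translate every span directly.
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            chart[(i, j)] = []
            span_text = " ".join(question_tokens[i:j + 1])
            for (e_sbj, p, e_obj) in q_trans(span_text):
                d = Derivation(answer=e_obj, triples=[(e_sbj, p, e_obj)])
                d.score = score_fn(d)
                chart[(i, j)].append(d)

    # Second pass (Lines 14-31): combine the answers of two adjacent spans
    # into an updated (simplified) question and translate it again.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for m in range(i, j):
                for dl in chart[(i, m)]:
                    for dr in chart[(m + 1, j)]:
                        updated = dl.answer + " " + dr.answer
                        for (e_sbj, p, e_obj) in q_trans(updated):
                            d = Derivation(
                                answer=e_obj,
                                triples=dl.triples + dr.triples + [(e_sbj, p, e_obj)])
                            d.score = score_fn(d)
                            chart[(i, j)].append(d)

    # The root cell holds derivations for the whole question.
    return chart[(0, n - 1)]
```

The walkthrough below refers to Algorithm 1's line numbers; the sketch mirrors its first pass over unary spans and its second pass that re-translates simplified questions built from the answers of adjacent spans.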

The first half (from Line 1 to Line 13) generates a formal triple set T for each unary span Q_i^j ⊆ Q, using the question translation method QTrans(Q_i^j, KB) (Line 4), which takes Q_i^j and KB as its input. Each triple t ∈ T returned is in the form of {e_sbj, p, e_obj}, where e_sbj's mention occurs in Q_i^j, p is a predicate that denotes the meaning expressed by the context of e_sbj in Q_i^j, and e_obj is an answer of Q_i^j based on e_sbj, p and KB. We describe the implementation details of QTrans(·) in Section 2.3.

The second half (from Line 14 to Line 31) first updates the content of each bigger span Q_i^j by concatenating the answers to any two of its consecutive smaller spans (Line 18). Then, QTrans(Q_i^j, KB) is called to generate triples for the updated span (Line 19). The above operations are equivalent to answering a simplified question, which is obtained by replacing the answerable spans in the original question with their corresponding answers. The search space H(Q) for the entire question Q is returned at last (Line 31).

2.3 Question Translation

The purpose of question translation is to translate a span Q into a set of formal triples T. Each triple t ∈ T is in the form of {e_sbj, p, e_obj}, where e_sbj's mention [3] occurs in Q, p is a predicate that denotes the meaning expressed by the context of e_sbj in Q, and e_obj is an answer to Q retrieved from KB using a triple query q = {e_sbj, p, ?}. Note that if no predicate p or answer e_obj can be generated, {Q, Null, Q} will be returned as a special triple, which sets e_obj to be Q itself and p to be Null. This makes sure that un-answerable spans can be passed on to the higher-level operations.

Question translation assumes each span Q is a single-relation question (Fader et al., 2013). Such an assumption simplifies the effort of semantic parsing to the minimum question units, while leaving the capability of handling multiple-relation questions (Figure 1 gives one such example) to the outer CYK-parsing based translation procedure. Two question translation methods are presented in the rest of this subsection, which are based on question patterns and relation expressions respectively.

[3] For simplicity, a cleaned entity dictionary dumped from the entire KB is used to detect entity mentions in Q.

2.3.1 Question Pattern-based Translation

A question pattern QP includes a pattern string QP_pattern, which is composed of words and a slot symbol [Slot], and a KB predicate QP_predicate, which denotes the meaning expressed by the context words in QP_pattern.

Algorithm 2: QP-based Question Translation
 1: T = ∅
 2: foreach entity mention e_Q ∈ Q do
 3:   Q_pattern = replace e_Q in Q with [Slot]
 4:   foreach question pattern QP do
 5:     if Q_pattern == QP_pattern then
 6:       E = Disambiguate(e_Q, QP_predicate)
 7:       foreach e ∈ E do
 8:         create a new triple query q
 9:         q = {e, QP_predicate, ?}
10:         {A_i} = AnswerRetrieve(q, KB)
11:         foreach A_i ∈ {A_i} do
12:           create a new formal triple t
13:           t = {q.e_sbj, q.p, A_i}
14:           t.score = 1.0
15:           insert t into T
16:         end
17:       end
18:     end
19:   end
20: end
21: return T

Algorithm 2 shows how to generate formal triples for a span Q based on question patterns (QP-based question translation). For each entity mention e_Q ∈ Q, we replace it with [Slot] and obtain a pattern string Q_pattern (Line 3). If Q_pattern matches one QP_pattern, then we construct a triple query q (Line 9) using QP_predicate as its predicate and one of the KB entities returned by Disambiguate(e_Q, QP_predicate) as its subject entity (Line 6). Here, the objective of Disambiguate(e_Q, QP_predicate) is to output a set E of disambiguated KB entities: the name of each entity returned equals the input entity mention e_Q, and the entity occurs in some assertions whose predicate is QP_predicate. The underlying idea is to use the context (predicate) information to help entity disambiguation. The answers of q are returned by AnswerRetrieve(q, KB) based on q and KB (Line 10), each of which is used to construct a formal triple and added to T for Q (from Line 11 to Line 16). Figure 2 gives an example.
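As a concrete reference, here is a minimal Python sketch of the QP-based translation in Algorithm 2, under stated assumptions: question_patterns (a dict from pattern string to KB predicate), disambiguate, and answer_retrieve are hypothetical stand-ins for the resources and components described above.

```python
def qp_translate(question, entity_mentions, question_patterns,
                 disambiguate, answer_retrieve):
    """QP-based question translation, in the spirit of Algorithm 2.

    question          -- the span text to translate
    entity_mentions   -- entity mention strings detected in the span
    question_patterns -- dict: pattern string with one [Slot] -> KB predicate
    disambiguate      -- disambiguate(mention, predicate) -> KB entities whose
                         name equals the mention and which occur in assertions
                         with that predicate
    answer_retrieve   -- answer_retrieve((entity, predicate, '?')) -> answers
    """
    triples = []
    for mention in entity_mentions:
        # Replace the mention with [Slot] to form a candidate pattern string.
        pattern = question.replace(mention, "[Slot]")
        predicate = question_patterns.get(pattern)
        if predicate is None:
            continue  # QP-based translation requires an exact pattern match
        for entity in disambiguate(mention, predicate):
            query = (entity, predicate, "?")
            for answer in answer_retrieve(query):
                triples.append({"sbj": entity, "p": predicate,
                                "obj": answer, "score": 1.0})
    return triples
```

As in Algorithm 2 (Line 14), triples produced by a matched question pattern carry a fixed score of 1.0, since the pattern match is exact.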

[Figure 2: QP-based question translation example (Q: who is the director of Forrest Gump; QP_pattern: who is the director of [Slot]; QP_predicate: Film.Film.Director).]

Question patterns are collected as follows: First, 5W queries, which begin with What, Where, Who, When, or Which, are selected from a large scale query log of a commercial search engine; then, a cleaned entity dictionary is used to annotate each query by replacing all the entity mentions it contains with the symbol [Slot]. Only high-frequency query patterns which contain one [Slot] are maintained. Lastly, annotators manually label the most frequent 50,000 query patterns with their corresponding predicates, and 4,764 question patterns with single labeled predicates are obtained.

From experiments (Table 3 in Section 4.3) we can see that question pattern-based question translation can achieve high end-to-end accuracy. But as human effort is needed in the mining procedure, this method cannot be extended to large scale very easily. Besides, different users often type questions with the same meaning in different NL expressions. For example, although the question "Forrest Gump was directed by which moviemaker" means the same as the question Q in Figure 2, no question pattern can cover it. We need an alternative way to alleviate such coverage issues.

2.3.2 Relation Expression-based Translation

Aiming to alleviate the coverage issue occurring in the QP-based method, an alternative relation expression (RE)-based method is proposed, which is used when the QP-based method fails.

We define RE_p as a relation expression set for a given KB predicate p ∈ KB. Each relation expression RE ∈ RE_p includes an expression string RE_expression, which must contain at least one content word, and a weight RE_weight, which denotes the confidence that the expression can represent p's meaning in NL. For example, "is the director of" is one relation expression string for the predicate Film.Film.Director, which means it is usually used to express this relation (predicate) in NL.

Algorithm 3: RE-based Question Translation
 1: T = ∅
 2: foreach entity mention e_Q ∈ Q do
 3:   foreach e ∈ KB s.t. e.name == e_Q do
 4:     foreach predicate p ∈ KB related to e do
 5:       score = Sim(e_Q, Q, RE_p)
 6:       if score > 0 then
 7:         create a new triple query q
 8:         q = {e, p, ?}
 9:         {A_i} = AnswerRetrieve(q, KB)
10:         foreach A_i ∈ {A_i} do
11:           create a new formal triple t
12:           t = {q.e_sbj, q.p, A_i}
13:           t.score = score
14:           insert t into T
15:         end
16:       end
17:     end
18:   end
19: end
20: sort T based on the score of each t ∈ T
21: return T

Algorithm 3 shows how to generate triples for a question Q based on relation expressions. For each possible entity mention e_Q ∈ Q and each KB predicate p ∈ KB that is related to a KB entity e whose name equals e_Q, Sim(e_Q, Q, RE_p) is computed (Line 5) based on the similarity between the question context and RE_p, which measures how likely Q can be transformed into the triple query q = {e, p, ?}. If this score is larger than 0, which means there are overlaps between Q's context and RE_p, then q will be used as the triple query of Q, and a set of formal triples will be generated based on q and KB (from Line 7 to Line 15). The computation of Sim(e_Q, Q, RE_p) is defined as follows:

    Sim(e_Q, Q, RE_p) = Σ_n [ 1 / (|Q| − n + 1) ] · Σ_{ω_n ∈ Q, ω_n ∩ e_Q = ∅} P(ω_n | RE_p)

where n is the n-gram order, which ranges from 1 to 5, ω_n is an n-gram occurring in Q without overlapping with e_Q and containing at least one content word, and P(ω_n | RE_p) is the posterior probability, which is computed by:

    P(ω_n | RE_p) = Count(ω_n, p) / Σ_{ω'_n ∈ RE_p} Count(ω'_n, p)

Count(ω, p) denotes the weighted sum of the times that ω occurs in RE_p:

    Count(ω, p) = Σ_{RE ∈ RE_p} #_ω(RE) · RE_weight

where #_ω(RE) denotes the number of times that ω occurs in RE_expression, and RE_weight is decided by the relation expression extraction component. Figure 3 gives an example, where the n-grams marked with rectangles are the ones that occur in both Q's context and the relation expression set of the given predicate p = Film.Film.Director. Unlike the QP-based method, which needs a perfect match, the RE-based method allows fuzzy matching between Q and RE_p, and records this score (Line 13) in the generated triples, which is used as a feature later.

[Figure 3: RE-based question translation example. Q: Forrest Gump was directed by which moviemaker; RE_{Film.Film.Director} includes expressions such as "is directed by", "was directed and written by", "is the moviemaker of", "was famous as the director of", ...]
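The overlap score Sim(e_Q, Q, RE_p) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: relation_expressions is a list of (expression string, weight) pairs standing in for RE_p, is_content_word is a hypothetical helper, and the overlap test between an n-gram and the entity mention is approximated by token overlap rather than span positions.

```python
from collections import Counter

def ngrams(tokens, n):
    return [" ".join(tokens[k:k + n]) for k in range(len(tokens) - n + 1)]

def sim(entity_mention, question_tokens, relation_expressions,
        is_content_word, max_order=5):
    """Sketch of Sim(e_Q, Q, RE_p): weighted n-gram overlap between the
    question context and the relation expression set of a predicate."""
    # Count(omega, p): weighted n-gram counts over the expression set.
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for expression, weight in relation_expressions:
        expr_tokens = expression.split()
        for n in counts:
            for gram in ngrams(expr_tokens, n):
                counts[n][gram] += weight          # sum of #omega(RE) * RE_weight
    totals = {n: sum(counts[n].values()) for n in counts}

    entity_tokens = set(entity_mention.split())
    score = 0.0
    for n in range(1, max_order + 1):
        grams = ngrams(question_tokens, n)
        if not grams or totals[n] == 0:
            continue
        norm = 1.0 / len(grams)                    # 1 / (|Q| - n + 1)
        for gram in grams:
            gram_tokens = gram.split()
            if entity_tokens & set(gram_tokens):   # must not overlap with e_Q
                continue
            if not any(is_content_word(w) for w in gram_tokens):
                continue                           # must contain a content word
            score += norm * counts[n][gram] / totals[n]   # P(omega_n | RE_p)
    return score
```

A triple generated from the query q keeps this score (Line 13 of Algorithm 3) so that it can be used as a feature later.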

Relation expressions are mined as follows: Given a set of KB assertions with an identical predicate p, we first extract all sentences from English Wiki pages [4], each of which contains at least one pair of entities occurring in one assertion. Then, we extract the shortest path between the paired entities in the dependency tree of each sentence as an RE candidate for the given predicate. The intuition is that any sentence containing such an entity pair occurring in an assertion is likely to express the predicate of that assertion in some way. Last, all relation expressions extracted are filtered by heuristic rules, i.e., the frequency must be larger than 4 and the length must be shorter than 10, and then weighted by the pattern scoring methods proposed in (Gerber and Ngomo, 2011; Gerber and Ngomo, 2012). For each predicate, we only keep the relation expressions whose pattern scores are larger than a pre-defined threshold. Figure 4 gives one relation expression extraction example. The statistics and overall quality of the relation expressions are listed in Section 4.1.

[4] http://en.wikipedia.org/wiki/Wikipedia:Database_download

[Figure 4: RE extraction example for the predicate p = Film.Film.Director. Paired entities of the predicate: {Forrest Gump, Robert Zemeckis}, {Titanic, James Cameron}, {The Dark Knight Rises, Christopher Nolan}. Passages retrieved from Wiki pages: "Robert Zemeckis is the director of Forrest Gump", "James Cameron is the moviemaker of Titanic", "The Dark Knight Rises is directed by Christopher Nolan". Weighted relation expressions: is the director of ||| 0.25; is the moviemaker of ||| 0.23; is directed by ||| 0.20.]

2.3.3 Question Decomposition

Sometimes, a question may provide multiple constraints on its answers. "movie starred by Tom Hanks in 1994" is one such question. All the films given as answers to this question should satisfy the following two constraints: (1) starred by Tom Hanks; and (2) released in 1994. It is easy to see that such questions cannot be translated into single triples.

We propose a dependency tree-based method to handle such multiple-constraint questions by (i) decomposing the original question into a set of sub-questions using syntax-based patterns, and (ii) intersecting the answers of all sub-questions as the final answers of the original question. Note that question decomposition only operates on the original question and on question spans covered by complete dependency subtrees. Four syntax-based patterns (Figure 5) are used for question decomposition. If a question matches any one of these patterns, then sub-questions are generated by collecting the paths between n_0 and each n_i (i > 0) in the pattern, where each n denotes a complete subtree with a noun, number, or question word as its root node, and the symbol ∗ above prep∗ denotes that this preposition can be skipped in matching. For the question mentioned at the beginning, its two generated sub-questions are "movie starred by Tom Hanks" and "movie starred in 1994", as its dependency form matches pattern (a). Similar ideas are used in IBM Watson (Kalyanpur et al., 2012) as well.

[Figure 5: Four syntax-based patterns (a)-(d) for question decomposition.]

As dependency parsing is not perfect, we also generate single triples for such questions without considering constraints, and add them to the search space H(Q) for competition. h_syntax_constraint(·) is used to boost triples that are converted from sub-questions generated by question decomposition: the more constraints an answer satisfies, the better. Obviously, the patterns currently used cannot cover all cases, only the most common ones. We leave a more general pattern mining method for future work.
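A minimal sketch of the decompose-and-intersect step, assuming two hypothetical helpers: decompose(question) returns the sub-questions produced by the syntax-based patterns (or just the original question when no pattern matches), and answer_question(question) runs the single-relation translation pipeline and returns its answers.

```python
def answer_with_constraints(question, decompose, answer_question):
    """Answer a multiple-constraint question by decomposing it into
    sub-questions and intersecting their answer sets (Section 2.3.3)."""
    sub_questions = decompose(question)     # e.g. ["movie starred by Tom Hanks",
                                            #       "movie starred in 1994"]
    if len(sub_questions) < 2:
        return answer_question(question)    # not a multiple-constraint question
    answers = None
    for sub in sub_questions:
        sub_answers = set(answer_question(sub))
        answers = sub_answers if answers is None else answers & sub_answers
    return sorted(answers)
```

As noted above, the un-decomposed question is still translated into single triples and kept in the search space, so derivations with and without constraints compete through the model features (e.g., h_syntax_constraint(·)).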

2.4 Feature Design

The objective of our KB-QA system is to seek the derivation ⟨D̂, Â⟩ that maximizes the probability P(⟨D, A⟩ | KB, Q) described in Section 2.1:

    ⟨D̂, Â⟩ = argmax_{⟨D, A⟩ ∈ H(Q)} P(⟨D, A⟩ | KB, Q)
            = argmax_{⟨D, A⟩ ∈ H(Q)} Σ_{i=1}^{M} λ_i · h_i(⟨D, A⟩, KB, Q)

We now introduce the feature set {h_i(·)} used in the above linear model:

• h_question_word(·), which counts the number of original question words occurring in A. It penalizes partially answered questions.

• h_span(·), which counts the number of spans in Q that are converted to formal triples. It controls the granularity of the spans used in question translation.

• h_syntax_subtree(·), which counts the number of spans in Q that are (1) converted to formal triples whose predicates are not Null, and (2) covered by complete dependency subtrees at the same time. The underlying intuition is that dependency subtrees of Q should be treated as units for question translation.

• h_syntax_constraint(·), which counts the number of triples in D that are converted from sub-questions generated by the question decomposition component.

• h_triple(·), which counts the number of triples in D whose predicates are not Null.

• h_tripleweight(·), which sums the scores of all triples {t_i} in D, as Σ_{t_i ∈ D} t_i.score.

• h_QPcount(·), which counts the number of triples in D that are generated by the QP-based question translation method.

• h_REcount(·), which counts the number of triples in D that are generated by the RE-based question translation method.

• h_staticrank_sbj(·), which sums the static rank scores of all subject entities in D's triple set, as Σ_{t_i ∈ D} t_i.e_sbj.static_rank.

• h_staticrank_obj(·), which sums the static rank scores of all object entities in D's triple set, as Σ_{t_i ∈ D} t_i.e_obj.static_rank.

• h_confidence_obj(·), which sums the confidence scores of all object entities in D's triple set, as Σ_{t ∈ D} t.e_obj.confidence.

For each assertion {e_sbj, p, e_obj} stored in KB, e_sbj.static_rank and e_obj.static_rank denote the static rank scores [5] of e_sbj and e_obj respectively; e_obj.confidence represents the probability p(e_obj | e_sbj, p). These three scores are used as features to rank the answers generated in the QA procedure.

[5] The static rank score of an entity represents a general indicator of the overall quality of that entity.

2.5 Feature Weight Tuning

Given a set of question-answer pairs {Q_i, A_i^ref} as the development (dev) set, we use the minimum error rate training (MERT) (Och, 2003) algorithm to tune the feature weights λ_1^M of our proposed model. The training criterion is to seek the feature weights that minimize the accumulated errors of the top-1 answers of questions in the dev set:

    λ̂_1^M = argmin_{λ_1^M} Σ_{i=1}^{N} Err(A_i^ref, Â_i; λ_1^M)

where N is the number of questions in the dev set, A_i^ref is the set of correct answers serving as references for the i-th question, Â_i is the top-1 answer candidate of the i-th question based on feature weights λ_1^M, and Err(·) is the error function, which is defined as:

    Err(A_i^ref, Â_i; λ_1^M) = 1 − δ(A_i^ref, Â_i)

where δ(A_i^ref, Â_i) is an indicator function that equals 1 when Â_i is included in the reference set A_i^ref, and 0 otherwise.
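A minimal sketch of the linear scoring and of the accumulated error that MERT minimizes; decode is a hypothetical stand-in for the full translation procedure that returns ranked answer candidates for given weights, and the MERT line search itself is not shown.

```python
def model_score(feature_values, weights):
    """Linear model score: sum_i lambda_i * h_i(<D, A>, KB, Q)."""
    return sum(w * h for w, h in zip(weights, feature_values))

def accumulated_top1_error(dev_set, weights, decode):
    """Accumulated error over the dev set, the quantity MERT minimizes.

    dev_set -- list of (question, reference_answer_set) pairs
    decode  -- decode(question, weights) -> answer candidates ranked by
               model score (hypothetical stand-in for the KB-QA decoder)
    """
    errors = 0
    for question, references in dev_set:
        candidates = decode(question, weights)
        top1 = candidates[0] if candidates else None
        # Err(A_ref, A_hat) = 1 - delta(A_ref, A_hat): counts as an error
        # unless the top-1 answer is included in the reference set.
        if top1 not in references:
            errors += 1
    return errors
```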

3 Comparison with Previous Work

Our work intersects with two research directions: semantic parsing and question answering.

Some previous work on semantic parsing (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Wong and Mooney, 2006; Zettlemoyer and Collins, 2007; Wong and Mooney, 2007; Kwiatkowski et al., 2010; Kwiatkowski et al., 2011) requires manually annotated logical forms as supervision, and it is hard to extend the resulting parsers from limited domains, such as GEO, JOBS and ATIS, to open domains. Recent work (Clarke and Lapata, 2010; Liang et al., 2013) has alleviated such issues by using question-answer pairs as weak supervision, but still with the shortcoming of using limited lexical triggers to link NL phrases to predicates. Poon (2013) has proposed an unsupervised method that adopts grounded learning to leverage the database for indirect supervision, but the transformation from NL questions to MRs heavily depends on dependency parsing results, and the KB used (ATIS) is limited as well. Kwiatkowski et al. (2013) use Wiktionary and a limited manual lexicon to map POS tags to a set of predefined CCG lexical categories, which aims to reduce the need for learning the lexicon from training data; but it still needs human effort to define the lexical categories, which usually cannot cover all semantic phenomena.

Berant et al. (2013) have not only enlarged the KB used to Freebase (Google, 2013), but also used a bigger lexical trigger set, extracted by the open IE method (Lin et al., 2012), for linking NL phrases to predicates. In comparison, our method has further advantages: (1) question answering and semantic parsing are performed in a joint way under a unified framework; (2) a robust method is proposed to map NL questions to their formal triple queries, which trades off the mapping quality by using question patterns and relation expressions in a cascaded way; and (3) we use a domain-independent feature set, which allows us to use a relatively small number of question-answer pairs to tune model parameters.

Fader et al. (2013) map questions to formal (triple) queries over a large scale, open-domain database of facts extracted from a raw corpus by ReVerb (Fader et al., 2011). Compared to their work, our method gains an improvement in two aspects: (1) instead of using facts extracted by the open IE method, we leverage a large scale, high-quality knowledge base; and (2) we can handle multiple-relation questions, instead of single-relation queries only, based on our translation-based KB-QA framework.

Espana-Bonet and Comas (2012) have proposed an MT-based method for factoid QA, but MT in their work means translating questions into n-best translations, which are used for finding similar sentences in the document collection that probably contain answers. Echihabi and Marcu (2003) have developed a noisy-channel model for QA, which explains how a sentence containing an answer to a given question can be rewritten into that question through a sequence of stochastic operations. Compared to the above two MT-motivated QA works, our method uses MT methodology to translate questions to answers directly.

4 Experiment

4.1 Data Sets

Following Berant et al. (2013), we use the same subset of WEBQUESTIONS (3,778 questions) as the development set (Dev) for weight tuning in MERT, and use the other part of WEBQUESTIONS (2,032 questions) as the test set (Test). Table 1 shows the statistics of this data set.

Table 1: Statistics of the evaluation set. #Questions is the number of questions in a data set; #Words is the average word count of a question.

  Data Set        #Questions   #Words
  WEBQUESTIONS    5,810        6.7

Table 2 shows the statistics of the question patterns and relation expressions used in our KB-QA system. As all question patterns are collected with human involvement, as discussed in Section 2.3.1, their quality is very high (98%). We also sample 1,000 instances from the whole relation expression set and manually label their quality; the accuracy is around 89%. These two resources cover 566 head predicates in our KB.

Table 2: Statistics of question patterns and relation expressions.

                        #Entries   Accuracy
  Question Patterns     4,764      98%
  Relation Expressions  133,445    89%

4.2 KB-QA Systems

Since Berant et al. (2013) is one of the latest works that reports QA results based on a large scale, general domain knowledge base (Freebase), we consider their evaluation result on WEBQUESTIONS as our baseline.

Our KB-QA system generates the k-best derivations for each question span, where k is set to 20.

The answers with the highest model scores are considered the best answers for evaluation. For evaluation, we follow Berant et al. (2013) to allow partial credit and score an answer using the F1 measure, comparing the predicted set of entities to the annotated set of entities.

One difference between the two systems is the KB used. Since Freebase is completely contained in our KB, we disallow all entities which are not included in Freebase. By doing so, our KB provides the same knowledge as Freebase does, which means we do not gain any extra advantage by using a larger KB. But we still allow ourselves to use the static rank scores and confidence scores of entities as features, as described in Section 2.4.
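The partial-credit scoring mentioned above compares the predicted and annotated entity sets with a set-based F1; a minimal sketch:

```python
def f1_score(predicted_entities, gold_entities):
    """Set-based F1 between predicted and annotated answer entities,
    giving partial credit to partially correct answer sets."""
    predicted, gold = set(predicted_entities), set(gold_entities)
    if not predicted or not gold:
        return 0.0
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```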
4.3 Evaluation Results

We first show the overall evaluation results of our KB-QA system and compare them with the baseline's results on Dev and Test. Note that we do not re-implement the baseline system, but just list the evaluation numbers reported in their paper. Comparison results are listed in Table 3.

Table 3: Accuracy on the evaluation sets. Accuracy is defined as the number of correctly answered questions divided by the total number of questions.

               Dev (Accuracy)   Test (Accuracy)
  Baseline     32.9%            31.4%
  Our Method   42.5% (+9.6%)    37.5% (+6.1%)

Table 3 shows that our KB-QA method outperforms the baseline on both Dev and Test. We think the potential reasons for this improvement include:

• Different methods are used to map NL phrases to KB predicates. Berant et al. (2013) have used a lexicon extracted from a subset of ReVerb triples (Lin et al., 2012), which is similar to the relation expression set used in our question translation. But as our relation expressions are extracted by an in-house extractor, we can record their extraction-related statistics as extra information and use them as features to measure the mapping quality. Besides, as a portion of the entities in our KB are extracted from Wiki, we know the one-to-one correspondence between such entities and Wiki pages, and we use this information in relation expression extraction for entity disambiguation. A lower disambiguation error rate results in better relation expressions.

• Question patterns are used to map NL context to KB predicates. The context can be either continuous or discontinuous phrases. Although the size of this set is limited, the patterns can actually cover head questions/queries [6] very well. The underlying intuition of using patterns is that high-frequency questions/queries should and can be treated and solved in the QA task by involving human effort at a relatively small price but with very impressive accuracy.

[6] Head questions/queries mean the questions/queries with high frequency and clear patterns.

In order to figure out the impacts of question patterns and relation expressions, another experiment (Table 4) is designed to evaluate their independent influences, where QP_only and RE_only denote the results of KB-QA systems which only allow question patterns or relation expressions in question translation, respectively.

Table 4: Impacts of question patterns and relation expressions. Precision is defined as the number of correctly answered questions divided by the number of questions with non-empty answers generated by our KB-QA system.

  Settings   Test (Accuracy)   Test (Precision)
  QP_only    11.8%             97.5%
  RE_only    32.5%             73.2%

From Table 4 we can see that the accuracy of RE_only on Test (32.5%) is slightly better than the baseline's result (31.4%). We think this improvement comes from two aspects: (1) the quality of the relation expressions is better than the quality of the lexicon entries used in the baseline; and (2) we use the extraction-related statistics of relation expressions as features, which brings more information to measure the confidence of the mapping between NL phrases and KB predicates, and makes the model more flexible. Meanwhile, QP_only performs worse (11.8%) than RE_only, due to coverage issues. But by comparing the precisions of these two settings, we find that QP_only (97.5%) outperforms RE_only (73.2%) significantly, due to its high quality. This means that how to extract high-quality question patterns is worth studying for the question answering task.

As the performance of our KB-QA system relies heavily on the k-best beam approximation, we evaluate the impact of the beam size and list the comparison results in Figure 6.

We can see that as we increase k incrementally, the accuracy increases as well. However, a larger k (e.g., 200) does not bring significant improvements compared to a smaller one (e.g., 20), while having a tremendous impact on system efficiency. So we choose k = 20 as the optimal value in the above experiments, which trades off between accuracy and efficiency.

[Figure 6: Impacts of beam size on accuracy (accuracy on Test for beam sizes 5, 20, 50, 100, and 200).]

Actually, the size of our system's search space is much smaller than that of the semantic parser used in the baseline. This is due to the fact that, if the triple queries generated by the question translation component cannot derive any answer from the KB, we discard such triple queries directly during the QA procedure. We can see that using a small k achieves better results than the baseline, whose beam size is set to 200.
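The k-best approximation discussed above amounts to pruning each chart cell down to its k highest-scoring derivations; a minimal sketch, reusing the hypothetical Derivation objects from the earlier search-space sketch:

```python
def prune_cell(derivations, k=20):
    """Keep only the k highest-scoring derivations of a chart cell.

    k = 20 is the beam size used in the experiments above; larger beams
    (e.g., 200) gave little accuracy gain at a substantial efficiency cost.
    """
    return sorted(derivations, key=lambda d: d.score, reverse=True)[:k]
```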

4.4 Error Analysis

4.4.1 Entity Detection

Since named entity recognizers trained on the Penn TreeBank usually perform poorly on web queries, we instead use a simple string-match method to detect entity mentions in the question, using a cleaned entity dictionary dumped from our KB. One problem of doing so is the entity detection issue. For example, in the question "who was Esther's husband?", we cannot detect Esther as an entity, as it is just part of an entity name. We need an ad-hoc entity detection component to handle such issues, especially for a web scenario, where users often type entity names in partial or abbreviated forms.

4.4.2 Predicate Mapping

Some questions lack sufficient evidence to detect predicates. "where is Byron Nelson 2012?" is an example. Since each relation expression must contain at least one content word, this question cannot match any relation expression: except for Byron Nelson and 2012, all the other words are non-content words.

Besides, ambiguous entries contained in the relation expression sets of different predicates can bring mapping errors as well. Take the question "who did Steve Spurrier play pro football for?" as an example: since the unigram "play" exists in the relation expression sets of both Film.Film.Actor and American Football.Player.Current Team, we made a wrong prediction, which led to wrong answers.

4.4.3 Specific Questions

Sometimes, we cannot give exact answers to superlative questions like "what is the first book Sherlock Holmes appeared in?". For this example, we can give all the book names in which Sherlock Holmes appeared, but we cannot rank them based on their publication dates, as we cannot automatically learn from training data the alignment between the constraint word "first" occurring in the question and the predicate Book.Written Work.Date Of First Publication. Although we have followed some work (Poon, 2013; Liang et al., 2013) in handling such special linguistic phenomena by defining specific operators, it is still hard to cover all unseen cases. We leave this to future work as an independent topic.

5 Conclusion and Future Work

This paper presents a translation-based KB-QA method that integrates semantic parsing and QA in one unified framework. Compared to a baseline system using an independent semantic parser with state-of-the-art performance, we achieve better results on a general domain evaluation set.

Several directions can be further explored in the future: (i) we plan to design a method that can extract question patterns automatically, using existing labeled question patterns and the KB as weak supervision; as discussed in the experiment section, how to mine high-quality question patterns is worth further study for the QA task; (ii) we plan to integrate an ad-hoc NER into our KB-QA system to alleviate the entity detection issue; (iii) in fact, our proposed QA framework can be generalized to other intelligence besides knowledge bases as well: any method that can generate answers to questions, such as a Web-based QA approach, can be integrated into this framework by using it in the question translation component.

References

Yoav Artzi and Luke S. Zettlemoyer. 2011. Bootstrapping semantic parsers from conversations. In EMNLP, pages 421–432.

Yoav Artzi, Nicholas FitzGerald, and Luke S. Zettlemoyer. 2013. Semantic parsing with combinatory categorial grammars. In ACL (Tutorial Abstracts), page 2.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In EMNLP, pages 1533–1544.

Qingqing Cai and Alexander Yates. 2013. Large-scale semantic parsing via schema matching and lexicon extension. In ACL, pages 423–433.

James Clarke and Mirella Lapata. 2010. Discourse constraints for document compression. Computational Linguistics, 36(3):411–441.

Abdessamad Echihabi and Daniel Marcu. 2003. A noisy-channel approach to question answering. In ACL.

Cristina Espana-Bonet and Pere R. Comas. 2012. Full machine translation for factoid question answering. In EACL, pages 20–29.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In EMNLP, pages 1535–1545.

Anthony Fader, Luke S. Zettlemoyer, and Oren Etzioni. 2013. Paraphrase-driven learning for open question answering. In ACL, pages 1608–1618.

Daniel Gerber and Axel-Cyrille Ngonga Ngomo. 2011. Bootstrapping the linked data web. In ISWC.

Daniel Gerber and Axel-Cyrille Ngonga Ngomo. 2012. Extracting multilingual natural-language patterns for RDF predicates. In ESWC.

Google. 2013. Freebase. http://www.freebase.com.

Aditya Kalyanpur, Siddharth Patwardhan, Branimir Boguraev, Adam Lally, and Jennifer Chu-Carroll. 2012. Fact-based question decomposition in DeepQA. IBM Journal of Research and Development, 56(3):13.

Tom Kwiatkowski, Luke S. Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2010. Inducing probabilistic CCG grammars from logical form with higher-order unification. In EMNLP, pages 1223–1233.

Tom Kwiatkowski, Luke S. Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2011. Lexical generalization in CCG grammar induction for semantic parsing. In EMNLP, pages 1512–1523.

Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke S. Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In EMNLP, pages 1545–1556.

Percy Liang, Michael I. Jordan, and Dan Klein. 2011. Learning dependency-based compositional semantics. In ACL, pages 590–599.

Percy Liang, Michael I. Jordan, and Dan Klein. 2013. Learning dependency-based compositional semantics. Computational Linguistics, 39(2):389–446.

Thomas Lin, Mausam, and Oren Etzioni. 2012. Entity linking at web scale. In AKBC-WEKEX, pages 84–88.

Raymond J. Mooney. 2007. Learning for semantic parsing. In CICLing, pages 311–324.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In ACL, pages 160–167.

Hoifung Poon. 2013. Grounded unsupervised semantic parsing. In ACL, pages 933–943.

Yuk Wah Wong and Raymond J. Mooney. 2006. Learning for semantic parsing with statistical machine translation. In HLT-NAACL.

Yuk Wah Wong and Raymond J. Mooney. 2007. Learning synchronous grammars for semantic parsing with lambda calculus. In ACL.

John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In AAAI/IAAI, Vol. 2, pages 1050–1055.

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI, pages 658–666.

Luke S. Zettlemoyer and Michael Collins. 2007. Online learning of relaxed CCG grammars for parsing to logical form. In EMNLP-CoNLL, pages 678–687.