Question Answering on Freebase via Relation Extraction and Textual Evidence

Kun Xu¹, Siva Reddy², Yansong Feng¹*, Songfang Huang³ and Dongyan Zhao¹
¹Institute of Computer Science & Technology, Peking University, Beijing, China
²School of Informatics, University of Edinburgh, UK
³IBM China Research Lab, Beijing, China
{xukun, fengyansong, zhaody}@pku.edu.cn
[email protected]
[email protected]

Abstract

Existing knowledge-based question answering systems often rely on small annotated training data. While shallow methods like relation extraction are robust to data scarcity, they are less expressive than deep meaning representation methods like semantic parsing, and thereby fail at answering questions involving multiple constraints. Here we alleviate this problem by empowering a relation extraction method with additional evidence from Wikipedia. We first present a neural network based relation extractor to retrieve the candidate answers from Freebase, and then infer over Wikipedia to validate these answers. Experiments on the WebQuestions question answering dataset show that our method achieves an F1 of 53.3%, a substantial improvement over the state-of-the-art.

1 Introduction

Since the advent of large structured knowledge bases (KBs) like Freebase (Bollacker et al., 2008), YAGO (Suchanek et al., 2007) and DBpedia (Auer et al., 2007), answering natural language questions using those structured KBs, also known as KB-based question answering (or KB-QA), has attracted increasing research efforts from both the natural language processing and information retrieval communities.

The state-of-the-art methods for this task can be roughly categorized into two streams. The first is based on semantic parsing (Berant et al., 2013; Kwiatkowski et al., 2013), which typically learns a grammar that can parse natural language into a sophisticated meaning representation language. But such sophistication requires a lot of annotated training examples that contain compositional structures, a practically impossible solution for large KBs such as Freebase. Furthermore, mismatches between grammar-predicted structures and the KB structure are also a common problem (Kwiatkowski et al., 2013; Berant and Liang, 2014; Reddy et al., 2014).

On the other hand, instead of building a formal meaning representation, information extraction methods retrieve a set of candidate answers from the KB using relation extraction (Yao and Van Durme, 2014; Yih et al., 2014; Yao, 2015; Bast and Haussmann, 2015) or distributed representations (Bordes et al., 2014; Dong et al., 2015). Designing large training datasets for these methods is relatively easy (Yao and Van Durme, 2014; Bordes et al., 2015; Serban et al., 2016). These methods are often good at producing an answer irrespective of its correctness. However, handling compositional questions that involve multiple entities and relations still remains a challenge. Consider the question what mountain is the highest in north america. Relation extraction methods typically answer with all the mountains in North America because they lack a sophisticated representation for the mathematical function highest. To select the correct answer, one has to retrieve all the heights of the mountains, sort them in descending order, and then pick the first entry. We propose a method based on textual evidence which can answer such questions without explicitly modeling these mathematical functions.

Knowledge bases like Freebase capture real world facts, and Web resources like Wikipedia provide a large repository of sentences that validate or support these facts. For example, a sentence in Wikipedia says, Denali (also known as Mount McKinley, its former official name) is the highest mountain peak in North America, with a summit elevation of 20,310 feet (6,190 m) above sea level. To answer our example question against a KB using a relation extractor, we can use this sentence as external evidence, filter out wrong answers and pick the correct one.

Using textual evidence not only mitigates representational issues in relation extraction, but also alleviates the data scarcity problem to some extent. Consider the question, who was queen isabella's mother. Answering this question involves predicting two constraints hidden in the word mother. One constraint is that the answer should be the parent of Isabella, and the other is that the answer's gender is female. Such words with multiple latent constraints have been a pain in the neck for both semantic parsing and relation extraction, and require larger training data (this phenomenon is coined as sub-lexical compositionality by Wang et al. (2015)). Most systems are good at triggering the parent constraint, but fail on the other, i.e., that the answer entity should be female. Whereas the textual evidence from Wikipedia, ... her mother was Isabella of Barcelos ..., can act as a further constraint to answer the question correctly.

We present a novel method for question answering which infers on both structured and unstructured resources. Our method consists of two main steps as outlined in §2. In the first step we extract answers for a given question using a structured KB (here Freebase) by jointly performing entity linking and relation extraction (§3). In the next step we validate these answers using an unstructured resource (here Wikipedia) to prune out the wrong answers and select the correct ones (§4). Our evaluation results on the benchmark dataset WebQuestions show that our method outperforms existing state-of-the-art models. Details of our experimental setup and results are presented in §5. Our code, data and results can be downloaded from https://github.com/syxu828/QuestionAnsweringOverFB.

Figure 1: An illustration of our method to find answers for the given question who did shaq first play for. (The figure shows the KB-QA step, i.e., entity linking, relation extraction and joint inference over Freebase, followed by the answer refinement step over Wikipedia.)

2 Our Method

Figure 1 gives an overview of our method for the question "who did shaq first play for". We have two main steps: (1) inference on Freebase (KB-QA box); and (2) further inference on Wikipedia (Answer Refinement box). Let us take a close look into step 1. Here we perform entity linking to identify a topic entity in the question and its possible Freebase entities. We employ a relation extractor to predict the potential Freebase relations that could exist between the entities in the question and the answer entities. Later we perform a joint inference step over the entity linking and relation extraction results to find the best entity-relation configuration, which will produce a list of candidate answer entities. In step 2, we refine these candidate answers by applying an answer refinement model which takes the Wikipedia page of the topic entity into consideration to filter out the wrong answers and pick the correct ones.

While the overview in Figure 1 works for questions containing a single Freebase relation, it also works for questions involving multiple Freebase relations. Consider the question who plays anakin skywalker in star wars 1. The actors who are the answers to this question should satisfy the following constraints: (1) the actor played anakin skywalker; and (2) the actor played in star wars 1. Inspired by Bao et al. (2014), we design a dependency tree-based method to handle such multi-relational questions. We first decompose the original question into a set of sub-questions using syntactic patterns which are listed in the Appendix. The final answer set of the original question is obtained by intersecting the answer sets of all its sub-questions. For the example question, the sub-questions are who plays anakin skywalker and who plays in star wars 1.
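The decompose-answer-intersect strategy can be summarized with a short sketch. The helper arguments decompose and answer_subquestion are hypothetical stand-ins for the components described in Sections 3 and 4, and the facts in the usage example are toy values for illustration only.

```python
# A minimal sketch of the decompose-answer-intersect strategy, assuming the
# sub-question answering pipeline of Sections 3 and 4 is available as a callable.

def answer_question(question, decompose, answer_subquestion):
    """Answer a (possibly multi-relational) question by intersecting the
    answer sets of its sub-questions."""
    sub_questions = decompose(question) or [question]
    answer_sets = [set(answer_subquestion(q)) for q in sub_questions]
    return set.intersection(*answer_sets)

# Toy usage for "who plays anakin skywalker in star wars 1":
toy_answers = {
    "who plays anakin skywalker": {"Jake Lloyd", "Hayden Christensen"},
    "who plays in star wars 1": {"Jake Lloyd", "Ewan McGregor", "Natalie Portman"},
}
print(answer_question(
    "who plays anakin skywalker in star wars 1",
    decompose=lambda q: ["who plays anakin skywalker", "who plays in star wars 1"],
    answer_subquestion=lambda q: toy_answers[q],
))  # -> {'Jake Lloyd'}
```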

These sub-questions are answered separately over Freebase and Wikipedia, and the intersection of their answers is treated as the final answer.

3 Inference on Freebase

Given a sub-question, we assume the question word¹ that represents the answer has a distinct KB relation r with an entity e found in the question, and predict a single KB triple (e, r, ?) for each sub-question (here ? stands for the answer entities). The QA problem is thus formulated as an information extraction problem that involves two sub-tasks, i.e., entity linking and relation extraction. We first introduce these two components, and then present a joint inference procedure which further boosts the overall performance.

3.1 Entity Linking

For each question, we use hand-built sequences of part-of-speech categories to identify all possible named entity mention spans, e.g., the sequence NN (shaq) may indicate an entity. For each mention span, we use the entity linking tool S-MART² (Yang and Chang, 2015) to retrieve the top 5 entities from Freebase. These entities are treated as candidate entities that will eventually be disambiguated in the joint inference step. For a given mention span, S-MART first retrieves all possible entities of Freebase by surface matching, and then ranks them using a statistical model, which is trained on the frequency counts with which the surface form occurs with the entity.

3.2 Relation Extraction

We now proceed to identify the relation between the answer and the entity in the question. Inspired by the recent success of neural network models in KB question answering (Yih et al., 2015; Dong et al., 2015), and the success of syntactic dependencies for relation extraction (Liu et al., 2015; Xu et al., 2015), we propose a Multi-Channel Convolutional Neural Network (MCCNN) which can exploit both syntactic and sentential information for relation extraction.

Figure 2: Overview of the multi-channel convolutional neural network for relation extraction. W_e is the word embedding matrix, W_1 is the convolution matrix, W_2 is the activation matrix and W_3 is the classification matrix.

3.2.1 MCCNNs for Relation Classification

In MCCNN, we use two channels, one for syntactic information and the other for sentential information. The network structure is illustrated in Figure 2. The convolution layer tackles an input of varying length, returning a fixed-length vector (we use max pooling) for each channel. These fixed-length vectors are concatenated and then fed into a softmax classifier, the output dimension of which is equal to the number of predefined relation types. The value of each dimension indicates the confidence score of the corresponding relation.

Syntactic Features. We use the shortest path between an entity mention and the question word in the dependency tree³ as input to the first channel. Similar to Xu et al. (2015), we treat the path as a concatenation of vectors of words, dependency edge directions and dependency labels, and feed it to the convolution layer. Note that the entity mention and the question word are excluded from the dependency path so as to learn a more general relation representation at the syntactic level. As shown in Figure 2, the dependency path between who and shaq is ← dobj – play – nsubj →.

¹who, when, what, where, how, which, why, whom, whose.
²The S-MART demo can be accessed at http://msre2edemo.azurewebsites.net/
³We use the Stanford CoreNLP dependency parser (Manning et al., 2014).
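For concreteness, the sketch below shows one way to realize the two-channel network just described in PyTorch, using the layer sizes reported in Section 5.2 (50-dimensional embeddings, hidden sizes 200 and 100, window size 3, 461 relations). The activation function, the shared embedding matrix and all identifiers are assumptions made for illustration; this is not the authors' implementation. The second channel's input (the sentential words) is described in the next paragraph.

```python
# A minimal PyTorch sketch of a two-channel MCCNN: each channel convolves its
# token sequence, max-pools it to a fixed-length vector, and the concatenated
# vectors feed a softmax over KB relations. Hyperparameters follow Section 5.2;
# the rest (ReLU, shared embeddings, names) are illustrative assumptions.
import torch
import torch.nn as nn

class MCCNN(nn.Module):
    def __init__(self, vocab_size, emb_dim=50, syn_hidden=200, sent_hidden=100,
                 num_relations=461, window=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)               # W_e
        self.syn_conv = nn.Conv1d(emb_dim, syn_hidden, window)     # channel 1: dependency path
        self.sent_conv = nn.Conv1d(emb_dim, sent_hidden, window)   # channel 2: sentential words
        self.out = nn.Linear(syn_hidden + sent_hidden, num_relations)

    def _channel(self, conv, token_ids):
        x = self.emb(token_ids).transpose(1, 2)    # (batch, emb_dim, seq_len)
        h = torch.relu(conv(x))                    # (batch, hidden, seq_len - window + 1)
        return h.max(dim=2).values                 # max pooling -> fixed-length vector

    def forward(self, path_ids, sentence_ids):
        feats = torch.cat([self._channel(self.syn_conv, path_ids),
                           self._channel(self.sent_conv, sentence_ids)], dim=1)
        return self.out(feats)                     # logits; softmax gives relation confidences

# Toy usage: a batch of one question, with padded token-id sequences per channel.
model = MCCNN(vocab_size=10000)
logits = model(torch.randint(0, 10000, (1, 6)), torch.randint(0, 10000, (1, 12)))
relation_probs = torch.softmax(logits, dim=1)
```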

Sentential Features. This channel takes the words in the sentence as input, excluding the question word and the entity mention. As illustrated in Figure 2, the vectors for did, first, play and for are fed into this channel.

3.2.2 Objective Function and Learning

The model is learned using pairs of a question and its corresponding gold relation from the training data. Given an input question x with an annotated entity mention, the network outputs a vector o(x), where the entry o_k(x) is the probability that the k-th relation holds between the entity and the expected answer. We denote by t(x) ∈ R^{K×1} the target distribution vector, in which the value for the gold relation is set to 1, and all others to 0. We compute the cross entropy error between t(x) and o(x), and define the objective function over the training data as:

    J(\theta) = -\sum_{x} \sum_{k=1}^{K} t_k(x) \log o_k(x) + \lambda \lVert \theta \rVert_2^2

where θ represents the weights and λ the L2 regularization parameter. The gradients of J(θ) with respect to the weights θ can be efficiently computed via back-propagation through the network structures. To minimize J(θ), we apply stochastic gradient descent (SGD) with AdaGrad (Duchi et al., 2011).

3.3 Joint Entity Linking & Relation Extraction

A pipeline of entity linking and relation extraction may suffer from error propagation. As we know, entities and relations have strong selectional preferences: certain entities do not appear with certain relations and vice versa. Locally optimized models cannot exploit these implicit bi-directional preferences. Therefore, we use a joint model to find a globally optimal entity-relation assignment from the local predictions. The key idea is to leverage various clues from the two local models and the KB to rank a correct entity-relation assignment higher than other combinations. We describe the learning procedure and the features below.

3.3.1 Learning

Suppose the pair (e_gold, r_gold) represents the gold entity/relation pair for a question q. We take all our entity and relation predictions for q, create a list of entity and relation pairs {(e_0, r_0), (e_1, r_1), ..., (e_n, r_n)} from q, and rank them using an SVM rank classifier (Joachims, 2006) which is trained to predict a rank for each pair. Ideally, a higher rank indicates the prediction is closer to the gold prediction. For training, the SVM rank classifier requires a ranked or scored list of entity-relation pairs as input. We create the training data containing ranked input pairs as follows: if both e_pred = e_gold and r_pred = r_gold, we assign it a score of 3. If only the entity or only the relation equals the gold one (i.e., e_pred = e_gold, r_pred ≠ r_gold, or e_pred ≠ e_gold, r_pred = r_gold), we assign a score of 2 (encouraging partial overlap). When both the entity and relation assignments are wrong, we assign a score of 1.

3.3.2 Features

For a given entity-relation pair, we extract the following features, which are passed as an input vector to the SVM ranker above:

Entity Clues. We use the score of the predicted entity returned by the entity linking system as a feature. The number of word overlaps between the entity mention and the entity's Freebase name is also included as a feature. In Freebase, most entities have a relation fb:description which describes the entity. For instance, in the running example, shaq is linked to three potential entities m.06_ttvh (Shaq Vs. Television Show), m.05n7bp (Shaq Fu Video Game) and m.012xdf (Shaquille O'Neal). Interestingly, the word play only appears in the description of Shaquille O'Neal, and it occurs three times. We count the content word overlap between the given question and the entity's description, and include it as a feature.

Relation Clues. The score of the relation returned by the MCCNNs is used as a feature. Furthermore, we view each relation as a document which consists of the training questions in which this relation is expressed. For a given question, we use the sum of the tf-idf scores of its words with respect to the relation as a feature. A Freebase relation r is a concatenation of a series of fragments r = r_1.r_2.r_3. For instance, the three fragments of people.person.parents are people, person and parents. The first two fragments indicate the Freebase type of the subject of this relation, and the third fragment indicates the object type, in our case the answer type. We use an indicator feature to denote whether the surface form of the third fragment (here parents) appears in the question.

Answer Clues. The above two feature classes are local features. From the entity-relation (e, r) pair, we create the query triple (e, r, ?) to retrieve the answers, and further extract features from the answers. These features are non-local since we require both e and r to retrieve the answer. One such feature is the co-occurrence of the answer type and the question word, based on the intuition that question words often indicate the answer type, e.g., the question word when usually indicates the answer type type.datetime. Another feature is the number of answer entities retrieved.
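Tying Sections 3.3.1 and 3.3.2 together, the sketch below builds scored training examples for an SVM-rank style learner following the 3/2/1 scheme above. Here extract_features is a hypothetical stand-in for the entity, relation and answer clues just described, and the candidate pairs in the usage are toy values.

```python
# A small sketch of scoring (entity, relation) candidates against the gold pair
# for SVM-rank training, following the 3/2/1 scheme described above.

def rank_score(pred_entity, pred_relation, gold_entity, gold_relation):
    """Higher scores mean the candidate is closer to the gold pair."""
    if pred_entity == gold_entity and pred_relation == gold_relation:
        return 3                      # both correct
    if pred_entity == gold_entity or pred_relation == gold_relation:
        return 2                      # partial overlap is encouraged
    return 1                          # both wrong

def build_ranking_examples(candidates, gold, extract_features):
    """Yield (score, feature_vector) pairs for one question (one ranking query)."""
    gold_entity, gold_relation = gold
    for entity, relation in candidates:
        yield (rank_score(entity, relation, gold_entity, gold_relation),
               extract_features(entity, relation))

# Toy usage with dummy candidates and a dummy feature extractor:
examples = list(build_ranking_examples(
    candidates=[("m.012xdf", "sports.pro_athlete.teams"),
                ("m.05n7bp", "people.person.parents")],
    gold=("m.012xdf", "sports.pro_athlete.teams"),
    extract_features=lambda e, r: [1.0, 0.0],
))
```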

4 Inference on Wikipedia

We use the best ranked entity-relation pair from the above step to retrieve candidate answers from Freebase. In this step, we validate these answers using Wikipedia as our unstructured knowledge resource, where most statements are verified for factuality by multiple people.

Our refinement model is inspired by the intuition of how people refine their answers. If you ask someone: who did shaq first play for, and give them four candidate answers (Los Angeles Lakers, Boston Celtics, Orlando Magic and Miami Heat), as well as access to Wikipedia, that person might first determine that the question is about Shaquille O'Neal, then go to O'Neal's Wikipedia page, and search for the sentences that contain the candidate answers as evidence. By analyzing these sentences, one can figure out whether a candidate answer is correct or not.

4.1 Finding Evidence from Wikipedia

As mentioned above, we first find the Wikipedia page corresponding to the topic entity in the given question. We use the Freebase API to convert a Freebase entity to its Wikipedia page. We extract the content from the Wikipedia page and process it with Wikifier (Cheng and Roth, 2013), which recognizes Wikipedia entities that can further be linked to Freebase entities using the Freebase API. Additionally, we use Stanford CoreNLP (Manning et al., 2014) for tokenization and entity co-reference resolution. We then search for the sentences containing the candidate answer entities retrieved from Freebase. For example, the Wikipedia page of O'Neal contains the sentence "O'Neal was drafted by the Orlando Magic with the first overall pick in the 1992 NBA draft", which is taken into account by the refinement model (our inference model on Wikipedia) to discriminate whether Orlando Magic is the answer for the given question.

4.2 Refinement Model

We treat the refinement process as a binary classification task over the candidate answers, i.e., correct (positive) versus incorrect (negative) answers. We prepare the training data for the refinement model as follows. On the training dataset, we first infer on Freebase to retrieve the candidate answers. Then we use the annotated gold answers of these questions and Wikipedia to create the training data. Specifically, we treat the sentences that contain correct/incorrect answers as positive/negative examples for the refinement model. We use LIBSVM (Chang and Lin, 2011) to learn the weights for classification.

Note that, in the Wikipedia page of the topic entity, we may collect more than one sentence that contains a candidate answer. However, not all sentences are relevant; therefore, we consider a candidate answer as correct if there is at least one piece of positive evidence. On the other hand, sometimes we may not find any evidence for a candidate answer. In these cases, we fall back to the results of the KB-based approach.

4.3 Lexical Features

Regarding the features used in LIBSVM, we use the following lexical features extracted from the question and a Wikipedia sentence. Formally, given a question q = ⟨q_1, ..., q_n⟩ and an evidence sentence s = ⟨s_1, ..., s_m⟩, we denote the tokens of q and s by q_i and s_j, respectively. For each pair (q, s), we identify the set of all possible token pairs (q_i, s_j), the occurrences of which are used as features. As learning proceeds, we hope to learn a higher weight for a feature like (first, drafted) and a lower weight for (first, played).

5 Experiments

In this section we introduce the experimental setup, the main results and a detailed analysis of our system.

5.1 Training and Evaluation Data

We use the WebQuestions (Berant et al., 2013) dataset, which contains 5,810 questions crawled via the Google Suggest service, with answers annotated on Amazon Mechanical Turk. The questions are split into training and test sets, which contain 3,778 questions (65%) and 2,032 questions (35%), respectively. We further split the training questions into 80%/20% for development.
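Looking back at Section 4, the refinement step can be illustrated with a short sketch: a candidate answer is kept if at least one evidence sentence mentioning it is classified as positive with the token-pair features of Section 4.3. The classifier and whitespace tokenization below are simplified stand-ins (the paper uses LIBSVM and Stanford CoreNLP), and all names are illustrative.

```python
# A simplified sketch of answer refinement (Sections 4.2-4.3): keep a candidate
# if at least one evidence sentence is classified as positive; otherwise fall
# back to the KB-based answers. `classify_sentence` stands in for the LIBSVM
# model over token-pair features.
from itertools import product

def token_pair_features(question_tokens, sentence_tokens):
    """The lexical features of Section 4.3: all (q_i, s_j) token pairs."""
    return set(product(question_tokens, sentence_tokens))

def refine(question, candidate_sentences, classify_sentence, kb_answers):
    kept = [cand for cand, sentences in candidate_sentences.items()
            if any(classify_sentence(token_pair_features(question.split(), s.split()))
                   for s in sentences)]
    return kept or kb_answers

# Toy usage: a hand-written "classifier" that looks for the (first, drafted) pair.
question = "who did shaq first play for"
evidence = {
    "Orlando Magic": ["O'Neal was drafted by the Orlando Magic with the first overall pick"],
    "Los Angeles Lakers": ["O'Neal signed as a free agent with the Los Angeles Lakers"],
}
answers = refine(question, evidence,
                 classify_sentence=lambda feats: ("first", "drafted") in feats,
                 kb_answers=list(evidence))
# -> ['Orlando Magic']
```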

To train the MCCNNs and the joint inference model, we need the gold standard relations of the questions. Since this dataset contains only question-answer pairs and annotated topic entities, instead of relying on gold relations we rely on surrogate gold relations which produce answers that have the highest overlap with the gold answers. Specifically, for a given question, we first locate the topic entity e in the Freebase graph, then select the 1-hop and 2-hop relations connected to the topic entity as relation candidates. The 2-hop relations refer to the n-ary relations of Freebase, i.e., the first hop goes from the subject to a mediator node, and the second from the mediator to the object node. For each relation candidate r, we issue the query (e, r, ?) to the KB, and label the relation that produces the answer with minimal F1-loss against the gold answer as the surrogate gold relation. From the training set, we collect 461 relations to train the MCCNN, and the target prediction during testing time is over these relations.

5.2 Experimental Settings

We have 6 dependency tree patterns based on Bao et al. (2014) to decompose a question into sub-questions (see the Appendix). We initialize the word embeddings with Turian et al. (2010)'s word representations with dimensions set to 50. The hyperparameters in our model are tuned using the development set. The window size of MCCNN is set to 3. The sizes of the hidden layer 1 and the hidden layer 2 of the two MCCNN channels are set to 200 and 100, respectively. We use the Freebase version of Berant et al. (2013), containing 4M entities and 5,323 relations.

5.3 Results and Discussion

We use the average question-wise F1 as our evaluation metric.⁴ To give an idea of the impact of different configurations of our method, we compare the following with existing methods.

Structured. This method involves inference on Freebase only. First the entity linking (EL) system is run to predict the topic entity. Then we run the relation extraction (RE) system and select the best relation that can occur with the topic entity. We choose this entity-relation pair to predict the answer.

Structured + Joint. In this method, instead of the above pipeline, we perform joint EL and RE as described in §3.3.

Structured + Unstructured. We use the pipelined EL and RE along with inference on Wikipedia as described in §4.

Structured + Joint + Unstructured. This is our main model. We perform inference on Freebase using joint EL and RE, and then inference on Wikipedia to validate the results. Specifically, we treat the top two predictions of the joint inference model as the candidate subject and relation pairs, extract the corresponding answers from each pair, take the union, and filter the answer set using Wikipedia.

Method                                    average F1
Berant et al. (2013)                      35.7
Yao and Van Durme (2014)                  33.0
Xu et al. (2014)                          39.1
Berant and Liang (2014)                   39.9
Bao et al. (2014)                         37.5
Bordes et al. (2014)                      39.2
Dong et al. (2015)                        40.8
Yao (2015)                                44.3
Bast and Haussmann (2015)                 49.4
Berant and Liang (2015)                   49.7
Reddy et al. (2016)                       50.3
Yih et al. (2015)                         52.5
This work
  Structured                              44.1
  Structured + Joint                      47.1
  Structured + Unstructured               47.0
  Structured + Joint + Unstructured       53.3

Table 1: Results on the test set.

Table 1 summarizes the results on the test data along with results from the literature.⁵ We can see that joint EL and RE performs better than the default pipelined approach, and outperforms most semantic parsing based models, except Berant and Liang (2015), which searches partial logical forms in strategic order by combining imitation learning and agenda-based parsing. In addition, inference on unstructured data helps the default model. The joint EL and RE combined with inference on unstructured data further improves the default pipelined model by 9.2% (from 44.1% to 53.3%), and achieves a new state-of-the-art result, beating the previously reported best result of Yih et al. (2015) (with one-tailed t-test significance of p < 0.05).

⁴We use the evaluation script available at http://www-nlp.stanford.edu/software/sempre.
⁵We use development data for all our ablation experiments. Similar trends are observed on both development and test results.
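Returning to the training-data construction of Section 5.1, both the surrogate gold relation selection and the question-wise F1 metric rest on the same set-overlap F1. A small sketch is given below, where query_kb is a hypothetical stand-in for issuing (e, r, ?) queries against Freebase and the usage values are toy data.

```python
# A small sketch of surrogate-gold-relation selection: among candidate relations
# reachable from the topic entity, pick the one whose retrieved answers give the
# highest F1 (i.e., minimal F1-loss) against the annotated gold answers.

def f1(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def surrogate_gold_relation(topic_entity, candidate_relations, gold_answers, query_kb):
    return max(candidate_relations,
               key=lambda r: f1(query_kb(topic_entity, r), gold_answers))

# Toy usage:
toy_kb = {
    ("m.012xdf", "sports.pro_athlete.teams"): ["Orlando Magic", "Los Angeles Lakers"],
    ("m.012xdf", "people.person.nationality"): ["United States"],
}
best = surrogate_gold_relation(
    "m.012xdf",
    ["sports.pro_athlete.teams", "people.person.nationality"],
    gold_answers=["Orlando Magic"],
    query_kb=lambda e, r: toy_kb.get((e, r), []),
)  # -> "sports.pro_athlete.teams"
```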

                   Entity Linking   Relation Extraction
                   Accuracy         Accuracy
Isolated Model     79.8             45.9
Joint Inference    83.2             55.3

Table 2: Impact of the joint inference on the development set.

Method                                         average F1
Structured (syntactic)                         38.1
Structured (sentential)                        38.7
Structured (syntactic + sentential)            40.1
Structured + Joint (syntactic)                 43.6
Structured + Joint (sentential)                44.1
Structured + Joint (syntactic + sentential)    45.8

Table 3: Impact of different MCCNN channels on the development set.

Question & Answers
1. what is the largest nation in europe
   Before: Kazakhstan, Turkey, Russia, ...
   After: Russia
2. which country in europe has the largest land area
   Before: Georgia, France, Russia, ...
   After: Russian Empire, Russia
3. what year did ray allen join the nba
   Before: 2007, 2003, 1996, 1993, 2012
   After: 1996
4. who is emma stone father
   Before: Jeff Stone, Krista Stone
   After: Jeff Stone
5. where did john steinbeck go to college
   Before: Salinas High School, Stanford University
   After: Stanford University

Table 4: Example questions and corresponding predicted answers before and after using unstructured inference. Before uses the (Structured + Joint) model, and After uses the Structured + Joint + Unstructured model for prediction. The colors blue and red indicate correct and wrong answers respectively.

5.3.1 Impact of Joint EL & RE

From Table 1, we can see that the joint EL & RE gives a performance boost of 3% (from 44.1 to 47.1). We also analyze the impact of joint inference on the individual components of EL & RE.

We first evaluate the EL component using the gold entity annotations on the development set. As shown in Table 2, for 79.8% of the questions, our entity linker can correctly find the gold standard topic entities. The joint inference improves this result to 83.2%, a 3.4% improvement. Next we use the surrogate gold relations to evaluate the performance of the RE component on the development set. As shown in Table 2, the relation prediction accuracy increases by 9.4% (from 45.9% to 55.3%) when using the joint inference.

5.3.2 Impact of the Syntactic and the Sentential Channels

Table 3 presents the results on the impact of individual and joint channels on the end QA performance. When using a single-channel network, we tune the parameters of only one channel while switching off the other channel. As seen, the sentential features are found to be more important than the syntactic features. We attribute this to the short and noisy nature of WebQuestions questions, due to which the syntactic parser parses wrongly or the shortest dependency path does not contain sufficient information to predict a relation. By using both channels, we see further improvements over using either channel alone.

5.3.3 Impact of the Inference on Unstructured Data

As shown in Table 1, when structured inference is augmented with unstructured inference, we see an improvement of 2.9% (from 44.1% to 47.0%). And when Structured + Joint uses unstructured inference, the performance increases by 6.2% (from 47.1% to 53.3%), achieving a new state-of-the-art result. For the latter, we manually analyzed the cases in which unstructured inference helps. Table 4 lists some of these questions and the corresponding answers before and after the unstructured inference. We observed that the unstructured inference mainly helps for two classes of questions: (1) questions involving aggregation operations (Questions 1-3); (2) questions involving sub-lexical compositionality (Questions 4-5). Questions 1 and 2 contain the predicate largest, an aggregation operator. A semantic parsing method should explicitly handle this predicate to trigger the max(·) operator. For Question 3, structured inference predicts the Freebase relation fb:teams..from, retrieving all the years in which Ray Allen has played basketball. Note that Ray Allen joined Connecticut University's team in 1993 and the NBA in 1996. To answer this question, a semantic parsing system would require a min(·) operator along with an additional constraint that the year corresponds to the NBA's term. Interestingly, without having to explicitly model these complex predicates, the unstructured inference helps in answering these questions more accurately. Questions 4 and 5 involve the sub-lexical compositionality (Wang et al., 2015) predicates father and college.

For example, in Question 5, the user queries for the colleges that John Steinbeck attended. However, Freebase defines the relation fb:education..institution to describe a person's educational information without discriminating between specific periods such as high school or college. Inference using unstructured data helps in alleviating these representational issues.

5.3.4 Error Analysis

We analyze the errors of the Structured + Joint + Unstructured model. Around 15% of the errors are caused by incorrect entity linking, and around 50% of the errors are due to incorrect relation predictions. The errors in relation extraction are due to (i) insufficient context, e.g., in what is duncan bannatyne, neither the dependency path nor the sentential context provides enough evidence for the MCCNN model; (ii) the unbalanced distribution of relations (3,022 training examples for 461 relations), which heavily biases the MCCNN model towards frequently seen relations. The remaining errors are failures of the unstructured inference due to insufficient evidence in Wikipedia or misclassification.

Entity Linking. In the entity linking component, we handcrafted POS tag patterns to identify entity mentions, e.g., DT-JJ-NN (noun phrase), NN-IN-NN (prepositional phrase). These patterns are designed to have high recall. Around 80% of entity linking errors are due to incorrect entity prediction even when the correct mention span was found.

Question Decomposition. Around 136 questions (15%) of the dev data are compositional, leading to 292 sub-questions (around 2.1 sub-questions per compositional question). Since our question decomposition component is based on manual rules, one question of interest is how these rules perform on other datasets. By human evaluation, we found these rules achieve 95% on a more general but complex QA dataset, QALD-5⁶.

5.3.5 Limitations

While our unstructured inference alleviates representational issues to some extent, we still fail at modeling compositional questions such as who is the mother of the father of prince william, which involve multi-hop relations, inter alia. Our current assumption that unstructured data could provide evidence for questions may work only for frequently typed queries or for popular domains like movies, politics and geography. We note these limitations and hope our results will foster further research in this area.

6 Related Work

Over time, the QA task has evolved into two main streams: QA on unstructured data, and QA on structured data. The TREC QA evaluations (Voorhees and Tice, 1999) were a major boost to unstructured QA, leading to richer datasets and sophisticated methods (Wang et al., 2007; Heilman and Smith, 2010; Yao et al., 2013; Yih et al., 2013; Yu et al., 2014; Yang et al., 2015; Hermann et al., 2015). While initial progress on structured QA started with small toy domains like GeoQuery (Zelle and Mooney, 1996), recent focus has shifted to large scale structured KBs like Freebase and DBpedia (Unger et al., 2012; Cai and Yates, 2013; Berant et al., 2013; Kwiatkowski et al., 2013; Xu et al., 2014), and to noisy KBs (Banko et al., 2007; Carlson et al., 2010; Krishnamurthy and Mitchell, 2012; Fader et al., 2013; Parikh et al., 2015). An exciting development in structured QA is to exploit multiple KBs (with different schemas) at the same time to answer questions jointly (Yahya et al., 2012; Fader et al., 2014; Zhang et al., 2016). QALD tasks and initiatives are contributing to this trend. Our model combines the best of both worlds by inferring over structured and unstructured data. Though earlier methods exploited unstructured data for KB-QA (Krishnamurthy and Mitchell, 2012; Berant et al., 2013; Yao and Van Durme, 2014; Reddy et al., 2014; Yih et al., 2015), these methods do not rely on unstructured data at test time. Our work is closely related to Joshi et al. (2014), who aim to answer noisy telegraphic queries using both structured and unstructured data. Their work is limited to answering single relation queries. Our work also has similarities to Sun et al. (2015), who do question answering on unstructured data but enrich it with Freebase, a reversal of our pipeline. Other lines of very recent related work include Yahya et al. (2016) and Savenkov and Agichtein (2016).

Our work also intersects with relation extraction methods.

⁶http://qald.sebastianwalter.org/index.php?q=5

While these methods aim to predict a relation between two entities in order to populate KBs (Mintz et al., 2009; Hoffmann et al., 2011; Riedel et al., 2013), we work with sentence-level relation extraction for question answering. Krishnamurthy and Mitchell (2012) and Fader et al. (2014) adopt open relation extraction methods for QA, but they require a hand-coded grammar for parsing queries. Closest to our extraction method are Yao and Van Durme (2014) and Yao (2015), who also use sentence-level relation extraction for QA. Unlike them, we can predict multiple relations per question, and our MCCNN architecture is more robust to unseen contexts compared to their logistic regression models.

Dong et al. (2015) were the first to use MCCNN for question answering. Yet our approach is very different in spirit to theirs. Dong et al. aim to maximize the similarity between the distributed representation of a question and its answer entities, whereas our network aims to predict Freebase relations. Our search space is several times smaller than theirs since we do not require potential answer entities beforehand (the number of relations is much smaller than the number of entities in Freebase). In addition, our method can explicitly handle compositional questions involving multiple relations, whereas Dong et al. learn latent representations of relation joins, which are difficult to comprehend. Moreover, we outperform their method by 7 points even without unstructured inference.

7 Conclusion and Future Work

We have presented a method that infers over both structured and unstructured data to answer natural language questions. Our experiments reveal that unstructured inference helps in mitigating representational issues in structured inference. We have also introduced a relation extraction method using MCCNN which is capable of exploiting syntax in addition to sentential features. Our main model, which uses joint entity linking and relation extraction along with unstructured inference, achieves state-of-the-art results on the WebQuestions dataset. A potential application of our method is to improve KB question answering using the documents retrieved by a search engine.

Since we pipeline structured inference first and then unstructured inference, our method is limited by the coverage of Freebase. Our future work involves exploring other alternatives, such as treating structured and unstructured data as two independent resources, in order to overcome the knowledge gaps in either of the two resources.

Acknowledgments

We would like to thank Weiwei Sun, Liwei Chen, and the anonymous reviewers for their helpful feedback. This work is supported by the National High Technology R&D Program of China (Grant No. 2015AA015403, 2014AA015102), the Natural Science Foundation of China (Grant No. 61202233, 61272344, 61370055) and the joint project with IBM Research. For any correspondence, please contact Yansong Feng.

Appendix

The syntax-based patterns for question decomposition are shown in Figure 3. The first four patterns are designed to extract sub-questions from simple questions, while the latter two are designed for complex questions involving clauses.

Figure 3: Syntax-based patterns (a)-(f) for question decomposition.

References

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary G. Ives. 2007. DBpedia: A nucleus for a web of open data. In ISWC/ASWC.

Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. 2007. Open information extraction for the web. In IJCAI.

Junwei Bao, Nan Duan, Ming Zhou, and Tiejun Zhao. 2014. Knowledge-based question answering as machine translation. In ACL.

Hannah Bast and Elmar Haussmann. 2015. More accurate question answering on Freebase. In CIKM.

Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In ACL.

Jonathan Berant and Percy Liang. 2015. Imitation learning of agenda-based semantic parsers. Transactions of the Association for Computational Linguistics, 3:545–558.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In EMNLP.

Kurt D. Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD.

Antoine Bordes, Sumit Chopra, and Jason Weston. 2014. Question answering with subgraph embeddings. In EMNLP.

Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. CoRR, abs/1506.02075.

Qingqing Cai and Alexander Yates. 2013. Large-scale semantic parsing via schema matching and lexicon extension. In ACL.

Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell. 2010. Toward an architecture for never-ending language learning. In AAAI.

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM TIST, 2(3):27.

Xiao Cheng and Dan Roth. 2013. Relational inference for wikification. In ACL.

Li Dong, Furu Wei, Ming Zhou, and Ke Xu. 2015. Question answering over Freebase with multi-column convolutional neural networks. In ACL-IJCNLP.

John C. Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159.

Anthony Fader, Luke S. Zettlemoyer, and Oren Etzioni. 2013. Paraphrase-driven learning for open question answering. In ACL.

Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. 2014. Open question answering over curated and extracted knowledge bases. In SIGKDD.

Michael Heilman and Noah A. Smith. 2010. Tree edit models for recognizing textual entailments, paraphrases, and answers to questions. In NAACL.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems.

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In ACL.

Thorsten Joachims. 2006. Training linear SVMs in linear time. In SIGKDD.

Mandar Joshi, Uma Sawant, and Soumen Chakrabarti. 2014. Knowledge graph and corpus driven segmentation and answer inference for telegraphic entity-seeking queries. In EMNLP.

Jayant Krishnamurthy and Tom M. Mitchell. 2012. Weakly supervised training of semantic parsers. In EMNLP-CoNLL.

Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke S. Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In EMNLP.

Yang Liu, Furu Wei, Sujian Li, Heng Ji, Ming Zhou, and Houfeng Wang. 2015. A dependency-based neural network for relation classification. In ACL.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL System Demonstrations.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In ACL.

Ankur P. Parikh, Hoifung Poon, and Kristina Toutanova. 2015. Grounded semantic parsing for complex knowledge extraction. In NAACL.

Siva Reddy, Mirella Lapata, and Mark Steedman. 2014. Large-scale semantic parsing without question-answer pairs. Transactions of the Association for Computational Linguistics, pages 377–392.

Siva Reddy, Oscar Täckström, Michael Collins, Tom Kwiatkowski, Dipanjan Das, Mark Steedman, and Mirella Lapata. 2016. Transforming dependency structures to logical forms for semantic parsing. Transactions of the Association for Computational Linguistics, 4.

Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In NAACL.

Denis Savenkov and Eugene Agichtein. 2016. When a knowledge base is not enough: Question answering over knowledge bases with external text data. In SIGIR.

Iulian Vlad Serban, Alberto García-Durán, Çağlar Gülçehre, Sungjin Ahn, Sarath Chandar, Aaron C. Courville, and Yoshua Bengio. 2016. Generating factoid questions with recurrent neural networks: The 30M factoid question-answer corpus. In ACL.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: a core of semantic knowledge. In WWW.

Huan Sun, Hao Ma, Wen-tau Yih, Chen-Tse Tsai, Jingjing Liu, and Ming-Wei Chang. 2015. Open domain question answering via semantic enrichment. In WWW.

Joseph P. Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In ACL.

Christina Unger, Lorenz Bühmann, Jens Lehmann, Axel-Cyrille Ngonga Ngomo, Daniel Gerber, and Philipp Cimiano. 2012. Template-based question answering over RDF data. In WWW.

Ellen M. Voorhees and Dawn M. Tice. 1999. The TREC-8 question answering track report. In TREC.

Mengqiu Wang, Noah A. Smith, and Teruko Mitamura. 2007. What is the Jeopardy model? A quasi-synchronous grammar for QA. In EMNLP-CoNLL.

Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a semantic parser overnight. In ACL.

Kun Xu, Sheng Zhang, Yansong Feng, and Dongyan Zhao. 2014. Answering natural language questions via phrasal semantic parsing. In NLPCC, pages 333–344.

Kun Xu, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2015. Semantic relation classification via convolutional neural networks with simple negative sampling. In EMNLP.

Mohamed Yahya, Klaus Berberich, Shady Elbassuoni, Maya Ramanath, Volker Tresp, and Gerhard Weikum. 2012. Natural language questions for the web of data. In EMNLP.

Mohamed Yahya, Denilson Barbosa, Klaus Berberich, Qiuyue Wang, and Gerhard Weikum. 2016. Relationship queries on extended knowledge graphs. In WSDM.

Yi Yang and Ming-Wei Chang. 2015. S-MART: Novel tree-based structured learning algorithms applied to tweet entity linking. In ACL-IJCNLP.

Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In EMNLP.

Xuchen Yao and Benjamin Van Durme. 2014. Information extraction over structured data: Question answering with Freebase. In ACL.

Xuchen Yao, Benjamin Van Durme, and Peter Clark. 2013. Answer extraction as sequence tagging with tree edit distance. In NAACL.

Xuchen Yao. 2015. Lean question answering over Freebase from scratch. In NAACL.

Wen-tau Yih, Ming-Wei Chang, Christopher Meek, and Andrzej Pastusiak. 2013. Question answering using enhanced lexical semantic models. In ACL.

Wen-tau Yih, Xiaodong He, and Christopher Meek. 2014. Semantic parsing for single-relation question answering. In ACL.

Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In ACL-IJCNLP.

Lei Yu, Karl Moritz Hermann, Phil Blunsom, and Stephen Pulman. 2014. Deep learning for answer sentence selection. arXiv preprint arXiv:1412.1632.

John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In AAAI.

Yuanzhe Zhang, Shizhu He, Kang Liu, and Jun Zhao. 2016. A joint model for question answering over multiple knowledge bases. In AAAI.
