Freebase QA: Information Extraction or Semantic Parsing?

Xuchen Yao 1   Jonathan Berant 3   Benjamin Van Durme 1,2
1 Center for Language and Speech Processing, Johns Hopkins University
2 Human Language Technology Center of Excellence, Johns Hopkins University
3 Computer Science Department, Stanford University

Abstract

We contrast two seemingly distinct approaches to the task of question answering (QA) using Freebase: one based on information extraction techniques, the other on semantic parsing. Results over the same test set were collected from two state-of-the-art, open-source systems, then analyzed in consultation with those systems' creators. We conclude that the differences between these technologies, both in task performance and in how they get there, are not significant. This suggests that the semantic parsing community should target answering more compositional open-domain questions that are beyond the reach of more direct information extraction methods.

1 Introduction

Question Answering (QA) from structured data, such as DBPedia (Auer et al., 2007), Freebase (Bollacker et al., 2008) and Yago2 (Hoffart et al., 2011), has drawn significant interest from both knowledge base (KB) and semantic parsing (SP) researchers. The majority of such work treats the KB as a database, to which standard database queries (SPARQL, MySQL, etc.) are issued to retrieve answers. Language understanding is modeled as the task of converting natural language questions into queries through intermediate logical forms, with two popular approaches: CCG parsing (Zettlemoyer and Collins, 2005; Zettlemoyer and Collins, 2007; Zettlemoyer and Collins, 2009; Kwiatkowski et al., 2010; Kwiatkowski et al., 2011; Krishnamurthy and Mitchell, 2012; Kwiatkowski et al., 2013; Cai and Yates, 2013a), and dependency-based compositional semantics (Liang et al., 2011; Berant et al., 2013; Berant and Liang, 2014).

We characterize semantic parsing as the task of deriving a representation of meaning from language, sufficient for a given task. Traditional information extraction (IE) from text may be coarsely characterized as representing a certain level of semantic parsing, where the goal is to derive enough meaning in order to populate a database with factoids of a form matching a given schema.[1] Given the ease with which reasonably accurate, deep syntactic structure can be automatically derived over (English) text, it is not surprising that IE researchers would start including such "features" in their models.

Our question is then: what is the difference between an IE system with access to syntax, as compared to a semantic parser, when both are targeting a factoid-extraction style task? While our conclusions should hold generally for similar KBs, we will focus on Freebase, such as explored by Krishnamurthy and Mitchell (2012), and then others such as Cai and Yates (2013a) and Berant et al. (2013). We compare two open-source, state-of-the-art systems on the task of Freebase QA: the semantic parsing system SEMPRE (Berant et al., 2013), and the IE system jacana-freebase (Yao and Van Durme, 2014).

We find that these two systems are on par with each other, with no significant differences in terms of accuracy between them. A major distinction between the work of Berant et al. (2013) and Yao and Van Durme (2014) is the ability of the former to represent, and compose, aggregation operators (such as argmax, or count), as well as to integrate disparate pieces of information. This representational capability was important in previous, closed-domain tasks such as GeoQuery. The move to Freebase by the SP community was meant to provide richer, open-domain challenges. While the vocabulary increased, our analysis suggests that compositionality and complexity decreased. We therefore conclude that the semantic parsing community should target more challenging open-domain datasets, ones that "standard IE" methods are less capable of attacking.

[1] So-called Open Information Extraction (OIE) is simply a further blurring of the distinction between IE and SP, where the schema is allowed to grow with the number of verbs, and other predicative elements of the language.

2 IE and SP Systems

jacana-freebase[2] (Yao and Van Durme, 2014) treats QA from a KB as a binary classification problem. Freebase is a gigantic graph with millions of nodes (topics) and billions of edges (relations). For each question, jacana-freebase first selects a "view" of Freebase concerning only the involved topics and their close neighbors (this "view" is called a topic graph). For instance, for the question "who is the brother of justin bieber?", the topic graph of Justin Bieber, containing all nodes related to the topic (think of the "Justin Bieber" page displayed in a browser), is selected and retrieved through the Freebase Topic API. Usually such a topic graph contains hundreds to thousands of nodes in close relation to the central topic. Each of these nodes is then judged as an answer or not by a logistic regression learner.

Features for the logistic regression learner are first extracted from both the question and the topic graph. An analysis of the dependency parse of the question characterizes the question word, topic, verb, and named entities of the main subject as the question features, such as qword=who. Features on each node include the types of relations and properties the node possesses, such as type=person. Finally, features from the question and each node are combined into the final features used by the learner, such as qword=who|type=person. In this way the association between the question type and the answer type is enforced. Thus during decoding, for instance, given a who question, nodes with a person property are ranked higher as answer candidates.
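To make the combined feature scheme concrete, the following is a minimal sketch (in Python, with scikit-learn) of how conjoined question/node features could drive a logistic regression answer classifier over topic-graph nodes. It is not the actual jacana-freebase implementation; the feature names and toy examples are illustrative assumptions.

```python
# Illustrative sketch of the conjoined question/node feature scheme described
# above; not the actual jacana-freebase code. Feature names are hypothetical.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def combined_features(question_feats, node_feats):
    """Conjoin every question feature with every node feature,
    e.g. 'qword=who|type=person', so the learner can tie the
    question type to the expected answer type."""
    return {"%s|%s" % (q, n): 1.0 for q in question_feats for n in node_feats}

# Toy training data: one topic-graph node per example, labelled as
# answer (1) or non-answer (0) for its question.
examples = [
    (["qword=who", "qverb=be", "qtopic=justin_bieber"],
     ["type=person", "rel=sibling"], 1),
    (["qword=who", "qverb=be", "qtopic=justin_bieber"],
     ["type=album", "rel=release"], 0),
    (["qword=when", "qverb=die", "qtopic=whitney_houston"],
     ["type=datetime", "rel=date_of_death"], 1),
    (["qword=when", "qverb=die", "qtopic=whitney_houston"],
     ["type=location", "rel=place_of_birth"], 0),
]

vec = DictVectorizer()
X = vec.fit_transform([combined_features(q, n) for q, n, _ in examples])
y = [label for _, _, label in examples]
clf = LogisticRegression().fit(X, y)

# At decoding time, every node in the question's topic graph is scored;
# nodes whose properties match the question type receive higher probability.
candidate = combined_features(["qword=who", "qverb=be", "qtopic=justin_bieber"],
                              ["type=person", "rel=sibling"])
print(clf.predict_proba(vec.transform([candidate]))[0, 1])
```

Conjoining question and node features is what lets a single linear model learn, for example, that who questions favor person-typed nodes.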
SEMPRE[3] is an open-source system for training semantic parsers, which has been used to train a semantic parser against Freebase by Berant et al. (2013). SEMPRE maps NL utterances to logical forms by performing bottom-up parsing. First, a lexicon is used to map NL phrases to KB predicates, and then predicates are combined into a full logical form by a context-free grammar. Since logical forms can be derived in multiple ways from the grammar, a log-linear model is used to rank possible derivations. The parameters of the model are trained from question-answer pairs.

[2] https://code.google.com/p/jacana/
[3] http://www-nlp.stanford.edu/software/sempre/

3 Analysis

3.1 Evaluation Metrics

Both Berant et al. (2013) and Yao and Van Durme (2014) tested their systems on the WEBQUESTIONS dataset, which contains 3778 training questions and 2032 test questions collected from the Google Suggest API. Each question comes with a standard answer from Freebase annotated by Amazon Mechanical Turk. Berant et al. (2013) reported a score of 31.4% in terms of accuracy (with partial credit for inexact matches) on the test set, later revised to 35.7% in Berant and Liang (2014). Berant et al. focused on accuracy – how many questions were correctly answered by the system. Since their system answered almost all questions, accuracy is roughly identical to F1. The system of Yao and Van Durme (2014), on the other hand, only answered 80% of all test questions; they therefore report a score of 42% in terms of F1 on this dataset. For the purpose of comparing on all test questions, we lowered the logistic regression prediction threshold (usually 0.5) on jacana-freebase for the other 20% of questions where jacana-freebase had not proposed an answer, and selected the prediction with the highest score as the answer. In this way jacana-freebase was able to answer all questions, with a lower accuracy of 35.4%. In the following we present analysis results based on the test questions, where the two systems had very similar performance (35.7% vs. 35.4%).[4] The difference is not significant according to the paired permutation test (Smucker et al., 2007).

[4] In this setting accuracy equals macro-averaged F1: the F1 value on each question is computed first, then averaged over all questions; in other words, "accuracy with partial credit". In this section we use the terms "accuracy" and "F1" interchangeably.
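To make the metric and the significance test concrete, here is a small sketch of partial-credit scoring (per-question F1, macro-averaged) over answer sets, together with a paired permutation test on the per-question scores. It is an assumption-laden illustration of the procedure described above, not the actual evaluation scripts of either system.

```python
import random

def answer_f1(predicted, gold):
    """Partial-credit F1 between a predicted answer set and the gold answer set."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    p, r = correct / len(predicted), correct / len(gold)
    return 2 * p * r / (p + r)

def macro_f1(system_answers, gold_answers):
    """Average the per-question F1 values: "accuracy with partial credit"."""
    scores = [answer_f1(p, g) for p, g in zip(system_answers, gold_answers)]
    return sum(scores) / len(scores)

def paired_permutation_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired permutation (randomization) test on per-question score
    differences, in the spirit of Smucker et al. (2007)."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(trials):
        # Randomly flip which system each question's score is credited to.
        signed = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(signed) >= observed:
            hits += 1
    return hits / trials  # p-value
```

In this setup, macro_f1 plays the role of the accuracy/F1 figures quoted above, and paired_permutation_test is one way to check whether a gap such as 35.7% vs. 35.4% is significant.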

3.2 Accuracy vs. Coverage

First, we were interested to see what proportions of questions SEMPRE and jacana-freebase answered correctly, jointly and separately. The answer to many questions in the dataset is a set of answers, for example "what to see near sedona arizona?". Since Turkers did not exhaustively pick out all possible answers, evaluation is performed by computing the F1 between the set of answers given by the system and the answers provided by Turkers. With a strict threshold of F1 = 1 and a permissive threshold of F1 ≥ 0.5 to judge correctness, we list the pair-wise correctness matrix in Table 1. Not surprisingly, both systems got most questions wrong, given that the average F1 was only around 35%. With the threshold F1 = 1, SEMPRE answered more questions exactly correctly than jacana-freebase, while with F1 ≥ 0.5 it was the other way around. This shows that SEMPRE is more accurate on certain questions. The reason is that SEMPRE always fires queries that return exactly one set of answers from Freebase, while jacana-freebase can potentially tag multiple nodes as the answer, which may lower its accuracy.

                    jacana (F1 = 1)              jacana (F1 ≥ 0.5)
                    √             ×              √             ×
  SEMPRE √          153 (0.08)    383 (0.19)     429 (0.21)    321 (0.16)
  SEMPRE ×          136 (0.07)    1360 (0.67)    366 (0.18)    916 (0.45)

Table 1: The number and proportion of questions SEMPRE and jacana-freebase answered correctly (√) and incorrectly (×), jointly and separately, under correctness thresholds of F1 = 1 and F1 ≥ 0.5.

We have shown that each system can be more accurate on certain questions, but when? Is there a correlation between system confidence and accuracy? We took the logistic decoding score (between 0 and 1) from jacana-freebase and the probability from the log-linear model used by SEMPRE as confidence, and plotted an "accuracy vs. coverage" curve, which shows the accuracy of a QA engine with respect to its coverage of all questions. The curve answers one question: at a fixed accuracy, what proportion of questions can be answered? A better system should be able to answer more questions correctly at the same accuracy.

The curve was drawn in the following way. For each question, we select the answer candidate with the highest confidence score. For the whole test set this gives a list of (question, highest-ranked answer, confidence score) tuples. Running a threshold from 1 to 0, we select those questions whose answer confidence score is above the threshold and compute accuracy at that point. The X-axis indicates the percentage of questions above the threshold and the Y-axis the accuracy, as shown in Figure 1.
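The following is a minimal sketch of that curve-drawing procedure, reusing the answer_f1 helper from the evaluation sketch above; the (answers, confidence) input format is an assumption about how a system's output might be stored, not the plotting code behind Figure 1.

```python
def accuracy_coverage_curve(predictions, gold_answers):
    """predictions: one (best_answer_set, confidence) pair per question.
    gold_answers: the gold answer set for each question.
    Returns (percent_answered, accuracy) points obtained by sweeping the
    confidence threshold from high to low."""
    scored = sorted(
        ((conf, answer_f1(answers, gold))
         for (answers, conf), gold in zip(predictions, gold_answers)),
        key=lambda pair: -pair[0])              # most confident first
    points, running_f1 = [], 0.0
    for i, (_, score) in enumerate(scored, start=1):
        running_f1 += score                     # accumulate partial-credit F1
        points.append((100.0 * i / len(scored),     # percent of questions answered
                       100.0 * running_f1 / i))     # accuracy among those answered
    return points
```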

[Figure 1: Precision with respect to the proportion of questions answered. Curves for jacana-freebase and SEMPRE; X-axis: percent answered, Y-axis: accuracy.]

The two curves generally follow a similar trend, but while jacana-freebase has higher accuracy when coverage is low, SEMPRE obtains slightly better accuracy when more questions are answered.

3.3 Accuracy by Question Length and Type

Do the accuracies of the two systems differ with respect to the complexity of questions? Since there is no clear way to measure question complexity, we use question length as a surrogate and report accuracies by question length in Figure 2. Most of the questions were 5 to 8 words long, and for these there was no substantial difference in accuracy. The major differences lie in questions of length 3, 12 and 13; however, the number of such questions was not high enough to show any statistical significance.

Figure 3 further shows the accuracies with respect to question type (as reflected by the WH-word). Again, there is no significant difference between the two systems.

3.4 Learned Features

What did the systems learn during training? We compare them by presenting the top features by weight, as listed in Table 2. Clearly, the type of knowledge captured by these features is similar: both systems learn to associate certain phrases with predicates from the KB.
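For a linear classifier such as the logistic regression sketch after the jacana-freebase description in Section 2, a table like Table 2(a) can simply be read off the learned weight vector. A minimal sketch, assuming the DictVectorizer/LogisticRegression pair (vec, clf) from that earlier example and scikit-learn ≥ 1.0:

```python
def top_features(clf, vectorizer, k=5):
    """Pair each feature name with its learned weight and return the k
    highest-weighted ones (the kind of entries shown in Table 2a)."""
    names = vectorizer.get_feature_names_out()   # requires scikit-learn >= 1.0
    weights = clf.coef_[0]                       # binary model: one weight row
    ranked = sorted(zip(names, weights), key=lambda nw: nw[1], reverse=True)
    return ranked[:k]

for name, weight in top_features(clf, vec):
    print("%-40s %6.2f" % (name, weight))
```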

[Figure 2: Accuracy (Y-axis) by question length. The X-axis specifies the question length in words and the total number of questions in parentheses.]

[Figure 3: Accuracy (Y-axis) by question type, with the number of questions: what (929), where (357), who (261), which (35), when (100), how (8).]

(a) jacana-freebase
  feature                               weight
  qfocus=religion|type=Religion           8.60
  qfocus=money|type=Currency              5.56
  qverb=die|type=CauseOfDeath             5.35
  qword=when|type=datetime                5.11
  qverb=border|rel=location.adjoins       4.56

(b) SEMPRE
  feature                               weight
  die from=CauseOfDeath                  10.23
  die of=CauseOfDeath                     7.55
  accept=Currency                         7.30
  bear=PlaceOfBirth                       7.11
  in switzerland=Switzerland              6.86

Table 2: Learned top features and their weights for jacana-freebase and SEMPRE.

We note, however, that SEMPRE also obtains information from the fully constructed logical form. For instance, SEMPRE learns that logical forms that return an empty set when executed against the KB are usually incorrect (the weight for this feature is -8.88). In this respect the SP approach "understands" more than the IE approach.

We did not further compare on other datasets such as GeoQuery (Tang and Mooney, 2001) and FREE917 (Cai and Yates, 2013b). The first involves geographic inference and multiple constraints in queries, directly fitting the compositional nature of semantic parsing. The second was manually generated by looking at Freebase topics. Both datasets are less realistic than the WEBQUESTIONS dataset, and both are also less challenging (accuracy/F1 between 80% and 90%) than WEBQUESTIONS (around 40%).

4 Discussion and Conclusion

Our analysis of two QA approaches, semantic parsing and information extraction, has shown no significant difference between them. Note the similarity between the features used in the two systems, shown in Table 2: the systems learned the same "knowledge" from the data, with the distinction that the IE approach acquired it through a direct association between dependency parses and answer properties, while the SP approach acquired it by optimizing over intermediate logical forms.

With a direct information extraction technology easily getting on par with the more sophisticated semantic parsing method, we conclude that SP-based approaches to QA with Freebase have not yet shown their power from a "deeper" understanding of the questions, across questions of various lengths. We suggest that more compositional open-domain datasets should be created, and that SP researchers should focus on utterances in existing datasets that are beyond the reach of direct IE methods.

5 Acknowledgements

We thank the Allen Institute for Artificial Intelligence for assistance in funding this work. This material is partially based on research sponsored by the NSF under grant IIS-1249516 and by DARPA under agreement numbers FA8750-13-2-0017 and FA8750-13-2-0040 (the DEFT program).

References

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBPedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735. Springer.

Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In Proceedings of ACL.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of EMNLP.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1247–1250. ACM.

Qingqing Cai and Alexander Yates. 2013a. Large-scale semantic parsing via schema matching and lexicon extension. In Proceedings of ACL.

Qingqing Cai and Alexander Yates. 2013b. Large-scale semantic parsing via schema matching and lexicon extension. In Proceedings of ACL.

Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, Edwin Lewis-Kelham, Gerard De Melo, and Gerhard Weikum. 2011. Yago2: exploring and querying world knowledge in time, space, context, and many languages. In Proceedings of the 20th international conference companion on World Wide Web, pages 229–232. ACM.

Jayant Krishnamurthy and Tom Mitchell. 2012. Weakly supervised training of semantic parsers. In Proceedings of EMNLP.

Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2010. Inducing probabilistic CCG grammars from logical form with higher-order unification. In Proceedings of EMNLP, pages 1223–1233.

Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2011. Lexical generalization in CCG grammar induction for semantic parsing. In Proceedings of EMNLP.

Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In Proceedings of EMNLP.

Percy Liang, Michael I. Jordan, and Dan Klein. 2011. Learning dependency-based compositional semantics. In Proceedings of ACL.

Mark D. Smucker, James Allan, and Ben Carterette. 2007. A comparison of statistical significance tests for information retrieval evaluation. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, pages 623–632. ACM.

Lappoon R. Tang and Raymond J. Mooney. 2001. Using multiple clause constructors in inductive logic programming for semantic parsing. In Machine Learning: ECML 2001. Springer.

Xuchen Yao and Benjamin Van Durme. 2014. Information extraction over structured data: Question answering with Freebase. In Proceedings of ACL.

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Uncertainty in Artificial Intelligence (UAI).

Luke S. Zettlemoyer and Michael Collins. 2007. Online learning of relaxed CCG grammars for parsing to logical form. In Proceedings of EMNLP-CoNLL.

Luke S. Zettlemoyer and Michael Collins. 2009. Learning context-dependent mappings from sentences to logical form. In Proceedings of ACL-IJCNLP.
