Using Semantic Roles to Improve Question Answering

Dan Shen
Spoken Language Systems
Saarland University
Saarbruecken, Germany
[email protected]

Mirella Lapata
School of Informatics
University of Edinburgh
Edinburgh, UK
[email protected]

Abstract

Shallow semantic parsing, the automatic identification and labeling of sentential constituents, has recently received much attention. Our work examines whether semantic role information is beneficial to question answering. We introduce a general framework for answer extraction which exploits semantic role annotations in the FrameNet paradigm. We view semantic role assignment as an optimization problem in a bipartite graph and answer extraction as an instance of graph matching. Experimental results on the TREC datasets demonstrate improvements over state-of-the-art models.

1 Introduction

Recent years have witnessed significant progress in developing methods for the automatic identification and labeling of semantic roles conveyed by sentential constituents.¹ The success of these methods, often referred to collectively as shallow semantic parsing (Gildea and Jurafsky, 2002), is largely due to the availability of resources like FrameNet (Fillmore et al., 2003) and PropBank (Palmer et al., 2005), which document the surface realization of semantic roles in real world corpora.

¹The approaches are too numerous to list; we refer the interested reader to Carreras and Màrquez (2005) for an overview.

More concretely, in the FrameNet paradigm, the meaning of predicates (usually verbs, nouns, or adjectives) is conveyed by frames, schematic representations of situations. Semantic roles (or frame elements) are defined for each frame and correspond to salient entities present in the evoked situation. Predicates with similar semantics instantiate the same frame and are attested with the same roles. The FrameNet database lists the surface syntactic realizations of semantic roles, and provides annotated example sentences from the British National Corpus. For example, the frame Commerce Sell has three core semantic roles, namely Buyer, Goods, and Seller, each expressed by an indirect object, a direct object, and a subject, respectively (see sentences (1a)–(1c)). It can also be attested with non-core (peripheral) roles (e.g., Means, Manner, see (1d) and (1e)) that are more generic and can be instantiated in several frames, besides Commerce Sell. The verbs sell, vend, and retail can evoke this frame, but also the nouns sale and vendor.

(1) a. [Lee]Seller sold a textbook [to Abby]Buyer.
    b. [Kim]Seller sold [the sweater]Goods.
    c. [My company]Seller has sold [more than three million copies]Goods.
    d. [Abby]Seller sold [the car]Goods [for cash]Means.
    e. [He]Seller [reluctantly]Manner sold [his rock]Goods.

By abstracting over surface syntactic configurations, semantic roles offer an important first step towards deeper text understanding and hold promise for a range of applications requiring broad coverage semantic processing. Question answering (QA) is often cited as an obvious beneficiary of semantic role labeling (Gildea and Jurafsky, 2002; Palmer et al., 2005; Narayanan and Harabagiu, 2004). Faced with the question Q: What year did the U.S. buy Alaska? and the retrieved sentence S: ...before Russia sold Alaska to the United States in 1867, a hypothetical QA system must identify that United States is the Buyer despite the fact that it is attested in one instance as a subject and in another as an object. Once this information is known, isolating the correct answer (i.e., 1867) can be relatively straightforward.

Although conventional wisdom has it that semantic role labeling ought to improve answer extraction, surprisingly little work has been done to this effect (see Section 2 for details) and initial results have been mostly inconclusive or negative (Sun et al., 2005; Kaisser, 2006). There are at least two good reasons for these findings. First, shallow semantic parsers trained on declarative sentences will typically have poor performance on questions and generally on out-of-domain data. Second, existing resources do not have exhaustive coverage and recall will be compromised, especially if the question answering system is expected to retrieve answers from unrestricted text. Since FrameNet is still under development, its coverage tends to be more of a problem in comparison to other semantic role resources such as PropBank.

In this paper we propose an answer extraction model which effectively incorporates FrameNet-style semantic role information. We present an automatic method for semantic role assignment which is conceptually simple and does not require extensive feature engineering. A key feature of our approach is the comparison of dependency relation paths attested in the FrameNet annotations and raw text. We formalize the search for an optimal role assignment as an optimization problem in a bipartite graph. This formalization allows us to find an exact, globally optimal solution. The graph-theoretic framework goes some way towards addressing coverage problems related with FrameNet and allows us to formulate answer extraction as a graph matching problem. As a byproduct of our main investigation we also examine the issue of FrameNet coverage and show how much it impacts performance in a TREC-style question answering setting.

In the following section we provide an overview of existing work on question answering systems that exploit semantic role-based lexical resources. Then we define our learning task and introduce our approach to semantic role assignment and answer extraction in the context of QA. Next, we present our experimental framework and data. We conclude the paper by presenting and discussing our results.

2 Related Work

Question answering systems have traditionally depended on a variety of lexical resources to bridge surface differences between questions and potential answers. WordNet (Fellbaum, 1998) is perhaps the most popular resource and has been employed in a variety of QA-related tasks ranging from query expansion, to axiom-based reasoning (Moldovan et al., 2003), passage scoring (Paranjpe et al., 2003), and answer filtering (Leidner et al., 2004). Besides WordNet, recent QA systems increasingly rely on syntactic information as a means of abstracting over word order differences and structural alternations (e.g., passive vs. active voice). Most syntax-based QA systems (Wu et al., 2005) incorporate some means of comparison between the tree representing the question and the subtree surrounding the answer candidate. The assumption here is that appropriate answers are more likely to have syntactic relations in common with their corresponding question. Syntactic structure matching has been applied to passage retrieval (Cui et al., 2005) and answer extraction (Shen and Klakow, 2006).

Narayanan and Harabagiu (2004) were the first to stress the importance of semantic roles in answering complex questions. Their system identifies argument structures by merging semantic role information from PropBank and FrameNet. Expected answers are extracted by performing probabilistic inference over the predicate argument structures in conjunction with a domain specific topic model. Sun et al. (2005) incorporate semantic analysis in their TREC05 QA system. They use ASSERT (Pradhan et al., 2004), a publicly available shallow semantic parser trained on PropBank, to generate predicate-argument structures which subsequently form the basis of comparison between question and answer sentences. They find that semantic analysis does not boost performance due to the low recall of the semantic parser. Kaisser (2006) proposes a question paraphrasing method based on FrameNet. Questions are assigned semantic roles by matching their dependency relations with those attested in the FrameNet annotations. The assignments are used to create question reformulations which are submitted to Google for answer extraction. The semantic role assignment module is not probabilistic; it relies on strict matching and runs into severe coverage problems.

In line with previous work, our method exploits syntactic information in the form of dependency relation paths together with FrameNet-like semantic roles to smooth lexical and syntactic divergences between question and answer sentences. Our approach is less domain dependent and resource intensive than Narayanan and Harabagiu (2004), since it solely employs a dependency parser and the FrameNet database. In contrast to Kaisser (2006), we model the semantic role assignment and answer extraction tasks numerically, thereby alleviating the coverage problems encountered previously.

3 Problem Formulation

We briefly summarize the architecture of the QA system we are working with before formalizing the mechanics of our FrameNet-based answer extraction module. In common with previous work, our overall approach consists of three stages: (a) determining the expected answer type of the question, (b) retrieving passages likely to contain answers to the question, and (c) performing a match between the question words and retrieved passages in order to extract the answer. In this paper we focus on the last stage: question and answer sentences are normalized to a FrameNet-style representation and answers are retrieved by selecting the candidate whose semantic structure is most similar to the question.

The architecture of our answer extraction module is shown in Figure 1. Semantic structures for questions and sentences are automatically derived using the model described in Section 4 (Model I). A semantic structure SemStruc = ⟨p, Set(SRA)⟩ consists of a predicate p and a set of semantic role assignments Set(SRA). p is a word or phrase evoking a frame F of FrameNet. A semantic role assignment SRA is a ternary structure ⟨w, SR, s⟩, consisting of frame element w, its semantic role SR, and score s indicating to what degree SR qualifies as a label for w.

[Figure 1: Architecture of answer extraction. Model I derives SemStruc^q from the question Q and SemStruc^ac_i from sentences containing answer candidates; Model II compares the structures and selects the Answer.]

For a question q, we generate a semantic structure SemStruc^q. Question words, such as what, who, when, etc., are considered expected answer phrases (EAPs). We require that EAPs are frame elements of SemStruc^q. Likely answer candidates are extracted from answer sentences following some preprocessing steps detailed in Section 6. For each candidate ac, we derive its semantic structure SemStruc^ac and assume that ac is a frame element of SemStruc^ac. Question and answer semantic structures are compared using a model based on graph matching detailed in Section 5 (Model II). We calculate the similarity of all derived pairs ⟨SemStruc^q, SemStruc^ac⟩ and select the candidate with the highest value as an answer for the question.
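A minimal sketch of the data structures and selection step just described; the class layout and function names are illustrative rather than taken from the paper's implementation, and the frame name and scores in the toy usage are invented for illustration (they echo the prions example discussed later).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class SemStruc:
    """A semantic structure <p, Set(SRA)>: a predicate p evoking a frame,
    plus scored role assignments (frame element w, semantic role SR, score s)."""
    predicate: str
    frame: str
    assignments: List[Tuple[str, str, float]]  # (w, SR, s) triples

def select_answer(q_struc: SemStruc,
                  candidates: List[Tuple[str, SemStruc]],
                  similarity: Callable[[SemStruc, SemStruc], float]) -> str:
    """Model II in one line: score every <SemStruc_q, SemStruc_ac> pair and
    return the candidate whose structure is most similar to the question's."""
    return max(candidates, key=lambda pair: similarity(q_struc, pair[1]))[0]

# Toy usage (frame name and scores are made up):
q = SemStruc("discover", "SomeFrame",
             [("EAP", "Cognizer", 0.6), ("prions", "Phenomenon", 0.8)])
ac = SemStruc("discovery", "SomeFrame",
              [("Stanley B. Prusiner", "Cognizer", 0.7), ("prions", "Phenomenon", 0.9)])
shared_roles = lambda a, b: len({r for _, r, _ in a.assignments} &
                                {r for _, r, _ in b.assignments})
print(select_answer(q, [("Stanley B. Prusiner", ac)], shared_roles))
```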

4 Semantic Structure Generation

Our method crucially exploits the annotated sentences in the FrameNet database together with the output of a dependency parser. Our guiding assumption is that sentences that share dependency relations will also share semantic roles as long as they evoke the same or related frames. This is motivated by much research in lexical semantics (e.g., Levin (1993)) hypothesizing that the behavior of words, particularly with respect to the expression and interpretation of their arguments, is to a large extent determined by their meaning. We first describe how predicates are identified and then introduce our model for semantic role labeling.

Predicate Identification  Predicate candidates are identified using a simple look-up procedure which compares POS-tagged tokens against FrameNet entries. For efficiency reasons, we make the simplifying assumption that questions have only one predicate, which we select heuristically: (1) verbs are preferred to other parts of speech, (2) if there is more than one verb in the question, preference is given to the verb with the highest level of embedding in the dependency tree, (3) if no verbs are present, a noun is chosen. For example, in Q: Who beat Floyd Patterson to take the title away?, beat, take away, and title are identified as predicate candidates and beat is selected as the main predicate of the question. For answer sentences, we require that the predicate is either identical or semantically related to the question predicate (see Section 5).

In the example given above, the predicate beat evokes a single frame (i.e., Cause harm). However, predicates often have multiple meanings, thus evoking more than one frame. Knowing which is the appropriate frame for a given predicate impacts the semantic role assignment task; selecting the wrong frame will unavoidably result in erroneous semantic roles. Rather than disambiguating polysemous predicates prior to semantic role assignment, we perform the assignment for each frame evoked by the predicate.

Semantic Role Assignment  Before describing our approach to semantic role labeling we define dependency relation paths. A relation path R is a relation sequence ⟨r1, r2, ..., rL⟩, in which rl (l = 1, 2, ..., L) is one of the predefined dependency relations with a suffix indicating traverse direction. An example of a relation path is R = ⟨subj_U, obj_D⟩, where the subscripts U and D indicate upward and downward movement in trees, respectively. Given an unannotated sentence whose roles we wish to label, we assume that words or phrases w with a dependency path connecting them to p are frame elements. Each frame element is represented by an unlabeled dependency path R_w which we extract by traversing the dependency tree from w to p. Analogously, we extract from the FrameNet annotations all dependency paths R_SR that are labeled with semantic role information and correspond to p. We next measure the compatibility of labeled and unlabeled paths as follows:

(2)  s(w, SR) = \max_{R_{SR} \in M} \left[ sim(R_w, R_{SR}) \cdot P(R_{SR}) \right]

where M is the set of dependency relation paths for SR in FrameNet, and sim(R_w, R_SR) is the similarity between paths R_w and R_SR weighted by the relative frequency of R_SR in FrameNet (P(R_SR)). We consider both core and non-core semantic roles instantiated by frames with at least one annotation in FrameNet. Core roles tend to have more annotations in FrameNet and consequently are considered more probable.

We measure sim(R_w, R_SR) by adapting a string kernel to our task. Our hypothesis is that the more common substrings two dependency paths have, the more similar they are. The string kernel we used is similar to Leslie (2002) and is defined as the sum of weighted common dependency relation subsequences between R_w and R_SR. For efficiency, we consider only unigram and bigram subsequences. Subsequences are weighted by a metric akin to tf·idf which measures the degree of association between a candidate SR and the dependency relation r present in the subsequence:

(3)  weight_{SR}(r) = f_r \cdot \log\left(1 + \frac{N}{n_r}\right)

where f_r is the frequency of r occurring in SR; N is the total number of SRs evoked by a given frame; and n_r is the number of SRs containing r.
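The sketch below spells out one way to read equations (2) and (3). The extracted text leaves some details open (e.g., whether subsequences are contiguous and how a bigram combines the weights of its two relations), so this version treats subsequences as contiguous n-grams and scores a shared bigram as the sum of its relation weights; the data layout and function names are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Path = Tuple[str, ...]  # a dependency relation path, e.g. ("subj_U", "obj_D")

def role_relation_weights(paths_by_role: Dict[str, List[Path]]) -> Dict[str, Dict[str, float]]:
    """Equation (3) for one frame: weight_SR(r) = f_r * log(1 + N / n_r), where
    f_r is the frequency of relation r in SR's annotated paths, N the number of
    roles the frame evokes, and n_r the number of roles whose paths contain r."""
    N = len(paths_by_role)
    role_freqs = {role: Counter(r for path in paths for r in path)
                  for role, paths in paths_by_role.items()}
    n = Counter(r for freq in role_freqs.values() for r in freq)  # n_r
    return {role: {r: f * math.log(1 + N / n[r]) for r, f in freq.items()}
            for role, freq in role_freqs.items()}

def ngrams(path: Path, size: int) -> List[Path]:
    return [path[i:i + size] for i in range(len(path) - size + 1)]

def path_similarity(r_w: Path, r_sr: Path, weights: Dict[str, float]) -> float:
    """String-kernel similarity: sum the weights of unigram and bigram
    subsequences occurring in both paths."""
    sim = 0.0
    for size in (1, 2):
        for sub in set(ngrams(r_w, size)) & set(ngrams(r_sr, size)):
            sim += sum(weights.get(r, 0.0) for r in sub)
    return sim

def compatibility(r_w: Path, role: str, paths_by_role: Dict[str, List[Path]]) -> float:
    """Equation (2): s(w, SR) = max over annotated paths R_SR of
    sim(R_w, R_SR) * P(R_SR), with P(R_SR) the path's relative frequency."""
    weights = role_relation_weights(paths_by_role)[role]
    counts = Counter(paths_by_role[role])
    total = sum(counts.values())
    return max(path_similarity(r_w, p, weights) * c / total for p, c in counts.items())

# Toy usage: paths annotated for two roles of a hypothetical frame.
annotated = {"Seller": [("subj_U",), ("subj_U", "conj_D")], "Goods": [("obj_D",)]}
print(compatibility(("subj_U",), "Seller", annotated))
```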

[Figure 2: Sample original bipartite graph (a) and its subgraph with edge covers (b). In each graph, the left partition represents frame elements and the right partition semantic roles.]

For each frame element we thus generate a set of semantic role assignments Set(SRA). This initial assignment can be usefully represented as a complete bipartite graph in which each frame element (word or phrase) is connected to the semantic roles licensed by the predicate and vice versa (see Figure 2a). Edges are weighted and represent how compatible the frame elements and semantic roles are (see equation (2)). Now, for each frame element w we could simply select the semantic role with the highest score. However, this decision procedure is local, i.e., it yields a semantic role assignment for each frame element independently of all other elements. We therefore may end up with the same role being assigned to two frame elements or with frame elements having no role at all. We remedy this shortcoming by treating the semantic role assignment as a global optimization problem.

Specifically, we model the interaction between all pairwise labeling decisions as a minimum weight bipartite edge cover problem (Eiter and Mannila, 1997; Cormen et al., 1990). An edge cover is a subgraph of a bipartite graph so that each node is linked to at least one node of the other partition. This yields a semantic role assignment for all frame elements (see Figure 2b where frame elements and roles are adjacent to an edge). Edge covers have been successfully applied in several natural language processing tasks, including machine translation (Taskar et al., 2005) and annotation projection (Padó and Lapata, 2006).

Formally, optimal edge cover assignments are solutions of the following optimization problem:

(4)  \max_{E\ \text{is an edge cover}} \ \prod_{(nd^w, nd^{SR}) \in E} s(nd^w, nd^{SR})

where s(nd^w, nd^{SR}) is the compatibility score between the frame element node nd^w and the semantic role node nd^{SR}. Edge covers can be computed efficiently in cubic time using algorithms for the equivalent linear assignment problem. Our experiments used Jonker and Volgenant's (1987) solver.²

²The software is available from http://www.magiclogic.com/assignment.html.

Figure 3 shows the semantic role assignments generated by our model for the question Q: Who discovered prions? and the candidate answer sentence S: 1997: Stanley B. Prusiner, United States, discovery of prions... Here we identify two predicates, namely discover and discovery. The expected answer phrase (EAP) who and the answer candidate Stanley B. Prusiner are assigned the COGNIZER role. Note that frame elements can bear multiple semantic roles. By inducing a soft labeling we hope to render the matching of questions and answers more robust, thereby addressing to some extent the coverage problems associated with FrameNet.

[Figure 3: Semantic structures induced by our model for a question and answer sentence]

5 Semantic Structure Matching

We measure the similarity between a question and its candidate answer by matching their predicates and semantic role assignments. Since SRs are frame-specific, we prioritize frame matching over SR matching. Two predicates match if they evoke the same frame or one of its hypernyms (or hyponyms). The latter are expressed by the Inherits From and Is Inherited By relations in the frame definitions. If the predicates match, we examine whether the assigned semantic roles match. Since we represent SR assignments as graphs with edge covers, we can also formalize SR matching as a graph matching problem.

The similarity between two graphs is measured as the sum of similarities between their subgraphs. We first decompose a graph into subgraphs consisting of one frame element node w and a set of SR nodes connected to it. The similarity between two subgraphs SubG1 and SubG2 is then formalized as:

(5)  Sim(SubG_1, SubG_2) = \sum_{\substack{nd_1^{SR} \in SubG_1,\ nd_2^{SR} \in SubG_2 \\ nd_1^{SR} = nd_2^{SR}}} \frac{1}{|s(nd^w, nd_1^{SR}) - s(nd^w, nd_2^{SR})| + 1}

where nd_1^{SR} and nd_2^{SR} are semantic role nodes connected to a frame element node nd^w in SubG1 and SubG2, respectively, and s(nd^w, nd_1^{SR}) and s(nd^w, nd_2^{SR}) are edge weights between two nodes in the corresponding subgraphs (see (2)). Our intuition here is that the more semantic roles two subgraphs share for a given frame element, the more similar they are, and the closer their corresponding edge weights should be. Edge weights are normalized by dividing by the sum of all edges in a subgraph.
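As a concrete illustration of the global step in equation (4), the sketch below turns the scores of equation (2) into costs and solves the minimum-weight edge cover by reducing it to a linear assignment problem. The auxiliary-matrix reduction used here is one standard construction (in the spirit of Eiter and Mannila (1997)), not necessarily the one used by the authors, and SciPy's assignment solver stands in for the Jonker-Volgenant code used in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def optimal_role_assignment(score: np.ndarray):
    """Equation (4): pick the edge cover of the bipartite graph
    (frame elements x semantic roles) that maximises the product of scores.
    score[i, j] is s(nd_w_i, nd_SR_j) from equation (2).
    Returns a set of (frame_element, role) index pairs covering every node."""
    cost = -np.log(np.clip(score, 1e-12, None))  # max product <=> min sum of -log
    n, m = cost.shape
    BIG = 1e9                                    # effectively forbidden cells

    # Reduce the min-weight edge cover to an (n+m) x (m+n) assignment problem:
    # matching a node to its "dummy" partner means covering it by its cheapest edge.
    aux = np.full((n + m, m + n), BIG)
    aux[:n, :m] = cost                           # real edges
    aux[n:, m:] = 0.0                            # unused dummies pair up for free
    for i in range(n):
        aux[i, m + i] = cost[i].min()            # cover frame element i alone
    for j in range(m):
        aux[n + j, j] = cost[:, j].min()         # cover role j alone

    rows, cols = linear_sum_assignment(aux)
    cover = set()
    for r, c in zip(rows, cols):
        if r < n and c < m:
            cover.add((r, c))                          # edge chosen directly
        elif r < n:
            cover.add((r, int(cost[r].argmin())))      # cheapest role for element r
        elif c < m:
            cover.add((int(cost[:, c].argmin()), c))   # cheapest element for role c
    return cover

# Toy usage: two frame elements, three candidate roles (scores are made up).
scores = np.array([[0.6, 0.1, 0.05],
                   [0.1, 0.8, 0.30]])
print(optimal_role_assignment(scores))
```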

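A short sketch of the per-subgraph term in equation (5), with the normalization mentioned above applied first; the role labels and weights in the usage example are illustrative, and how subgraph pairs across the two structures are enumerated is left to the caller.

```python
from typing import Dict

def normalise(subgraph: Dict[str, float]) -> Dict[str, float]:
    """Divide each edge weight by the sum of all edges in the subgraph."""
    total = sum(subgraph.values()) or 1.0
    return {role: w / total for role, w in subgraph.items()}

def subgraph_similarity(sub1: Dict[str, float], sub2: Dict[str, float]) -> float:
    """Equation (5): every semantic role shared by the two subgraphs contributes
    1 / (|s1 - s2| + 1); identical weights add 1, diverging weights add less."""
    s1, s2 = normalise(sub1), normalise(sub2)
    return sum(1.0 / (abs(s1[r] - s2[r]) + 1.0) for r in s1.keys() & s2.keys())

# Toy usage: the EAP's roles vs. an answer candidate's roles (weights made up).
eap = {"Cognizer": 0.60, "Phenomenon": 0.10, "Evidence": 0.05}
cand = {"Cognizer": 0.70, "Phenomenon": 0.15, "Evidence": 0.20}
print(subgraph_similarity(eap, cand))
```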
6 Experimental Setup

Data  All our experiments were performed on the TREC02–05 factoid questions. We excluded NIL questions since TREC doesn't supply an answer for them. We used the FrameNet V1.3 lexical database. It contains 10,195 predicates grouped into 795 semantic frames and 141,238 annotated sentences. Figure 4 shows the number of annotated sentences available for different predicates. As can be seen, there are 3,380 predicates with no annotated sentences and 1,175 predicates with less than 5 annotated sentences. All FrameNet sentences, questions, and answer sentences were parsed using MiniPar (Lin, 1994), a robust dependency parser.

[Figure 4: Distribution of numbers of predicates and annotated sentences; each sub-pie lists the number of predicates (above) with their corresponding range of annotated sentences (below): 3,380 predicates with 0 annotated sentences, 1,175 with 1–5, 1,287 with 6–10, 1,757 with 11–20, 2,117 with 21–50, 439 with 51–100, and 40 with 101 or more.]

As mentioned in Section 4, we extract dependency relation paths by traversing the dependency tree from the frame element node to the predicate node. We used all dependency relations provided by MiniPar (42 in total). In order to increase coverage, we combine all relation paths for predicates that evoke the same frame and are labeled with the same POS tag. For example, found and establish are both instances of the frame Intentionally create, but the database does not have any annotated sentences for found.v. Instead of assigning no role labels to found.v, our model employs the relation paths for the semantically related establish.v.

Preprocessing  Here we summarize the steps of our QA system preceding the assignment of semantic structure and answer extraction. For each question, we recognize its expected answer type (e.g., in Q: Which record company is Fred Durst with? we would expect the answer to be an ORGANIZATION). Answer types are determined using classification rules similar to Li and Roth (2002). We also reformulate questions into declarative sentences following the strategy proposed in Brill et al. (2002).

The reformulated sentences are submitted as queries to an IR engine for retrieving sentences with relevant answers. Specifically, we use the Lemur Toolkit³, a state-of-the-art language model-driven search engine. We work only with the 50 top-ranked sentences as this setting performed best in previous experiments of our QA system. We also add to Lemur's output gold standard sentences, which contain and support an answer for each question. Specifically, documents relevant for each question are retrieved from the AQUAINT Corpus⁴ according to TREC supplied judgments. Next, sentences which match both the TREC provided answer pattern and at least one question key word are extracted and their suitability is manually judged by humans. The set of relevant sentences thus includes at least one sentence with an appropriate answer as well as sentences that do not contain any answer specific information. This setup is somewhat idealized; however, it allows us to evaluate our answer extraction module in more detail (since when an answer is not found, we know it is the fault of our system).

³See http://www.lemurproject.org/ for details.
⁴This corpus consists of English newswire texts and is used as the main document collection in official TREC evaluations.

Relevant sentences are annotated with their named entities using Lingpipe⁵, a MUC-based named entity recognizer. When we successfully classify a question with an expected answer type (e.g., ORGANIZATION in the example above), we assume that all NPs attested in the set of relevant sentences with the same answer type are candidate answers; in cases where no answer type is found (e.g., as in Q: What are prions made of?), all NPs in the relevant answers set are considered candidate answers.

⁵The software is available from www.alias-i.com/lingpipe/.
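A minimal sketch of the coverage back-off described above, in which annotated paths are pooled across predicates that evoke the same frame and share a POS tag; the dictionary layout is illustrative.

```python
from typing import Dict, List, Tuple

# annotated[(lemma, pos, frame)] -> list of (semantic_role, dependency_path) pairs
Annotated = Dict[Tuple[str, str, str], List[Tuple[str, Tuple[str, ...]]]]

def paths_for_predicate(lemma: str, pos: str, frame: str,
                        annotated: Annotated) -> List[Tuple[str, Tuple[str, ...]]]:
    """Return the predicate's own annotated relation paths; if it has none
    (e.g. found.v in Intentionally_create), fall back to the paths of other
    predicates that evoke the same frame and carry the same POS tag."""
    own = annotated.get((lemma, pos, frame), [])
    if own:
        return own
    pooled = []
    for (other_lemma, other_pos, other_frame), paths in annotated.items():
        if other_frame == frame and other_pos == pos and other_lemma != lemma:
            pooled.extend(paths)
    return pooled

# Toy usage: found.v has no annotations, so it borrows establish.v's paths.
annotated = {("establish", "v", "Intentionally_create"): [("Created_entity", ("obj_D",))]}
print(paths_for_predicate("found", "v", "Intentionally_create", annotated))
```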

Baseline  We compared our answer extraction method to a QA system that exploits solely syntactic information without making use of FrameNet or any other type of role semantic annotations. For each question, the baseline identifies key phrases deemed important for answer identification. These are verbs, noun phrases, and expected answer phrases (EAPs, see Section 3). All dependency relation paths connecting a key phrase and an EAP are compared to those connecting the same key phrases and an answer candidate. The similarity of question and answer paths is computed using a simplified version⁶ of the similarity measure proposed in Shen and Klakow (2006).

⁶Shen and Klakow (2006) use a dynamic time warping algorithm to calculate the degree to which dependency relation paths are correlated. Correlations for individual relations are estimated from training data whereas we assume a binary value (1 for identical relations and 0 otherwise). The modification was necessary to render the baseline system comparable to our answer extraction model, which is unsupervised.

Our second baseline employs Shalmaneser (Erk and Padó, 2006), a publicly available shallow semantic parser⁷, for the role labeling task instead of the graph-based model presented in Section 4. The software is trained on the FrameNet annotated sentences using a standard feature set (see Carreras and Màrquez (2005) for details). We use Shalmaneser to parse questions and answer sentences. The parser makes hard decisions about the presence or absence of a semantic role. Unfortunately, this prevents us from using our method for semantic structure matching (see Section 5), which assumes a soft labeling. We therefore came up with a simple matching strategy suitable for the parser's output. For question and answer sentences matching in their frame assignment, phrases bearing the same semantic role as the EAP are considered answer candidates. The latter are ranked according to word overlap (i.e., identical phrases are ranked higher than phrases with no overlap at all).

⁷The software is available from http://www.coli.uni-saarland.de/projects/salsa/shal/.

7 Results

Our evaluation was motivated by the following questions: (1) How does the incompleteness of FrameNet impact QA performance on the TREC data sets? In particular, we wanted to examine whether there are questions for which in principle no answer can be found due to missing frame entries or missing annotated sentences. (2) Are all questions and their corresponding answers amenable to a FrameNet-style analysis? In other words, we wanted to assess whether questions and answers often evoke the same or related frames (with similar roles). This is a prerequisite for semantic structure matching and ultimately answer extraction. (3) Do the graph-based models introduced in this paper bring any performance gains over state-of-the-art shallow semantic parsers or more conventional syntax-based QA systems? Recall that our graph-based models were designed especially for the QA answer extraction task.

Our results are summarized in Tables 1–3. Table 1 records the number of questions to be answered for the TREC02–05 datasets (Total). We also give information regarding the number of questions which are in principle unanswerable with a FrameNet-style semantic role analysis.

Column NoFrame shows the number of questions which don't have an appropriate frame or predicate in the database. For example, there is currently no predicate entry for sponsor or sink (e.g., Q: Who is the sponsor of the International Criminal Court? and Q: What date did the Lusitania sink?). Column NoAnnot refers to questions for which no semantic role labeling is possible because annotated sentences for the relevant predicates are missing. For instance, there are no annotations for win (e.g., Q: What division did Floyd Patterson win?) or for hit (e.g., Q: What was the Beatles' first number one hit?). This problem is not specific to our method, which admittedly relies on FrameNet annotations for performing the semantic role assignment (see Section 4). Shallow semantic parsers trained on FrameNet would also have trouble assigning roles to predicates for which no data is available.

Finally, column NoMatch reports the number of questions which cannot be answered due to frame mismatches. Consider Q: What does AARP stand for? whose answer is found in S: The American Association of Retired Persons (AARP) qualify for discounts.... The answer and the question evoke different frames; in fact here a semantic role analysis is not relevant for locating the right answer. As can be seen, NoMatch cases are by far the most frequent. The number of questions remaining after excluding NoFrame, NoAnnot, and NoMatch are shown under the Rest heading in Table 1.

Data    Total  NoFrame     NoAnnot    NoMatch      Rest
TREC02  444    87 (19.6)   29 (6.5)   176 (39.6)   152 (34.2)
TREC03  380    55 (14.5)   30 (7.9)   183 (48.2)   112 (29.5)
TREC04  203    47 (23.1)   14 (6.9)   67 (33.0)    75 (36.9)
TREC05  352    70 (19.9)   23 (6.5)   145 (41.2)   114 (32.4)

Table 1: Number of questions which cannot be answered using a FrameNet style semantic analysis; numbers in parentheses are percentages of Total (NoFrame: frames or predicates are missing; NoAnnot: annotated sentences are missing; NoMatch: questions and candidate answers evoke different frames).

These results indicate that FrameNet-based semantic role analysis applies to approximately 35% of the TREC data. This means that an extraction module relying solely on FrameNet will have poor performance, since it will be unable to find answers for more than half of the questions being asked. We nevertheless examine whether our model brings any performance improvements on this limited dataset, which is admittedly favorable towards a FrameNet style analysis. Table 2 shows the results of our answer extraction module (SemMatch) together with two baseline systems. The first baseline uses only dependency relation path information (SynMatch), whereas the second baseline (SemParse) uses Shalmaneser, a state-of-the-art shallow semantic parser, for the role labeling task. We consider an answer correct if it is returned with rank 1. As can be seen, SemMatch is significantly better than both SynMatch and SemParse, whereas the latter is significantly worse than SynMatch.

Model     TREC02   TREC03   TREC04   TREC05
SemParse  13.16    8.92     17.33    13.16
SynMatch  35.53∗   33.04∗   40.00∗   36.84∗
SemMatch  53.29∗†  49.11∗†  54.67∗†  59.65∗†

Table 2: System performance on the subset of the TREC datasets (see Rest column in Table 1); ∗: significantly better than SemParse; †: significantly better than SynMatch (p < 0.01, using a χ² test).

Model      TREC02   TREC03   TREC04   TREC05
SynMatch   32.88∗   30.70∗   35.95∗   34.38∗
+SemParse  25.23    23.68    28.57    26.70
+SemMatch  38.96∗†  35.53∗†  42.36∗†  41.76∗†

Table 3: System performance on the TREC datasets (see Total column in Table 1); ∗: significantly better than +SemParse; †: significantly better than SynMatch (p < 0.01, using a χ² test).

Although promising, the results in Table 2 are not very informative, since they show performance gains on partial data. Instead of using our answer extraction model on its own, we next combined it with the syntax-based system mentioned above (SynMatch, see also Section 6 for details). If FrameNet is indeed helpful for QA, we would expect an ensemble system to yield better performance than a purely syntactic answer extraction module. The two systems were combined as follows. Given a question, we first pass it to our FrameNet model; if an answer is found, our job is done; if no answer is returned, the question is passed on to SynMatch. Our results are given in Table 3. +SemMatch and +SemParse are ensemble systems using SynMatch together with the QA-specific role labeling method proposed in this paper and Shalmaneser, respectively. We also compare these systems against SynMatch on its own.

We can now attempt to answer our third question concerning our model's performance on the TREC data. Our experiments show that a FrameNet-enhanced answer extraction module significantly outperforms a similar module that uses only syntactic information (compare SynMatch and +SemMatch in Table 3). Another interesting finding is that the shallow semantic parser performs considerably worse in comparison to our graph-based models and the syntax-based system.
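The combination scheme just described is a simple back-off; a sketch (function names are illustrative):

```python
from typing import Callable, Optional

def ensemble_answer(question: str,
                    sem_match: Callable[[str], Optional[str]],
                    syn_match: Callable[[str], Optional[str]]) -> Optional[str]:
    """+SemMatch: try the FrameNet-based extractor first; if it returns no
    answer (missing frame, missing annotations, or no frame match), fall back
    to the purely syntactic SynMatch."""
    answer = sem_match(question)
    return answer if answer is not None else syn_match(question)
```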

Inspection of the parser's output highlights two explanations for this. First, the shallow semantic parser has difficulty assigning accurate semantic roles to questions (even when they are reformulated as declarative sentences). Second, it tends to favor precision over recall, thus reducing the number of questions for which answers can be found. A similar finding is reported in Sun et al. (2005) for a PropBank-trained parser.

8 Conclusion

In this paper we assess the contribution of semantic role labeling to open-domain factoid question answering. We present a graph-based answer extraction model which effectively incorporates FrameNet-style role semantic information and show that it achieves promising results. Our experiments show that the proposed model can be effectively combined with a syntax-based system to obtain performance superior to the latter when used on its own. Furthermore, we demonstrate performance gains over a shallow semantic parser trained on the FrameNet annotated corpus. We argue that performance gains are due to the adopted graph-theoretic framework, which is robust to coverage and recall problems.

We also provide a detailed analysis of the appropriateness of FrameNet for QA. We show that performance can be compromised due to incomplete coverage (i.e., missing frame or predicate entries as well as annotated sentences) but also because of mismatching question-answer representations. The question and the answer may evoke different frames or the answer simply falls outside [...] investigate the potential of our model for shallow semantic parsing, since our experience so far has shown that it achieves good recall.

Acknowledgements  We are grateful to Sebastian Padó for running Shalmaneser on our data. Thanks to Frank Keller and Amit Dubey for insightful comments and suggestions. The authors acknowledge the support of DFG (Shen; PhD studentship within the International Postgraduate College "Language Technology and Cognitive Systems") and EPSRC (Lapata; grant EP/C538447/1).

References

E. Brill, S. Dumais, M. Banko. 2002. An analysis of the AskMSR question-answering system. In Proceedings of the EMNLP, 257–264, Philadelphia, PA.

X. Carreras, L. Màrquez, eds. 2005. Proceedings of the CoNLL Shared Task: Semantic Role Labelling.

T. Cormen, C. Leiserson, R. Rivest. 1990. Introduction to Algorithms. MIT Press.

H. Cui, R. X. Sun, K. Y. Li, M. Y. Kan, T. S. Chua. 2005. Question answering passage retrieval using dependency relations. In Proceedings of the ACM SIGIR, 400–407. ACM Press.

T. Eiter, H. Mannila. 1997. Distance measures for point sets and their computation. Acta Informatica, 34(2):109–133.

K. Erk, S. Padó. 2006. Shalmaneser - a flexible toolbox for semantic role assignment. In Proceedings of the LREC, 527–532, Genoa, Italy.

C. Fellbaum, ed. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

C. J. Fillmore, C. R. Johnson, M. R. Petruck. 2003. Background to FrameNet. International Journal of Lexicography, 16:235–250.

D. Gildea, D. Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288.

T. Grenager, C. D. Manning. 2006. Unsupervised discovery of a statistical verb lexicon. In Proceedings of the EMNLP, 1–8, Sydney, Australia.

R. Jonker, A. Volgenant. 1987. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38:325–340.