Working Notes of NTCIR-4, 2-4 June 2004

ASKMi at NTCIR-4 QAC2

Tetsuya Sakai, Yoshimi Saito, Yumi Ichimura, Makoto Koyama, Tomoharu Kokubu
Knowledge Media Laboratory, Toshiba Corporate R&D Center
[email protected]

Abstract

Toshiba participated in NTCIR-4 QAC2 Task 1: this is our first QAC participation. Using our newly developed Japanese QA system called ASKMi, we submitted two runs, one using Okapi/BM25 for document retrieval (TSB-A) and the other using Boolean AND constraints before applying Okapi/BM25 (TSB-B). We achieved the 5th best performance among the 17 participants (8th and 9th among the 25 submitted runs). This paper briefly describes ASKMi, and discusses the formal run results using Reciprocal Rank as well as a new performance metric called Q-measure.

Keywords: ASKMi, Q-measure.

1 Introduction

Toshiba participated in NTCIR-4 QAC2 Task 1: this is our first QAC participation. Using our newly developed Japanese QA system called ASKMi ("ask me") [11], we submitted two runs, TSB-A (TOSHIBA ASKMi) and TSB-B (TOSHIBA ASKMi Boolean). The only difference between the two runs is the document retrieval strategy: TSB-A used Okapi/BM25 [14], while TSB-B used a Boolean AND constraint before applying Okapi/BM25 (See Section 2.4). Table 1 provides a quick summary of our official results. TSB-A and TSB-B achieved the 8th and 9th best performance among the 25 submitted runs, which places us 5th among the 17 participants. (Our analyses are based on the answer file QAC2formalAnsTask1 040308 throughout this paper.)

Table 1. TSB Formal Run Results based on QAC2formalAnsTask1 040308.

  Run Name   MRR     Description
  TSB-A      0.454   ASKMi (BM25)
  TSB-B      0.440   ASKMi (BM25+Boolean)

The remainder of this paper is organised as follows. Section 2 briefly describes ASKMi. Section 3 briefly describes how Q-measure, an information retrieval performance metric for multigrade relevance, can be applied to QA evaluation. Using both Reciprocal Rank and Q-measure, Section 4 analyses our formal run results as well as some "oracle" runs (e.g. [15]). Finally, Section 5 concludes this paper. We report on our work for the NTCIR-4 CLIR task in a separate paper [12].

2 ASKMi

ASKMi stands for "Answer Seeker/Knowledge Miner": Figure 1 shows its configuration. The Knowledge Miner consists of off-line components, namely, the Semantic Class Recognizer and the Relation Extractor. The Answer Seeker consists of on-line components, namely, the Question Analyzer, the Retriever/Passage Extractor, and the Answer Formulator. Sections 2.1-2.5 briefly describe each component; more details can be found in [11].

2.1 Semantic Class Recognizer

The basic function of the Semantic Class Recognizer is rule-based named entity recognition. For documents, it performs candidate answer extraction (or predictive annotation [6]): the strings extracted through this process are called ex-strings, which are associated with answer extraction confidence values. For questions, it performs question abstraction: for example, given Japanese questions such as "Toshiba no shachō (Toshiba's president)" and "Maikurosofuto no fukushachō (Microsoft's vice president)", question abstraction converts them into "COMPANY no POSITION", where no is the Japanese possessive particle.

Currently, the Semantic Class Recognizer maintains an Answer Type Taxonomy that consists of more than 100 answer types. In [5], we have studied how named entity recognition performance and answer type granularity affect QA performance using the QAC1 test collection.
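To make question abstraction concrete, the following Python fragment is a minimal sketch of the idea only: the romanised tokens and the two toy gazetteers are illustrative stand-ins, not ASKMi's actual Japanese pattern-matching rules.

```python
# Minimal sketch of question abstraction (Section 2.1).
# The gazetteers and romanised tokens below are illustrative only.

COMPANY = {"Toshiba", "Maikurosofuto"}      # toy company gazetteer
POSITION = {"shacho", "fukushacho"}         # toy position gazetteer

def abstract_question(tokens):
    """Replace each recognised named entity with its semantic class label."""
    out = []
    for tok in tokens:
        if tok in COMPANY:
            out.append("COMPANY")
        elif tok in POSITION:
            out.append("POSITION")
        else:
            out.append(tok)                 # particles such as "no" are kept
    return " ".join(out)

# Both questions abstract to the same form, "COMPANY no POSITION":
print(abstract_question(["Toshiba", "no", "shacho"]))
print(abstract_question(["Maikurosofuto", "no", "fukushacho"]))
```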

Figure 1. ASKMi configuration. (The off-line Knowledge Miner, comprising the Semantic Class Recognizer and the Relation Extractor, builds the Document DB and the Encyclopaedic DB from documents and external knowledge sources; the on-line Answer Seeker, comprising the Question Analyzer, the Retriever/Passage Extractor and the Answer Formulator, uses these DBs to turn a question into answers.)

2.2 Relation Extractor

The Relation Extractor has been designed to collect encyclopaedic knowledge from free-text documents and other knowledge sources such as web pages. It creates an encyclopaedic DB for dealing with questions like "DVD towa nani? (What is DVD?)". However, for QAC2, the Relation Extractor managed to answer only one question, namely, QAC2-10036-01 "What does ADSL stand for?". We plan to improve the accuracy and coverage of the Relation Extractor by utilising the Semantic Class Recognizer in the near future.

2.3 Question Analyzer

The Question Analyzer has several functions including answer type recognition, query generation and answer constraint generation. These functions rely on Semantic Role Analysis (SRA), which can assign arbitrary labels to textual fragments based on pattern matching rules with priorities among them [8, 9, 11].

2.3.1 Answer Type Recognition

When ASKMi receives a question from the user, it goes through the following steps:

1. Question Abstraction: As mentioned earlier, both "Toshiba no shachō (Toshiba's president)" and "Maikurosofuto no fukushachō (Microsoft's vice president)" are transformed into "COMPANY no POSITION".

2. Question Type Recognition: Based on Japanese interrogatives found within the question string, a question type such as NANI (what), DARE (who), ITSU KARA (from when) or DOKO HE (to where) is assigned to it.

3. Answer Type Recognition (or question classification [3]): Each question class is mapped to one or more answer categories (i.e. groups of possible answer types). Then, SRA pattern matching is applied to determine the final answer types. To each answer type, an answer type recognition confidence value is attached.

2.3.2 Query Generation

Query generation refers not only to generating a set of search terms from a natural language question, but also to answer type sensitive query expansion and document constraint generation.

Answer type sensitive query expansion is designed to enhance document retrieval performance. For example, if the question string contains the Japanese expression "umareta (was born)", and if the expected answer type is "DATE", then words such as "seinengappi (date of birth)" are added to the original question. In contrast, if the expected answer type is "LOCATION", then words such as "shusshinchi (place of birth)" are added. Of course, answer type insensitive query expansion is also possible: if the question string contains expressions such as "mottomo takai (highest)" or "mottomo ookii (largest)", then a single kanji word that conveys the same meaning ("saikō" or "saidai") can be added regardless of answer types. (A minimal code sketch of this expansion step is given after Section 2.3.3.)

Answer type sensitive document constraint generation creates a database restriction condition from the input question in order to perform high-precision search of the document DB. For INCIDENT/ACCIDENT-related questions such as "What happened on September 11, 2001?", ASKMi can temporarily eliminate all newspaper articles published before September 11, 2001 from the database, thus reducing the search space. In contrast, for EVENT-related questions such as those concerning the Olympic games, database restriction may not be applied, because such major events may be mentioned in newspapers well before they actually take place.

2.3.3 Answer Constraint Generation

Answer constraint generation includes techniques that are similar to expected numerical range [4]. Answer constraints can also be applied based on Japanese particles: for example, if the question class is DOKO HE (to where), answers which are immediately followed by "he" or "ni" (Japanese particles indicating destination) are preferred. Such answer constraints heuristically modify the original answer scores at the final answer formulation stage.
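The answer type sensitive expansion of Section 2.3.2 can be pictured as a lookup keyed on (trigger expression, expected answer type). The Python sketch below is illustrative only: the two rule tables and the sample question are toy stand-ins, not ASKMi's actual SRA rule base.

```python
# Illustrative sketch of answer-type-sensitive query expansion (Section 2.3.2).
# The rule tables below are toy examples; ASKMi's real rules are SRA pattern rules.

EXPANSION_RULES = {
    ("umareta", "DATE"): ["seinengappi"],       # "was born" + DATE -> "date of birth"
    ("umareta", "LOCATION"): ["shusshinchi"],   # "was born" + LOCATION -> "place of birth"
}

# Answer-type-INsensitive expansion: the same term is added whatever the answer type.
INSENSITIVE_RULES = {
    "mottomo takai": ["saiko"],                 # "highest" -> single kanji word
    "mottomo ookii": ["saidai"],                # "largest" -> single kanji word
}

def expand_query(question, terms, answer_type):
    """Return the original search terms plus any expansion terms the rules license."""
    expanded = list(terms)
    for (trigger, atype), additions in EXPANSION_RULES.items():
        if trigger in question and atype == answer_type:
            expanded.extend(additions)
    for trigger, additions in INSENSITIVE_RULES.items():
        if trigger in question:
            expanded.extend(additions)
    return expanded

# Hypothetical romanised question, expected answer type LOCATION:
print(expand_query("Mozart wa doko de umareta ka", ["Mozart", "umareta"], "LOCATION"))
# -> ['Mozart', 'umareta', 'shusshinchi']
```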

2.4 Retriever/Passage Extractor

The ASKMi Retriever uses the same basic components as the BRIDJE system [9, 12]. For a given question q and each document d, it calculates the original document score (origdscore(q, d)) using Okapi/BM25 term weighting [14], and retrieves a fixed number of documents that contain candidate answer strings annotated with designated answer types. The Retriever can optionally perform Boolean search prior to document ranking, which may enhance retrieval precision at the cost of reducing retrieval recall. It can also perform Pseudo-Relevance Feedback [7], although it was not used for QAC2.

For QAC2, TSB-A used the Okapi/BM25 term weighting, whereas TSB-B used Boolean AND operators prior to term weighting if the number of query terms was less than four. If there were four or more query terms, or if the Boolean search was not successful, TSB-B fell back to TSB-A.

We did not use the Passage Extractor even though it has several static and dynamic passage selection functions [11]. This is because we have not found a passage selection method that is significantly more effective than using the whole document in terms of the final QA performance. Our comparison of several static passage selection methods in terms of document retrieval performance can be found in [10].

2.5 Answer Formulator

The Answer Formulator calculates the scores of candidate answers that are included in the passages. An ex-string e is a quadruple that has been extracted through candidate answer extraction. It represents a specific occurrence of a candidate answer within a document. A candidate c is a triple <document, answertype, string>, which is obtained by consolidating the ex-strings (i.e. multiple occurrences) within a document. An answer string a is obtained by consolidating candidates across documents and across answer types. Let C(e) and A(c) represent the mapping from ex-strings to the corresponding candidates, and that from candidates to the corresponding answer strings, respectively.

Let e be an ex-string from a document d. Let t be a query term found in d, and let p be a passage extracted from d. Firstly, we define the co-occurrence function as follows:

  distance(e, t, p) = \min_i |pos_e(e, p) - pos_t(t, i, p)|    (1)

  cooc(e, t, p) = 1 / (1 + P_C \cdot distance(e, t, p))    (2)

where pos_e(e, p) is the position of ex-string e within p, pos_t(t, i, p) is the position of the i-th occurrence of t within p, and P_C is a constant called the co-occurrence parameter (P_C >= 0).

For a given question q and a document d, the score of an ex-string (escore) and that of a candidate (cscore) are calculated as follows:

  escore(q, e) = cf_{atr}(q, e) \cdot cf_{aex}(e) \cdot \max_p \frac{1}{|q|} \sum_{t \in q} cooc(e, t, p)    (3)

  cscore(q, c) = \max_{e : C(e) = c} escore(q, e)    (4)

where cf_{atr}(q, e) is the answer type recognition confidence for the answer type of e, given q, and cf_{aex}(e) is the answer extraction confidence for e. Moreover, we can optionally modify the BM25-based document score (See Section 2.4) as follows:

  dscore(q, d) = \max\{0, P_D \cdot (origdscore(q, d) - 1) + 1\}    (5)

where P_D is a constant called the document score parameter (P_D >= 0).

Let D(c) represent the mapping from a candidate c to the corresponding document. We can now calculate the score of an answer string a:

  ascore(q, a) = \sum_{c : A(c) = a} dscore(q, D(c)) \cdot cscore(q, c)    (6)

Finally, as mentioned in Section 2.3.3, the above answer scores are heuristically adjusted based on answer constraints. Moreover, answer string consolidation is performed in order to prevent returning multiple answers that mean the same thing. Again, this process is answer type sensitive: for example, if the answer type is POLITICIAN, then an answer string "Junichiro Koizumi" absorbs other "less complete" candidates such as "Koizumi" and "Junichiro". On the other hand, if the answer type is COMPANY, then answer string consolidation is not performed, because, for example, "Toshiba EMI" and "Toshiba" are different companies even though the former contains the latter as a substring. Sakai [13] has shown the effectiveness of answer string consolidation using the QAC1 test collection.
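To make the cascade of Equations (1)-(6) concrete, the following Python fragment computes ascore for a toy question. It is a minimal re-statement of the formulas above, not ASKMi's implementation: the passage representation, the confidence values and the parameter settings P_C and P_D are all made up for illustration.

```python
# Minimal sketch of the scoring cascade of Equations (1)-(6).
# Data structures, confidences and parameters below are illustrative only.

P_C = 0.1   # co-occurrence parameter, P_C >= 0
P_D = 0.5   # document score parameter, P_D >= 0

def cooc(e_pos, t_positions):
    """Equations (1)-(2): co-occurrence of an ex-string and a query term in a passage."""
    if not t_positions:
        return 0.0
    distance = min(abs(e_pos - p) for p in t_positions)
    return 1.0 / (1.0 + P_C * distance)

def escore(query_terms, e_pos, passages, cf_atr, cf_aex):
    """Equation (3): ex-string score; passages maps a passage id to {term: [positions]}."""
    best = max(
        sum(cooc(e_pos, terms.get(t, [])) for t in query_terms) / len(query_terms)
        for terms in passages.values()
    )
    return cf_atr * cf_aex * best

def cscore(ex_string_scores):
    """Equation (4): candidate score = max over its ex-string occurrences."""
    return max(ex_string_scores)

def dscore(origdscore):
    """Equation (5): optionally flattened BM25 document score."""
    return max(0.0, P_D * (origdscore - 1.0) + 1.0)

def ascore(candidates):
    """Equation (6): answer score = sum of dscore * cscore over supporting candidates."""
    return sum(dscore(d) * c for d, c in candidates)

# Toy example: one ex-string at token position 10, one passage, two query terms.
passages = {"p1": {"toshiba": [8], "shacho": [12]}}
e = escore(["toshiba", "shacho"], 10, passages, cf_atr=0.9, cf_aex=0.8)
print(ascore([(1.4, cscore([e]))]))   # one supporting document with origdscore 1.4
```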
3 Application of Q-measure to QA

Although NTCIR QAC uses Reciprocal Rank for Task 1 (i.e. the "single-answer" task) and F-measure for Task 2 (i.e. the "list" task), many seemingly "single-answer" questions do in fact have multiple answers, and it is difficult to draw a line between these two tasks. Reciprocal Rank cannot evaluate multiple answers, while F-measure (or TREC Accuracy [18]) cannot take answer ranking into account. Moreover, the above QA evaluation metrics cannot handle correctness levels, even though some answers may be "more correct" than others, just as some documents may be more relevant than others in IR.

Sakai [13] has proposed a new evaluation metric called Q-measure, which can treat "single-answer" and "list" questions seamlessly, and can handle multiple correctness levels of answers. His experiments using the QAC1 test collection strongly suggest that Q-measure is a better QA metric than Reciprocal Rank. We therefore use Q-measure along with Reciprocal Rank for analysing the QAC2 results.

Our NTCIR-4 CLIR paper [12] contains the mathematical definition of Q-measure as an information retrieval performance metric. Below, we briefly describe how to modify the answer data of a traditional QA test collection (such as QAC2) so that Q-measure can be applied to QA evaluation.

Step 1 Construct answer synsets, or equivalence classes of answer strings, in order to avoid rewarding systems that return "duplicate" answers. This is not difficult for the QAC test collections, as a kind of equivalence class information is already included in the answer data (for evaluation with F-measure).

Step 2 For each answer string in each answer synset, assign a correctness level. For example, following the NTCIR document relevance levels, we can use "S-correct" (an excellent answer), "A-correct" (a good answer), and "B-correct" (an adequate answer) as answer correctness levels.

Table 2 provides some examples of how we actually modified the QAC2 Task 1 answer data. For each question, the number of answer synsets is denoted by R, and the i-th answer synset is denoted by AS(i).

Table 2. Examples of QAC2 answer synsets and correctness levels.

  QAC2-10001-01 (R = 1)
    AS(1) = {<"Masato Yoshii", S>, <"Yoshii", B>}
  QAC2-10031-01 (R = 1)
    AS(1) = {<"Tokyo Dome", S>, <"Bunkyo-ku, Tokyo", A>}
  QAC2-10049-01 (R = 1)
    AS(1) = {..., <"thirty-two years", S>, <"over thirty years", A>, <"thirty years", B>}
  QAC2-10074-01 (R = 1)
    AS(1) = {<"former prime minister Eisaku Sato", S>, <"Eisaku Sato", S>, <"Mr. Sato", B>}
  QAC2-10079-01 (R = 1)
    AS(1) = {<"DNA", S>, <"DNA (Deoxyribonucleic Acid)", B>}
  QAC2-10124-01 (R = 7)
    AS(1) = {<"Amalthea", S>, ...}
    AS(2) = {<"Adrastea", S>}
    AS(3) = {<"Io", S>, <"Satellite Io", B>}
    ...
  QAC2-10135-01 (R = 2)
    AS(1) = {<"Auguste Rodin", S>, <"Rodin", A>}
    AS(2) = {<"Gustav Klimt", S>, <"Klimt", A>}
  QAC2-10157-01 (R = 10)
    AS(1) = {<"Lenin", A>}
    AS(2) = {<"Former Prime Minister Stalin", A>, <"Stalin", A>}
    ...
    AS(7) = {<"Former President Gorbachev", A>, <"Gorbachev", A>}
    ...
  QAC2-10177-01 (R = 1)
    AS(1) = {<"New Delhi, India", S>, <"New Delhi", A>, <"India", A>}

QAC2-10001-01, 10074-01 and 10177-01 are the simplest examples: there is only one answer synset, but the answer strings vary in informativeness or completeness. For example, as "Yoshii" and "Mr. Sato" are relatively common Japanese surnames, they are considered to be B-correct. For 10031-01 "Where did Antonio Inoki's retirement match take place?", we constructed only one answer synset (even though the original QAC2 answer data treats the two answer strings as distinct instances), because "Bunkyo-ku, Tokyo" is merely the address of "Tokyo Dome", and may be a little awkward as an answer to the above question. 10049-01 is an example of assigning correctness levels from the viewpoint of how accurate the answer is: since the truly correct answer is "thirty-two years", "over thirty years" is probably a good answer, but "thirty years" is probably inaccurate. For 10079-01 "What is the abbreviation for Deoxyribonucleic Acid?", "DNA (Deoxyribonucleic Acid)" is a "not exact" answer [18] although it is treated as correct in the QAC2 data. We therefore treated this string as B-correct.

We now look at the examples with multiple answer synsets. For QAC2-10135-01, there are two answer synsets, each representing a famous artist. As "Rodin" and "Klimt" are not common names in Japan, they are treated as A-correct here, in contrast to "Yoshii" for 10001-01 and "Mr. Sato" for 10074-01, which were treated as B-correct. (Note that, just as document relevance criteria may vary across topics, answer correctness criteria may naturally vary across questions.) For 10124-01 and 10157-01, there are as many as 7 and 10 answer synsets, respectively. Reciprocal Rank is clearly inadequate for dealing with these questions. The answer strings for 10157-01 are all A-correct rather than S-correct, as none of them contains the first name.

Clearly, constructing answer synsets and assigning correctness levels requires subjective judgment. However, experience from IR and QA evaluations suggests that differences in subjective judgments have limited impact on performance comparisons [16, 17], and this paper assumes that this finding also holds for answer synset construction and correctness level assignment. We plan to test this hypothesis in the near future.

Although it is not impossible to include supporting document information during the process of answer synset construction and correctness level assignment, we use the lenient evaluation methodology [18] even though QAC2 officially uses the strict one. This is because (a) while providing a supporting passage is probably of practical importance, forcing a system to name a supporting document from a closed document collection may not be the best way to evaluate open-domain QA; and (b) the "lenient MRRs" of TSB-A and TSB-B are actually identical to the "strict" (i.e. official) ones. Thus, given a correct answer, finding a good supporting document is not difficult with ASKMi.
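Before turning to the calculation itself, note that the Table 2 annotations can be held in a very simple data structure: each question maps to a list of answer synsets, and each synset maps an acceptable answer string to its correctness level. The Python sketch below is only one possible encoding; the names and layout are ours, not part of the QAC2 data format.

```python
# One possible in-memory encoding of the Table 2 annotations (illustrative only).
# Each question id maps to a list of answer synsets AS(1)..AS(R); each synset
# maps an acceptable answer string to its correctness level "S", "A" or "B".

ANSWER_SYNSETS = {
    "QAC2-10001-01": [
        {"Masato Yoshii": "S", "Yoshii": "B"},
    ],
    "QAC2-10135-01": [
        {"Auguste Rodin": "S", "Rodin": "A"},
        {"Gustav Klimt": "S", "Klimt": "A"},
    ],
}

GAIN = {"S": 3, "A": 2, "B": 1}   # gain values used for Q-measure below

def lookup(question_id, answer_string):
    """Return (synset index, correctness level) for a returned answer, or None."""
    for i, synset in enumerate(ANSWER_SYNSETS[question_id]):
        if answer_string in synset:
            return i, synset[answer_string]
    return None

print(lookup("QAC2-10135-01", "Klimt"))   # -> (1, 'A')
```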

We are now ready to apply Q-measure to QA evaluation, by treating any ranked output of a QA system as a document retrieval output. Figures 2 and 3 show how to calculate Q-measure (and R-measure [13]) for QA evaluation. The first algorithm identifies the correct answer strings, but ignores duplicate answers. The second algorithm simply calculates Q-measure exactly as defined in [12, 13]. Hereafter we use gain(S) = 3, gain(A) = 2 and gain(B) = 1 as the gain values [12, 13]. (Ideally, the system output size L should be sufficiently large so that L >= R for all questions. However, we follow the TREC/NTCIR traditions and use L = 5.) Although Q-measure can properly handle NIL questions [13], we exclude QAC2-10198-01 and 10199-01 from our evaluation, following the official MRR calculation method used at QAC2.

/* initialize flag for each answer synset.
   The flags avoid marking multiple answers from
   the same answer synset. */
for( i=1; i<=R; i++ ) flag[i]=0;

r=1; /* system output rank */
while read o(r){ /* system's r-th answer */
  if( there exists a(i,j) s.t. o(r)==a(i,j) ){
    /* o(r) matches with a correct answer */
    if( o(r)=="NIL" ){
      /* special treatment of NIL */
      if( r==1 ){ /* i.e. NIL at Rank 1 */
        print o(r), x(i,j); /* marked as correct */
      }
      else{
        print o(r); /* NOT marked as correct */
      }
    }
    else{ /* not NIL */
      if( flag[i]==0 ){
        /* AS(i) is a NEW answer synset */
        print o(r), x(i,j); /* marked as correct */
        flag[i]=1;
      }
      else{ /* i.e. flag[i]==1 */
        print o(r);
        /* duplicate answer from AS(i);
           NOT marked as correct */
      }
    }
  }
  else{
    print o(r); /* NOT marked as correct */
  }
  r++; /* examine next rank */
}

Figure 2. Algorithm for marking a system output [13].

rmax=max(L,R); /* L: system output size */
               /* R: #answer synsets */

/* obtain cumulative gains for the IDEAL ranked output */
r=0; cig[0]=0;
for each X in (S,A,B) { /* X: correctness level */
  for( k=1; k<=R(X); k++ ){
    /* R(X): #answer synsets in which the
       highest correctness level is X. */
    r++;
    cig[r]=cig[r-1]+gain(X);
  }
}
for( r=R+1; r<=rmax; r++ ){ /* in case L>R */
  cig[r]=cig[R];
}

/* obtain cumulative bonused gains for
   the system output */
r=0; cbg[0]=0;
for( r=1; r<=L; r++ ){
  if( o(r) is marked with X ){
    cbg[r]=cbg[r-1]+gain(X)+1;
  }
  else{
    cbg[r]=cbg[r-1];
  }
}
for( r=L+1; r<=rmax; r++ ){ /* in case L<R */
  cbg[r]=cbg[r-1];
}

/* calculate Q-measure and R-measure */
sum=0;
for( r=1; r<=rmax; r++ ){
  if( cbg[r]>cbg[r-1] ){
    /* i.e. correct answer at Rank r */
    sum+=cbg[r]/(cig[r]+r);
  }
}
Q-measure=sum/R;
R-measure=cbg[R]/(cig[R]+R);

Figure 3. Algorithm for calculating Q-measure/R-measure [13].
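The two figures above transcribe almost line for line into Python. The function below is our own re-statement (a sketch, not the official QAC2 evaluation tool): it folds the marking step and the gain accumulation together, uses gain(S)=3, gain(A)=2, gain(B)=1, and omits the special NIL handling for brevity. As a check, it reproduces the Q-measure value of 0.667 computed for QAC2-10001-01 in Section 4.1.

```python
# Sketch of the Q-measure / R-measure computation of Figures 2 and 3.
# Answer data: a list of synsets, each mapping answer string -> level (S/A/B).
# NIL handling is omitted for brevity.

GAIN = {"S": 3, "A": 2, "B": 1}

def q_and_r_measure(system_output, synsets):
    R = len(synsets)
    L = len(system_output)
    rmax = max(L, R)

    # Cumulative gains of the ideal ranked output: one best answer per synset,
    # sorted S before A before B (this realises the R(X) loops of Figure 3).
    ideal = sorted((max(GAIN[lv] for lv in s.values()) for s in synsets), reverse=True)
    cig = [0]
    for g in ideal:
        cig.append(cig[-1] + g)
    while len(cig) <= rmax:
        cig.append(cig[-1])

    # Mark the system output (Figure 2) and accumulate bonused gains (Figure 3).
    used = [False] * R          # flags: at most one credit per answer synset
    cbg = [0]
    marked = []                 # gain of the answer at each rank, 0 if not correct
    for answer in system_output:
        g = 0
        for i, synset in enumerate(synsets):
            if answer in synset and not used[i]:
                used[i] = True
                g = GAIN[synset[answer]]
                break
        marked.append(g)
        cbg.append(cbg[-1] + (g + 1 if g > 0 else 0))
    while len(cbg) <= rmax:
        cbg.append(cbg[-1])

    q = sum(cbg[r] / (cig[r] + r) for r in range(1, L + 1) if marked[r - 1] > 0) / R
    r_measure = cbg[R] / (cig[R] + R)
    return q, r_measure

# QAC2-10001-01: one synset; the S-correct "Masato Yoshii" returned at Rank 3.
synsets = [{"Masato Yoshii": "S", "Yoshii": "B"}]
output = ["wrong1", "wrong2", "Masato Yoshii", "wrong3", "wrong4"]
print(q_and_r_measure(output, synsets))   # Q-measure = 4/(3+3) = 0.667
```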

4 Analysis

4.1 Formal runs

The Boolean AND constraints used for TSB-B were not at all successful: they improved only one question and hurt four questions. Although a question focus type of approach [1] may be useful for determining which of the query terms should be included in the Boolean expression, we have not found an effective way to do this. Hereafter, we analyse TSB-A only.

Table 3 provides a failure analysis for TSB-A, following the analysis method we used for the QAC1 additional questions in [11]. Answer type recognition is responsible for Rows (b) and (e), the Retriever is responsible for Row (c), and the Semantic Class Recognizer is responsible for Row (d). (Note that this classification is based on which module caused the first problem: for example, if both answer type recognition and retrieval were unsuccessful, the former is held responsible.) At the time of submission, we were fully aware that our answer type recognition rules and semantic class recognition rules were not as sophisticated as we wished them to be. As this is only our first QAC participation, we plan to do better in the next round of QAC by improving the rules.

Table 3. Failure analysis of TSB-A.

  #questions with a correct answer in top 5                 111
    First correct answer at Rank 1                           75
    First correct answer at Rank 2                           17
    First correct answer at Rank 3                            7
    First correct answer at Rank 4                            6
    First correct answer at Rank 5                            6
  #questions without a correct answer in top 5                84
    (a) Out of answer type taxonomy                            0
    (b) Failure to capture the correct answer type            36
    (c) Failure to retrieve a supporting document             11
    (d) Failure to extract a correct answer from
        a retrieved supporting document                       21
    (e) Failure to rank a correct answer within
        top 5 due to noise in answer type recognition          5
    (f) Others                                                11
        (f1) First correct answer within Ranks 6-10           (4)
        (f2) Other failures                                   (7)

Table 4. Q-measure values for the Official/Oracle runs.

  Run Name              MRR     Q-measure
  (a) TSB-B             0.440   0.391
  (b) TSB-A             0.454   0.396
  (c) TSB-A+OAT         0.536   0.463
  (d) TSB-A+OSD         0.567   0.513
  (e) TSB-A+OAT+OSD     0.678   0.591

Table 4(a) and (b) show the Q-measure values for TSB-A and TSB-B. (We will discuss (c)-(e) in Section 4.2.) Figure 4 shows the value of Q-measure minus Reciprocal Rank for each question with TSB-A, in order to illustrate the advantages of Q-measure. Thus, the dots above zero imply that Q-measure values are higher than Reciprocal Rank values, and the dots below zero imply the opposite. (Recall that Reciprocal Rank is either 0, 0.2, 0.25, 0.333, 0.5 or 1 given L = 5.) The six "empty" dots near the top/bottom of Figure 4 represent the six questions discussed below.

Figure 4. Q-measure minus Reciprocal Rank (TSB-A), plotted per Question ID.

The "empty" dots that are close to the top in Figure 4 represent QAC2-10001-01, 10001-03, 10017-01 and 10161-01: for all of these questions, the first correct answer was at Rank 3 and therefore Reciprocal Rank was 0.333, while Q-measure was as high as 0.667. This reflects the fact that the answer was S-correct, and that there was only one answer synset (R = 1). For example, for 10001-01, ASKMi returned the S-correct answer "Masato Yoshii" (See Table 2) at Rank 3, and therefore the cumulative gain sequence is (cg(1), cg(2), ...) = (0, 0, 3, 3, ...) and the cumulative bonused gain sequence is (cbg(1), cbg(2), ...) = (0, 0, 4, 4, ...) [12, 13], whereas the cumulative ideal gain sequence is (cig(1), cig(2), ...) = (3, 3, 3, 3, ...), as there is only one answer synset and it contains an S-correct answer string. Therefore, Q-measure = (4/(3 + 3))/1 = 0.667. Now, suppose that ASKMi had returned the B-correct answer "Yoshii" instead at Rank 3: in this case, (cg(1), cg(2), ...) = (0, 0, 1, 1, ...) and (cbg(1), cbg(2), ...) = (0, 0, 2, 2, ...). Therefore, Q-measure = (2/(3 + 3))/1 = 0.333. In this way, Q-measure can reflect the differences in answer correctness levels.

We now turn to the "empty" dots at the bottom of Figure 4. For QAC2-10124-01, Reciprocal Rank was 1 as ASKMi returned an S-correct answer at Rank 1, but Q-measure was only 0.143. This is because there are as many as seven answer synsets (See Table 2), each representing a satellite of Jupiter. As none of the other returned answers were correct, (cbg(1), cbg(2), ...) = (4, 4, 4, ...), whereas (cig(1), cig(2), ...) = (3, 6, 9, ...) as each answer synset contains an S-correct answer. Therefore, Q-measure = (4/(3 + 1))/7 = 1/7 = 0.143. For 10157-01, ASKMi returned an A-correct answer ("Lenin") at Rank 1 and another ("Former President Gorbachev") at Rank 4. Thus, Reciprocal Rank is 1 even though the question explicitly requests a list of 10 Russian politicians. In contrast, Q-measure is only 0.150: (cg(1), cg(2), ...) = (2, 2, 2, 4, 4, ...) and (cbg(1), cbg(2), ...) = (3, 3, 3, 6, 6, ...), whereas, as all of the correct answer strings for 10157-01 are only A-correct (See Table 2), (cig(1), cig(2), ...) = (2, 4, 6, 8, 10, ...). Therefore, Q-measure = ((3/(2 + 1)) + (6/(8 + 4)))/10 = 0.150. It is clear from these examples that Q-measure can integrate the "single-answer" task (Task 1) and the "list" task (Task 2) seamlessly. Thus, the abundance of dots in the lower region of Figure 4 represents the fact that Reciprocal Rank overestimates the system's performance when there are several possible answers.

4.2 Upperbound Performance

We have conducted some additional experiments using the QAC2 answer data for estimating the upperbound performance of ASKMi, namely, "oracle answer type" (OAT) and "oracle supporting documents" (OSD) (or oracle retriever [15]) experiments, as well as the combination of the two. OAT means that the ASKMi Question Analyzer is provided with one correct answer type for each question (even though there may in fact be more than one possibility), and OSD means that the ASKMi Retriever searches only the supporting documents listed in the QAC2 answer file. Table 4(c), (d) and (e) show the performance of our OAT run, our OSD run and the combination of the two, respectively. As the highest official MRR at QAC2 was 0.607, improving a single component of ASKMi (e.g. the Question Analyzer or the Retriever) is clearly not sufficient for catching up with the top performers. However, as the combination of the two "oracles" (TSB-A+OAT+OSD) is very successful, it is probably possible to achieve over 0.6 in MRR by improving all the components, including the Semantic Class Recognizer and the Answer Formulator. In fact, our MRR for the QAC1 formal question set is currently over 0.7.

5 Conclusions

Through our first participation in the QAC Japanese Question Answering task, we showed that our newly developed QA system, ASKMi, achieves respectable performance. By improving each of ASKMi's components, namely, the Semantic Class Recognizer, the Question Analyzer, the Retriever and the Answer Formulator, we hope to catch up with the top performers at QAC eventually. As all of our components are rule-based, semi-automatic acquisition of "human-readable" rules (as opposed to "black-box" classifiers) will be one of our future research topics. At the same time, we plan to tackle the problem of Encyclopaedic QA by improving the accuracy and contents of the Relation Extractor.

As a by-product of our experiments, it has become clear that Q-measure is a very useful QA evaluation metric. We would like to propose the use of Q-measure as an official evaluation metric at QAC, and the integration of Task 1 and Task 2 from the next round.

Acknowledgment

We would like to thank Toshihiko Manabe for his valuable suggestions regarding the development of ASKMi.

References

[1] Diekema, A. R. et al.: Question Answering: CNLP at the TREC-2002 Question Answering Track, TREC 2002 Proceedings (2002).
[2] Fukumoto, J., Kato, T. and Masui, F.: Question Answering Challenge (QAC-1): An Evaluation of Question Answering Tasks at the NTCIR Workshop 3, AAAI Spring Symposium: New Directions in Question Answering, pp. 122-133 (2003).


[3] Hermjakob, U.: Parsing and Question Classification for Question Answering, ACL 2001 Workshop on Open-Domain Question Answering (2001).
[4] Hovy, E. et al.: Using Knowledge to Facilitate Factoid Answer Pinpointing, COLING 2002 Proceedings (2002).
[5] Ichimura, Y. et al.: A Study of the Relations among Question Answering, Japanese Named Entity Extraction, and Named Entity Taxonomy (in Japanese), IPSJ SIG Notes, NL-161-3, to appear (2004).
[6] Prager, J. et al.: Question-Answering by Predictive Annotation, ACM SIGIR 2000 Proceedings, pp. 184-191 (2000).
[7] Sakai, T.: Japanese-English Cross-Language Information Retrieval Using Machine Translation and Pseudo-Relevance Feedback, IJCPOL, Vol. 14, No. 2, pp. 83-107 (2001).
[8] Sakai, T. et al.: Retrieval of Highly Relevant Documents based on Semantic Role Analysis (in Japanese), FIT 2002 Information Technology Letters, pp. 67-68 (2002).
[9] Sakai, T. et al.: BRIDJE over a Language Barrier: Cross-Language Information Access by Integrating Translation and Retrieval, IRAL 2003 Proceedings, pp. 65-76 (2003). http://acl.ldc.upenn.edu/W/W03/W03-1109.pdf
[10] Sakai, T. and Kokubu, T.: Evaluating Retrieval Performance for Japanese Question Answering: What Are Best Passages? ACM SIGIR 2003 Proceedings, pp. 429-430 (2003).
[11] Sakai, T. et al.: ASKMi: A Japanese Question Answering System based on Semantic Role Analysis, RIAO 2004 Proceedings, to appear (2004).
[12] Sakai, T. et al.: Toshiba BRIDJE at NTCIR-4 CLIR: Monolingual/Bilingual IR and Flexible Feedback, NTCIR-4 CLIR Working Notes, to appear (2004).
[13] Sakai, T.: New Performance Metrics based on Multigrade Relevance: Their Application to Question Answering, ACM SIGIR 2004 Proceedings, submitted (2004).
[14] Sparck Jones, K., Walker, S. and Robertson, S. E.: A Probabilistic Model of Information Retrieval: Development and Comparative Experiments, Information Processing and Management 36, pp. 779-808 (Part I) and pp. 809-840 (Part II) (2000).
[15] Tellex, S. et al.: Quantitative Evaluation of Passage Retrieval Algorithms for Question Answering, ACM SIGIR 2003 Proceedings (2003).
[16] Voorhees, E. M.: Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness, ACM SIGIR '98 Proceedings, pp. 315-323 (1998).
[17] Voorhees, E. M.: Building A Question Answering Test Collection, ACM SIGIR 2000 Proceedings, pp. 200-207 (2000).
[18] Voorhees, E. M.: Overview of the TREC 2002 Question Answering Track, TREC 2002 Proceedings (2002).