
Automated Historical Fact-Checking by Passage Retrieval, Word Statistics, and Virtual Question-Answering

Mio Kobayashi¹, Ai Ishii¹, Chikara Hoshino¹, Hiroshi Miyashita¹,

Takuya Matsuzaki²

¹Nihon Unisys Ltd., Japan
{mio.kobayashi, ai.ishii, chikara.hoshino, hiroshi.miyashita}@unisys.co.jp
²Nagoya University, Japan
[email protected]

Abstract

This paper presents a hybrid approach to the verification of statements about historical facts. The test data was collected from the world history examinations in a standardized achievement test for high school students. The data includes various kinds of false statements that were carefully written so as to deceive the students, while they can be disproven on the basis of the teaching materials. Our system predicts the truth or falsehood of a statement based on text search, word cooccurrence statistics, factoid-style question answering, and temporal relation recognition. These features contribute to the judgement complementarily and achieved state-of-the-art accuracy.

Context: ... During the period of the Carolingian ___ of ___, the Roman ___ preached that the religious cleansing of sins was necessary in order to achieve salvation after death. ...
Instruction: From (1)-(4) below, choose the one correct sentence concerning events during the 8th century, when the kingdom referred to in the underlined portion was established.
Choices:
(1) Pepin destroyed the Kingdom of the Lombards.
(2) Charlemagne repelled the Magyars.
(3) The reign of Emperor Taizong of Tang was called the Kaiyuan era.
(4) The reign of Harun al-Rashid began.

Figure 1: Example of a True-or-False question

1 Introduction

The proliferation of social media on the Internet has drastically changed the status of traditional journalism, which has been an indispensable building block of modern democracy. News is now produced, propagated, and consumed by people in quite a different way than twenty years ago (Pew Research Center, 2016). The downside is that fake news and hoaxes spread through the social network as quickly as news from trustworthy sources. A mechanism for fact-checking, i.e., finding a support or disproof of a claim in a credible information source, is thus needed as a new social infrastructure.

The sheer amount of the information flow, as well as the decentralized nature of social media, calls for support to fact-checking by information technology (Cohen et al., 2011a,b). Although its full automation seems to be beyond current technology (Vlachos and Riedel, 2014; Hassan et al., 2015), even its partial automation would greatly enhance the power of current fact-checking services.

As a step in this direction, we take up the automatic verification of statements about historical facts against credible information sources. The test statements are collected from the world history examinations in a standardized achievement test for high school students in Japan (the National Center Test for University Admissions, NCTUA). Approximately 60% of the NCTUA world history exam questions are "True-or-False" questions. A question in this format consists of a paragraph of text that provides the context of the question, an instruction sentence, and four choices (Fig. 1).¹ One has to choose a correct or incorrect statement from the four choices according to the instruction.

The test statements in the True-or-False questions are thoroughly tuned and checked by the examining board so that they are neither too easy nor too difficult for humans, and their truth or falsehood can be objectively determined on the basis of the teaching materials written according to the official curriculum guidelines.

¹ The questions are posed in Japanese but we use English in the examples for the sake of readability.

The automatic verification of these statements hence serves as an idealized but still difficult test-bed for basic fact-checking technologies.

In previous studies, three main approaches for answering True-or-False questions were presented: passage retrieval (Kano, 2014), conversion to factoid-style question answering (Kanayama et al., 2012), and textual entailment recognition (Tian and Miyao, 2014). In these approaches, the fact-checking task is directly converted to another, existing problem setting. We however show, through an analysis of past exams (§3), that the test statements in the True-or-False questions have compound characteristics. Thus, direct conversion alone does not suffice for solving this task satisfactorily, because each method is built on its own problem setting, which does not fully cover the variety of the test statements, especially the false ones. In this work, to overcome this difficulty, we attempt to decompose the problem according to our observations of past exams and design a solver that integrates the ideas behind the existing methods as the features of a statistical classifier (§4). Experimental results show that our decomposition of the task of historical fact-checking is successful in that the features work complementarily and the solver achieved state-of-the-art accuracy (§6). An analysis of the remaining errors indicates room for further improvement through the incorporation of linguistic and domain-specific knowledge into the system (§7).

Our essential contributions to this problem are as follows.

• Careful observations of past exams were conducted, based on which hypothetical characterizations of the task were formulated. Evidence was then collected to support these hypotheses.

• According to the observed evidence, five features that range over text search, statistics, and logical entailment were designed. They were combined as the features of a classifier and yielded state-of-the-art results on the task.

2 Related Work

Fact-checking can be framed as a question-answering (QA) task in a broad sense. However, it has not been studied as intensively as other QA tasks such as factoid-style question answering (Ravichandran and Hovy, 2002; Bian et al., 2008; Ferrucci, 2012). Kanayama et al. (2012) proposed to convert a fact-checking question into a set of factoid-style questions. In the conversion, the named entities in a test statement are in turn replaced with an empty slot. The answer, i.e., the most appropriate word that fills the slot, was obtained by an open-domain factoid QA system. They define a confidence score that decreases when the QA system's answer differs from the hidden named entity (i.e., the one replaced with the empty slot). They tested the idea by manually converting the test statements to factoid questions. We follow their idea in designing one of our features. We however fully automated the conversion and defined another confidence score based on a simple document retrieval system instead of a full-fledged factoid QA system (§4.2.2).

Textual entailment recognition (RTE) (Dagan et al., 2013) has been extensively studied in the field of language processing. RTE can be seen as a quite restricted form of fact-checking where two sentences t and h are given and a system judges whether t is an evidence of (i.e., it entails) h or not. Tian and Miyao (2014) showed the effectiveness of their logic-based RTE system on the True-or-False questions of NCTUA history exams cast in the form of RTE (i.e., a test statement and an evidence sentence are given to the system).

Recent efforts pursued a more realistic task setting for RTE, in which a system is given a sentence h and a large number of candidates {t_i} of its evidence that are drawn from a document collection in advance. The system tests the entailment relation between each of the t_i's and h (Bentivogli et al., 2010, 2011). In the RITE-VAL shared task at the NTCIR-2014 conference (Matsuyoshi et al., 2014), the participating RTE systems were evaluated both in the traditional RTE task setting and in one that fully integrates document retrieval and entailment recognition (i.e., only h and a document collection are given to the systems). The test sentences (i.e., the h's) included those taken from NCTUA world history exams, and hence the latter task setting is close to ours. The performance degradation between the two task settings was around 14% (absolute) for the best-performing system. This indicates the difficulty of our task setting of historical fact-checking.

In a series of recent papers, elementary science questions are used as a benchmark of AI systems (Khot et al., 2015; Jansen et al., 2016; Clark et al., 2016; Khashabi et al., 2016). The questions, all in the form of multiple-choice questions, were collected from 4th grade science tests. In the majority of the questions, the choices are nouns rather than sentences, as in the following example taken from (Khashabi et al., 2016):

Q. In New York State, the longest period of daylight occurs during which month?
(A) December (B) June (C) March (D) September

The majority of them can hence be regarded as factoid-style questions with hints (i.e., the answer is one of the four). Clark et al. (2016) and Khashabi et al. (2016) demonstrated that system performance was boosted by multi-step inference that combines, e.g., taxonomic knowledge ("N.Y. is in the northern hemisphere") and general laws ("The summer solstice in the northern hemisphere is in June"). A special kind of logical relation, temporal inclusion, is considered in our system (§4.3). The intention is however the detection of the falsity of a statement that is not in a temporal inclusion relation with an evidence sentence. The feature based on the conversion to factoid questions is also designed for the detection of falsehood by finding a counter-evidence. The different orientations, i.e., proof of a scientific fact vs. disproof of a historical non-fact, reflect the different natures of the problems.

3 Observation of the Task

We examined past True-or-False questions prior to implementing the solver. The observation targets were the NCTUA world history exams of 2005, 2007, 2009, 2011, and the 2013 supplementary exam. We used four sets of high school textbooks of world history and Wikipedia as knowledge resources.

From the observations, we formulated three hypotheses as follows. First, to verify most of the test statements (i.e., the choices), it is not necessary to gather several evidence sentences across different paragraphs in the resources; usually there is sufficient information in a local portion, such as a paragraph or a sentence, of a knowledge resource. This is a natural consequence of the fact that most of the test statements describe a single historical event. Second, the knowledge resources include more detailed time expressions than the questions, e.g., "1453" as compared to "15th century," and "1945" as compared to "1940s." Third, the falsity of many incorrect statements is attributed to a single named entity (NE) or time expression in them. For example, Choice (2) in Fig. 1, "Charlemagne repelled the Magyars.", is an incorrect statement created by changing "Avars" to "Magyars" in the correct sentence "Charlemagne repelled the Avars."

To support these hypotheses, we gathered evidence from past exams. First, we examined the correct statements (137 in total) to determine whether: (1) a single paragraph in the knowledge resources includes all the NEs in the statement and (2) the knowledge resources include a more detailed time expression than the statement. Table 1 shows that, in most cases, a single paragraph includes sufficient information to allow the solver to verify the truth of a statement. Among the 137 correct statements, 48 included a time expression. The knowledge resources provided a more detailed time for 45 out of these 48 statements (94%). It is thus important to resolve the level of detail of the time expressions. Next, we counted the incorrect statements (275 in total) which can be turned into a correct one by changing one NE or time expression in it. Table 2 shows that, to detect the falsity of an incorrect statement, detection of the changed NE is crucial.

Knowledge resource       Count (ratio)
Textbook                 111/137 (0.81)
Wikipedia                118/137 (0.86)
Textbook + Wikipedia     129/137 (0.94)

Table 1: Ratio of correct statements that can be evidenced by a single paragraph in the knowledge resource

                         Count (ratio)
NE change                163/275 (0.59)
Time change               47/275 (0.17)
NE or Time change        210/275 (0.76)

Table 2: Ratio of incorrect statements in which one named entity or time expression is the reason for the falsity

4 Features and their Combination

Based on the above observations, we designed the following five features to score the confidence of truth. This section describes them in turn and explains how they are combined as the features of a statistical classifier.

These features are defined using several statistics collected on a set of documents. We created three document sets, Ds, Dp, and Dm, all from Wikipedia and four high school textbooks of world history, as follows. We first segmented the Wikipedia pages and the textbooks in two ways: one into a set of sentences Ds and the other into a set of paragraphs Dp. Dm is the union of Ds and Dp.

4.1 Text Search Feature

The observations revealed that, in most cases, the NEs in a correct statement are fully included in one paragraph or one sentence of the knowledge resources. The text search feature of a statement S is defined as the number of documents in Dm which include all the NEs and content nouns in S. The solver expects that the greater the number is, the more likely the statement is true.

4.2 Statistical Features

The observations showed that an incorrect statement is created mainly by changing one NE in a correct sentence. To detect such a conversion, the solver estimates the strength of relatedness among the NEs in the statement. The solver uses two statistical features, which are respectively defined using global and local statistics collected on the document set.

4.2.1 Pointwise Mutual Information (PMI) Feature

The first statistical feature is PMI (Church and Hanks, 1989), which is defined for a pair of words as

    pmi(w_1, w_2) \equiv \log \frac{p(w_1, w_2 \mid D_s)}{p(w_1 \mid D_s)\, p(w_2 \mid D_s)},

where p(w_i | Ds) denotes the probability of observing the word w_i in a sentence that is randomly chosen from Ds, and p(w_1, w_2 | Ds) denotes the probability of observing both w_1 and w_2 in a randomly chosen sentence. A low PMI score indicates the independence of the two words. The solver expects that a low PMI score indicates that two NEs are not related to each other and suggests that the statement is incorrect.

The solver calculates PMI for each pair of an NE and either the subsequent NE or a content word that appears before the subsequent NE in the statement, because two words positioned close to each other tend to have a strong relation. We write WP(S) for the set of such pairs of words in S, excluding pairs of synonymous words. The final PMI feature of the statement S is defined as the average of pmi(w_i, w_j):

    pmi(S) = \frac{1}{|WP(S)|} \sum_{(w_i, w_j) \in WP(S)} pmi(w_i, w_j).
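A minimal sketch of the sentence-level PMI estimate follows. How zero cooccurrences are handled is not specified in the paper, so the floor value below is our assumption:

    import math
    from collections import Counter
    from itertools import combinations
    from typing import Iterable, List, Set, Tuple

    PMI_FLOOR = -20.0  # assumed score for pairs that never cooccur (not given in the paper)

    def count_occurrences(sentences: List[Set[str]]) -> Tuple[Counter, Counter]:
        """Per-sentence occurrence counts for single words and unordered word pairs."""
        uni, pair = Counter(), Counter()
        for terms in sentences:
            uni.update(terms)
            pair.update(frozenset(p) for p in combinations(sorted(terms), 2))
        return uni, pair

    def pmi(w1: str, w2: str, uni: Counter, pair: Counter, n: int) -> float:
        """pmi(w1, w2) = log p(w1, w2 | Ds) / (p(w1 | Ds) p(w2 | Ds))."""
        joint = pair[frozenset((w1, w2))]
        if joint == 0 or uni[w1] == 0 or uni[w2] == 0:
            return PMI_FLOOR
        # (joint/n) / ((uni[w1]/n) * (uni[w2]/n)) simplifies to joint*n / (uni[w1]*uni[w2]).
        return math.log(joint * n / (uni[w1] * uni[w2]))

    def pmi_feature(word_pairs: Iterable[Tuple[str, str]],
                    uni: Counter, pair: Counter, n: int) -> float:
        """Average PMI over the extracted pairs WP(S)."""
        scores = [pmi(a, b, uni, pair, n) for a, b in word_pairs]
        return sum(scores) / len(scores) if scores else 0.0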

4.2.2 Virtual Question Answering (VQA) Feature

The second feature is based on the result of a search that approximates the process of factoid QA. It directly attempts to detect the changed word in an incorrect statement (Fig. 2). The solver hides each NE w in a statement S in turn and makes a VQA query S−w, which is the set of words in S excluding w. If the statement is correct, the hidden word is expected to be found in the search results of the query S−w with high probability.

Figure 2: Calculation of the probability p(w | D(S−w)) of finding w in the search results D(S−w): for the test statement "Charlemagne repelled the Avars near Regensburg." with hidden NE w = "Avars", the query S−w = {Charlemagne, repelled, the, near, Regensburg} is sent to the search engine over the document collection; each retrieved document d_i is checked for whether "Avars" ∈ d_i, giving p(Avars | D(S−w)) = (1·ds(d_1, S−w) + 0·ds(d_2, S−w) + ⋯ + 1·ds(d_k, S−w)) / Z.

For a hidden NE w, we first calculate the ratio vqa(w) of the probabilities of finding w in the search results D(S−w) of the query S−w and in a randomly chosen document in the collection Dm:

    vqa(w) = \log \frac{p(w \mid D(S_{-w}))}{p(w \mid D_m)}.

In the numerator, we consider the probability of finding w in a document d ∈ D(S−w) that is sampled according to the confidence in the search result d for the query S−w, rather than assuming a uniform distribution on D(S−w):

    p(w \mid D(S_{-w})) = \sum_{d \in D(S_{-w})} [w \in d] \cdot p(d \mid D(S_{-w})),    (1)

where [w ∈ d] is the binary indicator function that takes the value 1 if d includes w, and 0 otherwise. The confidence factor p(d | D(S−w)) is assumed to be proportional to a document score ds(d, S−w), and we only consider the top-k search results. That is, letting d_i denote the document ranked i-th in the search results according to ds(d, S−w), we assume

    p(d_i \mid D(S_{-w})) = \begin{cases} Z^{-1} \cdot ds(d_i, S_{-w}) & (1 \le i \le k) \\ 0 & (k < i), \end{cases}    (2)

where Z = \sum_{i=1}^{k} ds(d_i, S_{-w}) is the normalization constant.
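A minimal sketch of this score, assuming the retrieval step has already returned the top-k documents together with their scores ds(d, S−w) (passed in here as (terms, score) pairs); in the actual system the retrieval is done with Solr:

    import math
    from typing import List, Set, Tuple

    def vqa_score(hidden: str,
                  results: List[Tuple[Set[str], float]],
                  p_w_dm: float) -> float:
        """vqa(w) = log p(w | D(S-w)) / p(w | Dm), with the numerator from
        Eqs. (1)-(2): documents weighted by their retrieval score ds(d, S-w)."""
        z = sum(score for _, score in results)  # normalization constant Z
        if z == 0.0 or p_w_dm == 0.0:
            return 0.0                          # degenerate case; handling is our assumption
        p_num = sum(score for terms, score in results if hidden in terms) / z
        if p_num == 0.0:
            return float("-inf")                # w never appears in the top-k results
        return math.log(p_num / p_w_dm)

    # Example mirroring Fig. 2: hide "avars" and query with the remaining words.
    top_k = [({"charlemagne", "avars"}, 3.2),
             ({"charlemagne", "magyars"}, 2.1),
             ({"avars", "regensburg"}, 1.5)]
    print(vqa_score("avars", top_k, p_w_dm=0.01))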

4.2.3 Length Feature

We also use the length of the statement (number of words) as a feature. The PMI and VQA features of a long statement tend to have low values regardless of the correctness of the statement. The length feature adjusts for this bias.

4.3 Time Feature (Logical Entailment)

The observations revealed that the level of detail of the time expressions in the choice sentences differs from that in the knowledge resources. Time information is a key factor of historical events. Therefore, the solver needs more rigorous inference about temporal relations than about the matching of other NEs. We implemented a module that logically determines the inclusion relation between two time expressions. The time expressions …
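Such an inclusion test can be phrased over normalized year intervals. The sketch below is ours, under the assumption that each time expression has first been normalized to an inclusive (first year, last year) interval, e.g., "15th century" → (1401, 1500); the paper's actual normalization and matching rules are not reproduced here:

    from typing import Tuple

    Interval = Tuple[int, int]  # (first year, last year), inclusive

    def century(n: int) -> Interval:
        """The n-th century as a year interval, e.g., century(15) = (1401, 1500)."""
        return (100 * (n - 1) + 1, 100 * n)

    def year(y: int) -> Interval:
        """A single year as a degenerate interval."""
        return (y, y)

    def includes(outer: Interval, inner: Interval) -> bool:
        """True iff the outer time expression temporally includes the inner one."""
        return outer[0] <= inner[0] and inner[1] <= outer[1]

    # "1453" is included in "15th century", so the two expressions are consistent.
    print(includes(century(15), year(1453)))   # -> True
    print(includes(century(15), year(1945)))   # -> False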
5.1 Custom Dictionary

We use a named entity dictionary and a synonym dictionary, both of which were manually compiled based on the textbooks and Wikipedia. The named entity dictionary was created mainly by using the indices of the textbooks. In the dictionary, approximately 20,000 NEs are categorized into various classes (time, person, etc.) by human experts. The synonym dictionary was created based on Wikipedia redirects and bracketed expressions after NEs (e.g., "Charlemagne (Charles I)"). Additionally, the solver uses Nihongo Goi-Taikei⁴, a Japanese thesaurus, to discriminate NEs from common nouns.

5.2 Retrieval Module

The retrieval module of the solver is based on Apache Solr⁵. We used the Solr defaults (TF-IDF-weighted cosine similarity) and the Kuromoji Japanese morphological analyzer⁶ to tokenize Japanese sentences. As mentioned in §4, all knowledge resources are indexed at the overlapping levels of the sentence and the paragraph, and retrieval is executed across fields of both levels by means of the ExtendedDisMax Query Parser⁷.

5.3 Matching of Words

When two words are compared in the solver, some suffixes are ignored to absorb orthographical variants (e.g., "Japan" and "Japanese" are considered to be the same). The suffix list is made from high-frequency morphemes (Okita and Liu, 2014). We examined the frequency of morphemes in the textbooks and then, going down from the most frequent, added a morpheme to the list if it is a suffix of an NE. Additionally, if a word w in a question has a synonym in a document retrieved from the knowledge resources, the word w is considered to be included in the document.

5.4 Complementing the Lack of Information

The truth or falsehood of a choice sentence is often indeterminable without the information provided in the context and the instruction. For instance, (1), (3), and (4) in Fig. 3 all describe historical facts, but the condition in the instruction, "in the 20th century," turns (1) and (4) false since they happened in the 19th century. Meanwhile, the context and the instruction also include irrelevant information that does not affect the truth of the choices. For instance, the underlined portion, "Iberian Peninsula," is redundant since the instruction asks more restrictively about "Spain." Furthermore, "the Middle Ages" in the context is not relevant to any of the choices.

Context: Coexistence between Christians and ___ was seen on the Iberian Peninsula in the Middle Ages, ...
Instruction: From (1)-(4) below, choose the most appropriate sentence that describes the history of Spain in the 20th century related to the underlined portion.
Choices:
(1) The French army suffered in guerrilla warfare in Spain.
(2) In the Spanish Civil War, Germany and Italy maintained a policy of non-intervention.
(3) Franco established a dictatorial regime.
(4) The Philippines were seized from it by the U.S.A.

Figure 3: Question 31 in the 2011 data set

The instruction tends to include a condition that applies to all choices, such as the location and the time of the historical events described in them. The context usually provides a relevant condition only when it is explicitly indicated so.

We utilize these observations, as well as the categories of the NEs, to extract only relevant keywords from the instruction and the context, as sketched below. First, location names and time expressions are extracted from the instruction if a choice includes no such phrases. The NEs in the underlined portion of the context are then extracted if the instruction includes none of the phrases in a pre-defined set of cue phrases, such as "related to," that indicate the context is not so relevant to the determination of the truth of the choices. Finally, among the NEs extracted from the context, we discard those categorized as an abstract concept in the NE dictionary, such as "social phenomena" and "social role." For choice (1) in Fig. 3, "the 20th century" in the instruction is extracted since (1) does not include a time expression, but "Iberian Peninsula" in the context is not extracted since the instruction refers to the underlined portion using the cue phrase "related to."
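A minimal sketch of these extraction rules, assuming the NE dictionary provides a category label for each NE; the cue-phrase list and category names below are placeholders, not the system's actual inventories:

    from typing import Dict, List, Set

    CUE_PHRASES = {"related to"}  # pre-defined cue phrases; the full list is not published
    ABSTRACT_CATEGORIES = {"social phenomena", "social role"}

    def relevant_keywords(choice_nes: List[str],
                          instruction: str,
                          instruction_nes: List[str],
                          underlined_nes: List[str],
                          category: Dict[str, str]) -> Set[str]:
        """Extract only the relevant keywords from the instruction/context (Sec. 5.4)."""
        keywords: Set[str] = set()
        # 1. Location/time phrases come from the instruction only when the
        #    choice itself contains no phrase of the same category.
        choice_cats = {category.get(ne) for ne in choice_nes}
        for ne in instruction_nes:
            if category.get(ne) in ("location", "time") and category.get(ne) not in choice_cats:
                keywords.add(ne)
        # 2. NEs in the underlined context portion are used unless the
        #    instruction contains a cue phrase such as "related to".
        if not any(cue in instruction for cue in CUE_PHRASES):
            keywords.update(underlined_nes)
        # 3. Abstract-concept NEs are discarded.
        return {ne for ne in keywords if category.get(ne) not in ABSTRACT_CATEGORIES}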

⁴ http://www.iwanami.co.jp/hotnews/GoiTaikei/
⁵ http://lucene.apache.org/solr/
⁶ http://www.atilika.org/
⁷ https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser

6 Experiments

6.1 Experimental Setup

We exhaustively extracted the True-or-False questions from past NCTUA world history exams held from 2005 to 2015 and evaluated the accuracy of the true-or-false (T/F) binary predictions individually made on each of the choice sentences⁸. The data was divided into two disjoint subsets, DEV and TEST. DEV consists of the questions used in the preliminary analysis described in §3. TEST consists of the rest of the questions. Table 3 provides the number of the test statements (i.e., the choice sentences) and the distributions of the correct and incorrect statements. Approximately 20% of the questions ask to choose a false statement among four choices, which include three correct statements. For reference, we also report the accuracy on the 4-way multiple-choice questions. The answer to a multiple-choice question is the choice on which the ensemble of the classifiers yielded the maximum or minimum score.

Dataset   # test statements   %correct   %incorrect
DEV       412                 33.3%      66.7%
TEST      1112                35.3%      64.7%

Table 3: Size of development and test data

For the evaluation, we adopted cross-validation. DEV and TEST were divided into 20 subsets, each of which was taken from the questions in the same exam. We applied 20-fold cross-validation on the subsets (sketched at the end of this section). We summarize the results on DEV and TEST respectively.

6.2 Experimental Results

To evaluate the importance of each feature, we tested two feature combination patterns. In the first pattern, the classifiers were trained excluding one feature (Table 4), and in the second the classifiers were trained with only one feature (Table 5). These results show that the VQA and Text search features are more important than the others. The highest T/F judgement accuracy, 80.8% on DEV and 74.2% on TEST, was obtained using all the features. On the DEV set, the ablation of one of the five features resulted in a loss of 0.2-6.8 points in the T/F judgement accuracy and 0.0-11.6 points in the 4-way multiple-choice accuracy (Table 4). This suggests that the combination of the features is more effective for comparing the confidence in the truth of four statements than for the T/F judgement on a single statement. Comparison of the results in Table 5 with 'All' in Table 4 further supports this; the effect of combining the five features, compared with the result of a single feature, is far more evident in the accuracy on the multiple-choice format. These results show that the five features worked complementarily, and they validate our decomposition of the task into the five features.

                DEV                      TEST
Features        Binary T/F   4-way      Binary T/F   4-way
All             80.8%        75.7%      74.2%        68.0%
-Text search    77.7%        66.0%      70.5%        60.1%
-PMI            79.6%        74.8%      73.7%        66.2%
-VQA            74.0%        64.1%      71.2%        60.8%
-Time           79.6%        75.7%      73.4%        66.9%
-Length         80.6%        75.7%      73.8%        67.3%

Table 4: Feature ablation study

                DEV                      TEST
Features        Binary T/F   4-way      Binary T/F   4-way
Text search     73.8%        58.3%      71.0%        52.5%
PMI             67.7%        53.4%      66.3%        45.7%
VQA             74.8%        58.3%      70.8%        53.2%
Time            68.2%        35.0%      65.3%        28.4%
Length          66.5%        33.0%      65.3%        23.7%

Table 5: Accuracy with only one feature

Table 6 presents a comparison with Kanayama et al. (2012)'s result based on the conversion of T/F judgement to factoid-style questions. The test questions were taken from the NCTUA 2007 exam. This comparison is not strict in a few regards. First, Kanayama et al. used English translations of the questions while we used the original questions in Japanese. Second, they manually supplied the choice sentences with necessary information extracted from the instruction and the context. Finally, they converted the choice sentences to factoid-style questions also manually, while our system is fully automated. The results of the VQA feature alone are slightly worse than Kanayama et al.'s, which is based on a similar idea. The addition of the other four features boosted the accuracy, and it surpassed their result.

System                        T/F binary acc.   4-way acc.
Kanayama et al. (2012)        79% (73/92)       65% (15/23)
This paper (VQA only)         73% (67/92)       52% (12/23)
This paper (all features)     84% (77/92)       83% (19/23)

Table 6: Comparison with manual question conversion on the NCTUA 2007 questions
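The evaluation protocol groups questions by exam before splitting. A minimal sketch with scikit-learn, assuming feature vectors X, binary labels y, and per-question exam identifiers; the paper refers to an ensemble of classifiers without specifying it in this section, so the single logistic-regression model below is our stand-in:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import LeaveOneGroupOut

    def cross_validated_tf(X: np.ndarray, y: np.ndarray, exam_ids: np.ndarray):
        """Leave-one-exam-out CV: with 20 exams, each of the 20 folds holds out one exam."""
        preds = np.empty_like(y)
        for train, test in LeaveOneGroupOut().split(X, y, groups=exam_ids):
            clf = LogisticRegression().fit(X[train], y[train])  # stand-in for the ensemble
            preds[test] = clf.predict(X[test])
        return preds

    def answer_multiple_choice(scores: np.ndarray, choose_correct: bool) -> int:
        """4-way answer: the choice with the maximum (or minimum) truth score."""
        return int(np.argmax(scores) if choose_correct else np.argmin(scores))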
⁸ Although the instruction sentences indicate either that (i) only one of the choices is correct ("choose the correct one") or that (ii) only one of them is incorrect ("choose the incorrect one"), our solver does not utilize this information in any form when it makes binary T/F predictions.

Finally, Table 7 presents the accuracy of our system in comparison with previously reported results of fully automatic systems (Shibuki et al., 2014, 2016). We compared the accuracy on the True-or-False questions in the 4-way multiple-choice format. Our system achieved a higher accuracy than the best previous results on the NCTUA 2007 and 2011 exams.

NCTUA 2007
System                     4-way acc.
This paper                 83% (19/23)
Okita and Liu (2014)       74% (17/23)
Kano (2014)                57% (13/23)
Sakamoto et al. (2014)     52% (12/23)

NCTUA 2011
System                     4-way acc.
This paper                 90% (18/20)
Kobayashi et al. (2016)    80% (16/20)
Takada et al. (2016)       60% (12/20)
Sakamoto et al. (2016)     60% (12/20)

Table 7: Comparison with previous automatic systems

7 Error Analysis

We now show some examples that cannot be solved by our current approach and describe the causes of the errors.

Antonym of Verbs (10 sentences) Some incorrect statements in the choices are created by changing a verb to its antonym. For example, in Question 8 in the 2009 exam, the falsity of the sentence "The Agricultural Adjustment Act (AAA) resulted in the prices of agricultural produce being lowered." is attributed to the verb "lowered," because the sentence becomes correct if we replace "lowered" with "raised." To properly recognize such false sentences, we need to utilize lexical knowledge about the antonymy and synonymy relations among verbs.

Semantic Roles (five sentences) Many historical events involve two or more participants. For example, in Question 3 in the 2007 exam, the sentence "The Almohad Caliphate, which advanced into the Iberian Peninsula, was overthrown by the Almoravid dynasty." includes the two participants "the Almohad Caliphate" and "the Almoravid dynasty." The sentence is incorrect because the truth was "The Almoravid dynasty was overthrown by the Almohad Caliphate." To detect this kind of falsity, we need to recognize the semantic roles (e.g., agent and patient) of the participants in the event denoted by the verb. This is beyond the expressiveness of the VQA and PMI features, which are largely based on word cooccurrence.

Indirect Description of Time (four sentences) In Question 15 in the 2007 exam, the instruction includes the following phrase: "choose the one term that correctly describes the religion that was established after the time of Genghis Khan." The current system cannot extract any temporal information from "after the time of Genghis Khan," which is equivalent to "after the death of Genghis Khan" and thus to "after 1227." To this end, we need to analyze the combination of the temporal expression (e.g., the time before/after/during X) and the entity type (e.g., X: Person) and utilize domain knowledge such as the birth/death or establishment/abolishment years of historical entities.

8 Conclusion and Future Work

As a step towards the goal of automated fact-checking, we have worked on the task of true-or-false judgement on statements about historical facts. We scrutinized the characteristics of the task and designed five confidence metrics according to the observations, which are integrated as the features of a statistical classifier. Experimental results showed that the five features complementarily contributed to the discrimination between true statements and false ones. Our system achieved state-of-the-art accuracy on several datasets. An analysis of the remaining errors indicated room for improvement through the incorporation of linguistic knowledge, such as the antonymy of verbs and the semantic roles of events, and the extraction of temporal information based on linguistic patterns and domain knowledge.

Acknowledgments

This research was supported by the Todai Robot Project at the National Institute of Informatics. We gratefully acknowledge Tokyo Shoseki Co., Ltd. and Yamakawa Shuppansha Ltd. for providing the textbook data.

References

Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2010. The sixth PASCAL recognizing textual entailment challenge. In TAC.

Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2011. The seventh PASCAL recognizing textual entailment challenge. In TAC.

Jiang Bian, Yandong Liu, Eugene Agichtein, and Hongyuan Zha. 2008. Finding the right facts in the crowd: Factoid question answering over social media. In WWW, pages 467–476.

K. Church and P. Hanks. 1989. Word association norms, mutual information, and lexicography. In ACL-89, pages 76–83.

Peter Clark, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter D. Turney, and Daniel Khashabi. 2016. Combining retrieval, statistics, and inference to answer elementary science questions. In AAAI-2016, pages 2580–2586.

Sarah Cohen, James T. Hamilton, and Fred Turner. 2011a. Computational journalism. Communications of the ACM, 54(10):66–71.

Sarah Cohen, Chengkai Li, Jun Yang, and Cong Yu. 2011b. Computational journalism: A call to arms to database researchers. In CIDR, pages 148–151.

Ido Dagan, Dan Roth, Mark Sammons, and Fabio Massimo Zanzotto. 2013. Recognizing Textual Entailment: Models and Applications. Morgan & Claypool Publishers.

Pedro Domingos. 2012. A few useful things to know about machine learning. Communications of the ACM, 55(10):78–87.

David A. Ferrucci. 2012. Introduction to "This is Watson". IBM Journal of Research and Development, 56(3.4):1:1–1:15.

Naeemul Hassan, Bill Adair, James T. Hamilton, Chengkai Li, Mark Tremayne, Jun Yang, and Cong Yu. 2015. The quest to automate fact-checking. In Proc. of the 2015 Computation+Journalism Symposium.

Peter Jansen, Niranjan Balasubramanian, Mihai Surdeanu, and Peter Clark. 2016. What's in an explanation? Characterizing knowledge and inference requirements for elementary science exams. In COLING, pages 2956–2965.

Hiroshi Kanayama, Yusuke Miyao, and John Prager. 2012. Answering yes/no questions via question inversion. In COLING-2012, pages 1377–1392.

Yoshinobu Kano. 2014. Solving history exam by keyword distribution: KJP. In NTCIR-11.

Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Peter Clark, Oren Etzioni, and Dan Roth. 2016. Question answering via integer programming over semi-structured knowledge. In IJCAI, pages 1145–1152.

Tushar Khot, Niranjan Balasubramanian, Eric Gribkoff, Ashish Sabharwal, Peter Clark, and Oren Etzioni. 2015. Exploring Markov logic networks for question answering. In EMNLP, pages 685–694.

Mio Kobayashi, Hiroshi Miyashita, Ai Ishii, and Chikara Hoshino. 2016. NUL system at QA Lab-2 task. In NTCIR-12.

Suguru Matsuyoshi, Yusuke Miyao, Tomohide Shibata, Chuan-Jie Lin, Cheng-Wei Shih, Yotaro Watanabe, and Teruko Mitamura. 2014. Overview of the NTCIR-11 recognizing inference in text and validation (RITE-VAL) task. In NTCIR-11.

Tsuyoshi Okita and Qun Liu. 2014. The question answering system of DCUMT in NTCIR-11 QA Lab. In NTCIR-11.

Pew Research Center. 2016. News use across social media platforms 2016.

Deepak Ravichandran and Eduard Hovy. 2002. Learning surface text patterns for a question answering system. In ACL-2002, pages 41–47.

Kotaro Sakamoto, Hyogo Matsui, Eisuke Matsunaga, Takahisa Jin, Hideyuki Shibuki, Tatsunori Mori, Madoka Ishioroshi, and Noriko Kando. 2014. Forst: Question answering system using basic element at NTCIR-11 QA-Lab task. In NTCIR-11.

Kotaro Sakamoto, Madoka Ishioroshi, Hyogo Matsui, Takahisa Jin, Fuyuki Wada, Shu Nakayama, Hideyuki Shibuki, Tatsunori Mori, and Noriko Kando. 2016. Forst: Question answering system for second-stage examinations at NTCIR-12 QA Lab-2 task. In NTCIR-12.

Hideyuki Shibuki, Kotaro Sakamoto, Yoshinobu Kano, Teruko Mitamura, Madoka Ishioroshi, Kelly Y. Itakura, Di Wang, Tatsunori Mori, and Noriko Kando. 2014. Overview of the NTCIR-11 QA-Lab task. In NTCIR-11.

Hideyuki Shibuki, Kotaro Sakamoto, Madoka Ishioroshi, Akira Fujita, Yoshinobu Kano, Teruko Mitamura, Tatsunori Mori, and Noriko Kando. 2016. Task overview for NTCIR-12 QA Lab-2. In NTCIR-12.

Takuma Takada, Takuya Imagawa, Takuya Matsuzaki, and Satoshi Sato. 2016. SML question-answering system for world history essay and multiple-choice exams at NTCIR-12 QA Lab-2. In NTCIR-12.

Ran Tian and Yusuke Miyao. 2014. Answering center-exam questions on history by textual inference. In Proceedings of the 28th Annual Conference of the Japanese Society for Artificial Intelligence.

Andreas Vlachos and Sebastian Riedel. 2014. Fact checking: Task definition and dataset construction. In Proc. ACL 2014 Workshop on Language Technologies and Computational Social Science, pages 18–22.
