
Automated Historical Fact-Checking by Passage Retrieval, Word Statistics, and Virtual Question-Answering

Mio Kobayashi¹, Ai Ishii¹, Chikara Hoshino¹, Hiroshi Miyashita¹,

Takuya Matsuzaki²

¹Nihon Unisys Ltd., Japan
{mio.kobayashi, ai.ishii, chikara.hoshino, hiroshi.miyashita}@unisys.co.jp
²Nagoya University, Japan
[email protected]

Abstract

This paper presents a hybrid approach to the verification of statements about historical facts. The test data was collected from the world history examinations in a standardized achievement test for high school students. The data includes various kinds of false statements that were carefully written so as to deceive the students, while they can be disproven on the basis of the teaching materials. Our system predicts the truth or falsehood of a statement based on text search, word cooccurrence statistics, factoid-style question answering, and temporal relation recognition. These features contribute to the judgement complementarily and achieved state-of-the-art accuracy.

Context: ... During the period of the Carolingian ___ of ___, the Roman ___ preached that the religious cleansing of sins was necessary in order to achieve salvation after death. ...
Instruction: From (1)-(4) below, choose the one correct sentence concerning events during the 8th century, when the kingdom referred to in the underlined portion was established.
Choices:
(1) Pepin destroyed the Kingdom of the Lombards.
(2) Charlemagne repelled the Magyars.
(3) The reign of Emperor Taizong of Tang was called the Kaiyuan era.
(4) The reign of Harun al-Rashid began.

Figure 1: Example of a True-or-False question

1 Introduction

The proliferation of social media on the Internet has drastically changed the status of traditional journalism, which has been an indispensable building block of modern democracy. News is now produced, propagated, and consumed by people in quite a different way than twenty years ago (Pew Research Center, 2016). The downside is that fake news and hoaxes spread through the social network as quickly as news from trustworthy sources. A mechanism for fact-checking, i.e., finding a support or disproof of a claim in a credible information source, is thus needed as a new social infrastructure.

The sheer amount of the information flow, as well as the decentralized nature of social media, calls for support to fact-checking by information technology (Cohen et al., 2011a,b). Although its full automation seems to be beyond current technology (Vlachos and Riedel, 2014; Hassan et al., 2015), even its partial automation would greatly enhance the power of current fact-checking services.

As a step in this direction, we take up the automatic verification of statements about historical facts against credible information sources. The test statements are collected from the world history examinations in a standardized achievement test for high school students in Japan (the National Center Test for University Admissions, NCTUA). Approximately 60% of the NCTUA world history exam questions are "True-or-False" questions. A question in this format consists of a paragraph of text that provides the context of the question, an instruction sentence, and four choices (Fig. 1).¹ One has to choose a correct or incorrect statement from the four choices according to the instruction.

The test statements in the True-or-False questions are thoroughly tuned and checked by the examining board so that they are neither too easy nor too difficult for humans, and their truth or falsehood can be objectively determined on the basis of the teaching materials written according to the official curriculum guidelines.

¹ The questions are posed in Japanese but we use English in the examples for the sake of readability.

The automatic verification of these statements hence serves as an idealized but still difficult test-bed for basic fact-checking technologies.

In previous studies, three main approaches for answering True-or-False questions were presented: passage retrieval (Kano, 2014), conversion to factoid-style question answering (Kanayama et al., 2012), and textual entailment recognition (Tian and Miyao, 2014). In these approaches, the fact-checking task is directly converted to another, existing problem setting. We however show, through an analysis of past exams (§3), that the test statements in the True-or-False questions have compound characteristics. Thus, direct conversion alone does not suffice for solving this task satisfactorily, because each method is built on its own problem setting, which does not fully cover the variety of the test statements, especially the false ones. In this work, to overcome this difficulty, we attempt to decompose the problem according to our observations of past exams and design a solver that integrates the ideas behind the existing methods as the features of a statistical classifier (§4). Experimental results show that our decomposition of the task of historical fact-checking is successful in that the features work complementarily and the solver achieved state-of-the-art accuracy (§6). An analysis of the remaining errors indicates room for further improvement through the incorporation of linguistic and domain-specific knowledge into the system (§7).

Our essential contributions to this problem are as follows.

• Careful observations of past exams were conducted, based on which hypothetical characterizations of the task were formulated. Evidence was then collected to support these hypotheses.

• According to the observed evidence, five features that range over text search, statistics, and logical entailment were designed. They were combined as the features of a classifier and yielded state-of-the-art results on the task.

2 Related Work

Fact-checking can be framed as a question-answering (QA) task in a broad sense. However, it has not been studied as intensively as other QA tasks such as factoid-style question answering (Ravichandran and Hovy, 2002; Bian et al., 2008; Ferrucci, 2012). Kanayama et al. (2012) proposed to convert a fact-checking question into a set of factoid-style questions. In the conversion, the named entities in a test statement are in turn replaced with an empty slot. The answer, i.e., the most appropriate word that fills the slot, was obtained by an open-domain factoid QA system. They define a confidence score that decreases when the QA system's answer differs from the hidden named entity (i.e., the one replaced with the empty slot). They tested the idea by manually converting the test statements to factoid questions. We follow their idea in designing one of our features. We however fully automated the conversion and defined another confidence score based on a simple document retrieval system instead of a full-fledged factoid QA system (§4.2.2).

Textual entailment recognition (RTE) (Dagan et al., 2013) has been extensively studied in the field of language processing. RTE can be seen as a quite restricted form of fact-checking where two sentences t and h are given and a system judges whether t is an evidence of (i.e., it entails) h or not. Tian and Miyao (2014) showed the effectiveness of their logic-based RTE system on the True-or-False questions of NCTUA history exams cast in the form of RTE (i.e., a test statement and an evidence sentence are given to the system).

Recent efforts pursued a more realistic task setting for RTE, in which a system is given a sentence h and a large number of candidates {t_i} of its evidence that are drawn from a document collection in advance. The system tests the entailment relation between each of the t_i's and h (Bentivogli et al., 2010, 2011). In the RITE-VAL shared task at the NTCIR-2014 conference (Matsuyoshi et al., 2014), the participating RTE systems were evaluated both in the traditional RTE task setting and in one that fully integrates document retrieval and entailment recognition (i.e., only h and a document collection are given to the systems). The test sentences (i.e., the h's) included those taken from NCTUA world history exams, and hence the latter task setting is close to ours. The performance degradation between the two task settings was around 14% (absolute) for the best-performing system. This indicates the difficulty of our task setting of historical fact-checking.

In a series of recent papers, elementary science questions are used as a benchmark of AI systems (Khot et al., 2015; Jansen et al., 2016; Clark et al., 2016; Khashabi et al., 2016). The questions, all in the form of multiple-choice questions, were collected from 4th grade science tests. In the majority of the questions, the choices are nouns rather than sentences, as in the following example taken from (Khashabi et al., 2016):

Q. In New York State, the longest period of daylight occurs during which month?
(A) December (B) June (C) March (D) September

The majority of them can hence be regarded as factoid-style questions with hints (i.e., the answer is one of the four). Clark et al. (2016) and Khashabi et al. (2016) demonstrated that system performance was boosted by multi-step inference that combines, e.g., taxonomic knowledge ("N.Y. is in the northern hemisphere") and general laws ("The summer solstice in the northern hemisphere is in June"). A special kind of logical relation, temporal inclusion, is considered in our system (§4.3). The intention is however the detection of the falsity of a statement that is not in a temporal inclusion relation with an evidence sentence. The feature based on the conversion to factoid questions is also designed for the detection of falsehood by finding a counter-evidence. The different orientations, i.e., proof of a scientific fact vs. disproof of a historical non-fact, reflect the different natures of the problems.

3 Observation of the Task

We examined past True-or-False questions prior to implementing the solver. The observation targets were the NCTUA world history exams of 2005, 2007, 2009, 2011, and the 2013 supplementary exam. We used four sets of high school textbooks of world history and Wikipedia as knowledge resources.

From the observations, we formulated three hypotheses as follows. First, to verify most of the test statements (i.e., the choices), it is not necessary to gather several evidence sentences across different paragraphs in the resources; usually there is sufficient information in a local portion, such as a paragraph or a sentence, of a knowledge resource. This is a natural consequence of the fact that most of the test statements describe a single historical event. Second, the knowledge resources include more detailed time expressions than the questions, e.g., "1453" as compared to "15th century," and "1945" as compared to "1940s." Third, the falsity of many incorrect statements is attributed to a single named entity (NE) or time expression in them. For example, Choice (2) in Fig. 1, "Charlemagne repelled the Magyars.", is an incorrect statement created by changing "Avars" to "Magyars" in the correct sentence "Charlemagne repelled the Avars."

To support these hypotheses, we gathered evidence from past exams. First, we examined the correct statements (137 in total) to determine whether: (1) a single paragraph in the knowledge resources includes all the NEs in the statement and (2) the knowledge resources include a more detailed time expression than the statement. Table 1 shows that, in most cases, a single paragraph includes sufficient information to allow the solver to verify the truth of a statement. Among the 137 correct statements, 48 included a time expression. The knowledge resources provided a more detailed time for 45 out of these 48 statements (94%). It is thus important to resolve the level of detail of the time expressions. Next, we counted the incorrect statements (275 in total) which can be turned into a correct one by changing one NE or time expression in it. Table 2 shows that, to detect the falsity of an incorrect statement, detection of the changed NE is crucial.

Knowledge resource       Count (ratio)
Textbook                 111/137 (0.81)
Wikipedia                118/137 (0.86)
Textbook + Wikipedia     129/137 (0.94)

Table 1: Ratio of correct statements that can be evidenced by a single paragraph in the knowledge resource

                         Count (ratio)
NE change                163/275 (0.59)
Time change               47/275 (0.17)
NE or Time change        210/275 (0.76)

Table 2: Ratio of incorrect statements in which one named entity or time expression is the reason for the falsity

4 Features and their Combination

Based on the above observations, we designed the following five features to score the confidence of truth. This section describes them in turn and explains how they are combined as the features of a statistical classifier.

These features are defined using several statistics collected on a set of documents. We created three document sets, Ds, Dp, and Dm, all from Wikipedia and four high school textbooks of world history, as follows. We first segmented the Wikipedia pages and the textbooks in two ways: one into a set of sentences Ds and the other into a set of paragraphs Dp. Dm is the union of Ds and Dp.

4.1 Text Search Feature

The observations revealed that, in most cases, the NEs in a correct statement are fully included in one paragraph or one sentence of the knowledge resources. The text search feature of a statement S is defined as the number of documents in Dm which include all the NEs and content nouns in S. The solver expects that the greater the number is, the more likely the statement is true.

4.2 Statistical Features

The observations showed that an incorrect statement is created mainly by changing one NE in a correct sentence. To detect such a conversion, the solver estimates the strength of relatedness among the NEs in the statement. The solver uses two statistical features, which are respectively defined using global and local statistics collected on the document set.

4.2.1 Pointwise Mutual Information (PMI) Feature

The first statistical feature is PMI (Church and Hanks, 1989), which is defined for a pair of words as

    pmi(w_1, w_2) \equiv \log \frac{p(w_1, w_2 \mid D_s)}{p(w_1 \mid D_s)\, p(w_2 \mid D_s)},

where p(w_i | Ds) denotes the probability of observing the word w_i in a sentence that is randomly chosen from Ds, and p(w_1, w_2 | Ds) denotes the probability of observing both w_1 and w_2 in a randomly chosen sentence. A low PMI score indicates the independence of the two words. The solver expects that a low PMI score indicates that two NEs are not related to each other and suggests that the statement is incorrect.

The solver calculates PMI for each pair of an NE and either the subsequent NE or a content word that appears before the subsequent NE in the statement, because two words positioned close to each other tend to have a strong relation. We write WP(S) for the set of such pairs of words in S, excluding pairs of synonymous words. The final PMI feature of the statement S is defined as the average of pmi(w_i, w_j):

    pmi(S) = \frac{1}{|WP(S)|} \sum_{(w_i, w_j) \in WP(S)} pmi(w_i, w_j).
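A minimal sketch of the sentence-level PMI estimate follows. How zero cooccurrences are handled is not specified in the paper, so the floor value below is our assumption:

    import math
    from collections import Counter
    from itertools import combinations
    from typing import Iterable, List, Set, Tuple

    PMI_FLOOR = -20.0  # assumed score for pairs that never cooccur (not given in the paper)

    def count_occurrences(sentences: List[Set[str]]) -> Tuple[Counter, Counter]:
        """Per-sentence occurrence counts for single words and unordered word pairs."""
        uni, pair = Counter(), Counter()
        for terms in sentences:
            uni.update(terms)
            pair.update(frozenset(p) for p in combinations(sorted(terms), 2))
        return uni, pair

    def pmi(w1: str, w2: str, uni: Counter, pair: Counter, n: int) -> float:
        """pmi(w1, w2) = log p(w1, w2 | Ds) / (p(w1 | Ds) p(w2 | Ds))."""
        joint = pair[frozenset((w1, w2))]
        if joint == 0 or uni[w1] == 0 or uni[w2] == 0:
            return PMI_FLOOR
        # (joint/n) / ((uni[w1]/n) * (uni[w2]/n)) simplifies to joint*n / (uni[w1]*uni[w2]).
        return math.log(joint * n / (uni[w1] * uni[w2]))

    def pmi_feature(word_pairs: Iterable[Tuple[str, str]],
                    uni: Counter, pair: Counter, n: int) -> float:
        """Average PMI over the extracted pairs WP(S)."""
        scores = [pmi(a, b, uni, pair, n) for a, b in word_pairs]
        return sum(scores) / len(scores) if scores else 0.0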

4.2.2 Virtual Question Answering (VQA) Feature

The second feature is based on the result of a search that approximates the process of factoid QA. It directly attempts to detect the changed word in an incorrect statement (Fig. 2). The solver hides each NE w in a statement S in turn and makes a VQA query S−w, which is the set of words in S excluding w. If the statement is correct, the hidden word is expected to be found in the search results of the query S−w with high probability.

Figure 2: Calculation of the probability p(w | D(S−w)) of finding w in the search results D(S−w): for the test statement "Charlemagne repelled the Avars near Regensburg." with hidden NE w = "Avars", the query S−w = {Charlemagne, repelled, the, near, Regensburg} is sent to the search engine over the document collection; each retrieved document d_i is checked for whether "Avars" ∈ d_i, giving p(Avars | D(S−w)) = (1·ds(d_1, S−w) + 0·ds(d_2, S−w) + ⋯ + 1·ds(d_k, S−w)) / Z.

For a hidden NE w, we first calculate the ratio vqa(w) of the probabilities of finding w in the search results D(S−w) of the query S−w and in a randomly chosen document in the collection Dm:

    vqa(w) = \log \frac{p(w \mid D(S_{-w}))}{p(w \mid D_m)}.

In the numerator, we consider the probability of finding w in a document d ∈ D(S−w) that is sampled according to the confidence in the search result d for the query S−w, rather than assuming a uniform distribution on D(S−w):

    p(w \mid D(S_{-w})) = \sum_{d \in D(S_{-w})} [w \in d] \cdot p(d \mid D(S_{-w})),    (1)

where [w ∈ d] is the binary indicator function that takes the value 1 if d includes w, and 0 otherwise. The confidence factor p(d | D(S−w)) is assumed to be proportional to a document score ds(d, S−w), and we only consider the top-k search results. That is, letting d_i denote the document ranked i-th in the search results according to ds(d, S−w), we assume

    p(d_i \mid D(S_{-w})) = \begin{cases} Z^{-1} \cdot ds(d_i, S_{-w}) & (1 \le i \le k) \\ 0 & (k < i), \end{cases}    (2)

where Z = \sum_{i=1}^{k} ds(d_i, S_{-w}) is the normalization constant.
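A minimal sketch of this score, assuming the retrieval step has already returned the top-k documents together with their scores ds(d, S−w) (passed in here as (terms, score) pairs); in the actual system the retrieval is done with Solr:

    import math
    from typing import List, Set, Tuple

    def vqa_score(hidden: str,
                  results: List[Tuple[Set[str], float]],
                  p_w_dm: float) -> float:
        """vqa(w) = log p(w | D(S-w)) / p(w | Dm), with the numerator from
        Eqs. (1)-(2): documents weighted by their retrieval score ds(d, S-w)."""
        z = sum(score for _, score in results)  # normalization constant Z
        if z == 0.0 or p_w_dm == 0.0:
            return 0.0                          # degenerate case; handling is our assumption
        p_num = sum(score for terms, score in results if hidden in terms) / z
        if p_num == 0.0:
            return float("-inf")                # w never appears in the top-k results
        return math.log(p_num / p_w_dm)

    # Example mirroring Fig. 2: hide "avars" and query with the remaining words.
    top_k = [({"charlemagne", "avars"}, 3.2),
             ({"charlemagne", "magyars"}, 2.1),
             ({"avars", "regensburg"}, 1.5)]
    print(vqa_score("avars", top_k, p_w_dm=0.01))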

4.2.3 Length Feature

We also use the length of the statement (number of words) as a feature. The PMI and VQA features of a long statement tend to have low values regardless of the correctness of the statement. The length feature adjusts for this bias.

4.3 Time Feature (Logical Entailment)

The observations revealed that the level of detail of the time expressions in the choice sentences differs from that in the knowledge resources. Time information is a key factor of historical events. Therefore, the solver needs more rigorous inference about temporal relations than about the matching of other NEs. We implemented a module that logically determines the inclusion relation between two time expressions. The time expressions …
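Such an inclusion test can be phrased over normalized year intervals. The sketch below is ours, under the assumption that each time expression has first been normalized to an inclusive (first year, last year) interval, e.g., "15th century" → (1401, 1500); the paper's actual normalization and matching rules are not reproduced here:

    from typing import Tuple

    Interval = Tuple[int, int]  # (first year, last year), inclusive

    def century(n: int) -> Interval:
        """The n-th century as a year interval, e.g., century(15) = (1401, 1500)."""
        return (100 * (n - 1) + 1, 100 * n)

    def year(y: int) -> Interval:
        """A single year as a degenerate interval."""
        return (y, y)

    def includes(outer: Interval, inner: Interval) -> bool:
        """True iff the outer time expression temporally includes the inner one."""
        return outer[0] <= inner[0] and inner[1] <= outer[1]

    # "1453" is included in "15th century", so the two expressions are consistent.
    print(includes(century(15), year(1453)))   # -> True
    print(includes(century(15), year(1945)))   # -> False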
5.1 Custom Dictionary

We use a named entity dictionary and a synonym dictionary, both of which were manually compiled based on the textbooks and Wikipedia. The named entity dictionary was created mainly by using the indices of the textbooks. In the dictionary, approximately 20,000 NEs are categorized into various classes (time, person, etc.) by human experts. The synonym dictionary was created based on Wikipedia redirects and bracketed expressions after NEs (e.g., "Charlemagne (Charles I)"). Additionally, the solver uses Nihongo Goi-Taikei⁴, a Japanese thesaurus, to discriminate NEs from common nouns.

5.2 Retrieval Module

The retrieval module of the solver is based on Apache Solr⁵. We used the Solr defaults (TF-IDF-weighted cosine similarity) and the Kuromoji Japanese morphological analyzer⁶ to tokenize Japanese sentences. As mentioned in §4, all knowledge resources are indexed at the overlapping levels of the sentence and the paragraph, and retrieval is executed across fields of both levels by means of the ExtendedDisMax Query Parser⁷.

5.3 Matching of Words

When two words are compared in the solver, some suffixes are ignored to absorb orthographical variants (e.g., "Japan" and "Japanese" are considered to be the same). The suffix list is made from high-frequency morphemes (Okita and Liu, 2014). We examined the frequency of morphemes in the textbooks and then, going down from the most frequent, added a morpheme to the list if it is a suffix of an NE. Additionally, if a word w in a question has a synonym in a document retrieved from the knowledge resources, the word w is considered to be included in the document.

5.4 Complementing the Lack of Information

The truth or falsehood of a choice sentence is often indeterminable without the information provided in the context and the instruction. For instance, (1), (3), and (4) in Fig. 3 all describe historical facts, but the condition in the instruction, "in the 20th century," turns (1) and (4) false since they happened in the 19th century. Meanwhile, the context and the instruction also include irrelevant information that does not affect the truth of the choices. For instance, the underlined portion, "Iberian Peninsula," is redundant since the instruction asks more restrictively about "Spain." Furthermore, "the Middle Ages" in the context is not relevant to any of the choices.

Context: Coexistence between Christians and ___ was seen on the Iberian Peninsula in the Middle Ages, ...
Instruction: From (1)-(4) below, choose the most appropriate sentence that describes the history of Spain in the 20th century related to the underlined portion.
Choices:
(1) The French army suffered in guerrilla warfare in Spain.
(2) In the Spanish Civil War, Germany and Italy maintained a policy of non-intervention.
(3) Franco established a dictatorial regime.
(4) The Philippines were seized from it by the U.S.A.

Figure 3: Question 31 in the 2011 data set

The instruction tends to include a condition that applies to all choices, such as the location and the time of the historical events described in them. The context usually provides a relevant condition only when it is explicitly indicated so.

We utilize these observations, as well as the categories of the NEs, to extract only relevant keywords from the instruction and the context, as sketched below. First, location names and time expressions are extracted from the instruction if a choice includes no such phrases. The NEs in the underlined portion of the context are then extracted if the instruction includes none of the phrases in a pre-defined set of cue phrases, such as "related to," that indicate the context is not so relevant to the determination of the truth of the choices. Finally, among the NEs extracted from the context, we discard those categorized as an abstract concept in the NE dictionary, such as "social phenomena" and "social role." For choice (1) in Fig. 3, "the 20th century" in the instruction is extracted since (1) does not include a time expression, but "Iberian Peninsula" in the context is not extracted since the instruction refers to the underlined portion using the cue phrase "related to."
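A minimal sketch of these extraction rules, assuming the NE dictionary provides a category label for each NE; the cue-phrase list and category names below are placeholders, not the system's actual inventories:

    from typing import Dict, List, Set

    CUE_PHRASES = {"related to"}  # pre-defined cue phrases; the full list is not published
    ABSTRACT_CATEGORIES = {"social phenomena", "social role"}

    def relevant_keywords(choice_nes: List[str],
                          instruction: str,
                          instruction_nes: List[str],
                          underlined_nes: List[str],
                          category: Dict[str, str]) -> Set[str]:
        """Extract only the relevant keywords from the instruction/context (Sec. 5.4)."""
        keywords: Set[str] = set()
        # 1. Location/time phrases come from the instruction only when the
        #    choice itself contains no phrase of the same category.
        choice_cats = {category.get(ne) for ne in choice_nes}
        for ne in instruction_nes:
            if category.get(ne) in ("location", "time") and category.get(ne) not in choice_cats:
                keywords.add(ne)
        # 2. NEs in the underlined context portion are used unless the
        #    instruction contains a cue phrase such as "related to".
        if not any(cue in instruction for cue in CUE_PHRASES):
            keywords.update(underlined_nes)
        # 3. Abstract-concept NEs are discarded.
        return {ne for ne in keywords if category.get(ne) not in ABSTRACT_CATEGORIES}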

⁴ http://www.iwanami.co.jp/hotnews/GoiTaikei/
⁵ http://lucene.apache.org/solr/
⁶ http://www.atilika.org/
⁷ https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser

6 Experiments

6.1 Experimental Setup

We exhaustively extracted the True-or-False questions from past NCTUA world history exams held from 2005 to 2015 and evaluated the accuracy of the true-or-false (T/F) binary predictions individually made on each of the choice sentences⁸. The data was divided into two disjoint subsets, DEV and TEST. DEV consists of the questions used in the preliminary analysis described in §3. TEST consists of the rest of the questions. Table 3 provides the number of the test statements (i.e., the choice sentences) and the distributions of the correct and incorrect statements. Approximately 20% of the questions ask to choose a false statement among four choices, which include three correct statements. For reference, we also report the accuracy on the 4-way multiple-choice questions. The answer to a multiple-choice question is the choice on which the ensemble of the classifiers yielded the maximum or minimum score.

Dataset   # test statements   %correct   %incorrect
DEV       412                 33.3%      66.7%
TEST      1112                35.3%      64.7%

Table 3: Size of development and test data

For the evaluation, we adopted cross-validation. DEV and TEST were divided into 20 subsets, each of which was taken from the questions in the same exam. We applied 20-fold cross-validation on the subsets (sketched at the end of this section). We summarize the results on DEV and TEST respectively.

6.2 Experimental Results

To evaluate the importance of each feature, we tested two feature combination patterns. In the first pattern, the classifiers were trained excluding one feature (Table 4), and in the second the classifiers were trained with only one feature (Table 5). These results show that the VQA and Text search features are more important than the others. The highest T/F judgement accuracy, 80.8% on DEV and 74.2% on TEST, was obtained using all the features. On the DEV set, the ablation of one of the five features resulted in a loss of 0.2-6.8 points in the T/F judgement accuracy and 0.0-11.6 points in the 4-way multiple-choice accuracy (Table 4). This suggests that the combination of the features is more effective for comparing the confidence in the truth of four statements than for the T/F judgement on a single statement. Comparison of the results in Table 5 with 'All' in Table 4 further supports this; the effect of combining the five features, compared with the result of a single feature, is far more evident in the accuracy on the multiple-choice format. These results show that the five features worked complementarily, and they validate our decomposition of the task into the five features.

                DEV                      TEST
Features        Binary T/F   4-way      Binary T/F   4-way
All             80.8%        75.7%      74.2%        68.0%
-Text search    77.7%        66.0%      70.5%        60.1%
-PMI            79.6%        74.8%      73.7%        66.2%
-VQA            74.0%        64.1%      71.2%        60.8%
-Time           79.6%        75.7%      73.4%        66.9%
-Length         80.6%        75.7%      73.8%        67.3%

Table 4: Feature ablation study

                DEV                      TEST
Features        Binary T/F   4-way      Binary T/F   4-way
Text search     73.8%        58.3%      71.0%        52.5%
PMI             67.7%        53.4%      66.3%        45.7%
VQA             74.8%        58.3%      70.8%        53.2%
Time            68.2%        35.0%      65.3%        28.4%
Length          66.5%        33.0%      65.3%        23.7%

Table 5: Accuracy with only one feature

Table 6 presents a comparison with Kanayama et al. (2012)'s result based on the conversion of T/F judgement to factoid-style questions. The test questions were taken from the NCTUA 2007 exam. This comparison is not strict in a few regards. First, Kanayama et al. used English translations of the questions while we used the original questions in Japanese. Second, they manually supplied the choice sentences with necessary information extracted from the instruction and the context. Finally, they converted the choice sentences to factoid-style questions also manually, while our system is fully automated. The results of the VQA feature alone are slightly worse than Kanayama et al.'s, which is based on a similar idea. The addition of the other four features boosted the accuracy, and it surpassed their result.

System                        T/F binary acc.   4-way acc.
Kanayama et al. (2012)        79% (73/92)       65% (15/23)
This paper (VQA only)         73% (67/92)       52% (12/23)
This paper (all features)     84% (77/92)       83% (19/23)

Table 6: Comparison with manual question conversion on the NCTUA 2007 questions
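The evaluation protocol groups questions by exam before splitting. A minimal sketch with scikit-learn, assuming feature vectors X, binary labels y, and per-question exam identifiers; the paper refers to an ensemble of classifiers without specifying it in this section, so the single logistic-regression model below is our stand-in:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import LeaveOneGroupOut

    def cross_validated_tf(X: np.ndarray, y: np.ndarray, exam_ids: np.ndarray):
        """Leave-one-exam-out CV: with 20 exams, each of the 20 folds holds out one exam."""
        preds = np.empty_like(y)
        for train, test in LeaveOneGroupOut().split(X, y, groups=exam_ids):
            clf = LogisticRegression().fit(X[train], y[train])  # stand-in for the ensemble
            preds[test] = clf.predict(X[test])
        return preds

    def answer_multiple_choice(scores: np.ndarray, choose_correct: bool) -> int:
        """4-way answer: the choice with the maximum (or minimum) truth score."""
        return int(np.argmax(scores) if choose_correct else np.argmin(scores))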
⁸ Although the instruction sentences indicate either that (i) only one of the choices is correct ("choose the correct one") or that (ii) only one of them is incorrect ("choose the incorrect one"), our solver does not utilize this information in any form when it makes binary T/F predictions.

Finally, Table 7 presents the accuracy of our system in comparison with previously reported results of fully automatic systems (Shibuki et al., 2014, 2016). We compared the accuracy on the True-or-False questions in the 4-way multiple-choice format. Our system achieved a higher accuracy than the best previous results on the NCTUA 2007 and 2011 exams.

NCTUA 2007
System                     4-way acc.
This paper                 83% (19/23)
Okita and Liu (2014)       74% (17/23)
Kano (2014)                57% (13/23)
Sakamoto et al. (2014)     52% (12/23)

NCTUA 2011
System                     4-way acc.
This paper                 90% (18/20)
Kobayashi et al. (2016)    80% (16/20)
Takada et al. (2016)       60% (12/20)
Sakamoto et al. (2016)     60% (12/20)

Table 7: Comparison with previous automatic systems

7 Error Analysis

We now show some examples that cannot be solved by our current approach and describe the causes of the errors.

Antonym of Verbs (10 sentences) Some incorrect statements in the choices are created by changing a verb to its antonym. For example, in Question 8 in the 2009 exam, the falsity of the sentence "The Agricultural Adjustment Act (AAA) resulted in the prices of agricultural produce being lowered." is attributed to the verb "lowered," because the sentence becomes correct if we replace "lowered" with "raised." To properly recognize such false sentences, we need to utilize lexical knowledge about the antonymy and synonymy relations among verbs.

Semantic Roles (five sentences) Many historical events involve two or more participants. For example, in Question 3 in the 2007 exam, the sentence "The Almohad Caliphate, which advanced into the Iberian Peninsula, was overthrown by the Almoravid dynasty." includes the two participants "the Almohad Caliphate" and "the Almoravid dynasty." The sentence is incorrect because the truth was "The Almoravid dynasty was overthrown by the Almohad Caliphate." To detect this kind of falsity, we need to recognize the semantic roles (e.g., agent and patient) of the participants in the event denoted by the verb. This is beyond the expressiveness of the VQA and PMI features, which are largely based on word cooccurrence.

Indirect Description of Time (four sentences) In Question 15 in the 2007 exam, the instruction includes the following phrase: "choose the one term that correctly describes the religion that was established after the time of Genghis Khan." The current system cannot extract any temporal information from "after the time of Genghis Khan," which is equivalent to "after the death of Genghis Khan" and thus to "after 1227." To this end, we need to analyze the combination of the temporal expression (e.g., the time before/after/during X) and the entity type (e.g., X: Person) and utilize domain knowledge such as the birth/death or establishment/abolishment years of historical entities.

8 Conclusion and Future Work

As a step towards the goal of automated fact-checking, we have worked on the task of true-or-false judgement on statements about historical facts. We scrutinized the characteristics of the task and designed five confidence metrics according to the observations, which are integrated as the features of a statistical classifier. Experimental results showed that the five features complementarily contributed to the discrimination between true statements and false ones. Our system achieved state-of-the-art accuracy on several datasets. An analysis of the remaining errors indicated room for improvement through the incorporation of linguistic knowledge, such as the antonymy of verbs and the semantic roles of events, and the extraction of temporal information based on linguistic patterns and domain knowledge.

Acknowledgments

This research was supported by the Todai Robot Project at the National Institute of Informatics. We gratefully acknowledge Tokyo Shoseki Co., Ltd. and Yamakawa Shuppansha Ltd. for providing the textbook data.

References

Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2010. The sixth PASCAL recognizing textual entailment challenge. In TAC.

Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2011. The seventh PASCAL recognizing textual entailment challenge. In TAC.

Jiang Bian, Yandong Liu, Eugene Agichtein, and Hongyuan Zha. 2008. Finding the right facts in the crowd: Factoid question answering over social media. In WWW, pages 467–476.

K. Church and P. Hanks. 1989. Word association norms, mutual information, and lexicography. In ACL-89, pages 76–83.

Peter Clark, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter D. Turney, and Daniel Khashabi. 2016. Combining retrieval, statistics, and inference to answer elementary science questions. In AAAI-2016, pages 2580–2586.

Sarah Cohen, James T. Hamilton, and Fred Turner. 2011a. Computational journalism. Communications of the ACM, 54(10):66–71.

Sarah Cohen, Chengkai Li, Jun Yang, and Cong Yu. 2011b. Computational journalism: A call to arms to database researchers. In CIDR, pages 148–151.

Ido Dagan, Dan Roth, Mark Sammons, and Fabio Massimo Zanzotto. 2013. Recognizing Textual Entailment: Models and Applications. Morgan & Claypool Publishers.

Pedro Domingos. 2012. A few useful things to know about machine learning. Communications of the ACM, 55(10):78–87.

David A. Ferrucci. 2012. Introduction to "This is Watson". IBM Journal of Research and Development, 56(3.4):1:1–1:15.

Naeemul Hassan, Bill Adair, James T. Hamilton, Chengkai Li, Mark Tremayne, Jun Yang, and Cong Yu. 2015. The quest to automate fact-checking. In Proc. of the 2015 Computation+Journalism Symposium.

Peter Jansen, Niranjan Balasubramanian, Mihai Surdeanu, and Peter Clark. 2016. What's in an explanation? Characterizing knowledge and inference requirements for elementary science exams. In COLING, pages 2956–2965.

Hiroshi Kanayama, Yusuke Miyao, and John Prager. 2012. Answering yes/no questions via question inversion. In COLING-2012, pages 1377–1392.

Yoshinobu Kano. 2014. Solving history exam by keyword distribution: KJP. In NTCIR-11.

Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Peter Clark, Oren Etzioni, and Dan Roth. 2016. Question answering via integer programming over semi-structured knowledge. In IJCAI, pages 1145–1152.

Tushar Khot, Niranjan Balasubramanian, Eric Gribkoff, Ashish Sabharwal, Peter Clark, and Oren Etzioni. 2015. Exploring Markov logic networks for question answering. In EMNLP, pages 685–694.

Mio Kobayashi, Hiroshi Miyashita, Ai Ishii, and Chikara Hoshino. 2016. NUL system at QA Lab-2 task. In NTCIR-12.

Suguru Matsuyoshi, Yusuke Miyao, Tomohide Shibata, Chuan-Jie Lin, Cheng-Wei Shih, Yotaro Watanabe, and Teruko Mitamura. 2014. Overview of the NTCIR-11 recognizing inference in text and validation (RITE-VAL) task. In NTCIR-11.

Tsuyoshi Okita and Qun Liu. 2014. The question answering system of DCUMT in NTCIR-11 QA Lab. In NTCIR-11.

Pew Research Center. 2016. News use across social media platforms 2016.

Deepak Ravichandran and Eduard Hovy. 2002. Learning surface text patterns for a question answering system. In ACL-2002, pages 41–47.

Kotaro Sakamoto, Hyogo Matsui, Eisuke Matsunaga, Takahisa Jin, Hideyuki Shibuki, Tatsunori Mori, Madoka Ishioroshi, and Noriko Kando. 2014. Forst: Question answering system using basic element at NTCIR-11 QA-Lab task. In NTCIR-11.

Kotaro Sakamoto, Madoka Ishioroshi, Hyogo Matsui, Takahisa Jin, Fuyuki Wada, Shu Nakayama, Hideyuki Shibuki, Tatsunori Mori, and Noriko Kando. 2016. Forst: Question answering system for second-stage examinations at NTCIR-12 QA Lab-2 task. In NTCIR-12.

Hideyuki Shibuki, Kotaro Sakamoto, Yoshinobu Kano, Teruko Mitamura, Madoka Ishioroshi, Kelly Y. Itakura, Di Wang, Tatsunori Mori, and Noriko Kando. 2014. Overview of the NTCIR-11 QA-Lab task. In NTCIR-11.

Hideyuki Shibuki, Kotaro Sakamoto, Madoka Ishioroshi, Akira Fujita, Yoshinobu Kano, Teruko Mitamura, Tatsunori Mori, and Noriko Kando. 2016. Task overview for NTCIR-12 QA Lab-2. In NTCIR-12.

Takuma Takada, Takuya Imagawa, Takuya Matsuzaki, and Satoshi Sato. 2016. SML question-answering system for world history essay and multiple-choice exams at NTCIR-12 QA Lab-2. In NTCIR-12.

Ran Tian and Yusuke Miyao. 2014. Answering center-exam questions on history by textual inference. In Proceedings of the 28th Annual Conference of the Japanese Society for Artificial Intelligence.

Andreas Vlachos and Sebastian Riedel. 2014. Fact checking: Task definition and dataset construction. In Proc. ACL 2014 Workshop on Language Technologies and Computational Social Science, pages 18–22.
