Riposte! A Large Corpus of Counter-Arguments
Paul Reisert†‡  Benjamin Heinzerling†‡  Naoya Inoue‡†  Shun Kiyono†‡  Kentaro Inui‡†
† RIKEN Center for Advanced Intelligence Project  ‡ Tohoku University
{paul.reisert, benjamin.heinzerling, [email protected]}  {naoya-i, [email protected]}

Abstract

Constructive feedback is an effective method for improving critical thinking skills. Counter-arguments (CAs), one form of constructive feedback, have been proven to be useful for critical thinking skills. However, little work has been done for constructing a large-scale corpus of them which can drive research on automatic generation of CAs for fallacious micro-level arguments (i.e. a single claim and premise pair). In this work, we cast providing constructive feedback as a natural language processing task and create Riposte!, a corpus of CAs, towards this goal. Produced by crowdworkers, Riposte! contains over 18k CAs. We instruct workers to first identify common fallacy types and produce a CA which identifies the fallacy. We analyze how workers create CAs and construct a baseline model based on our analysis.

Topic: Is It Fair to Rate Professors Online?
Argument: It is not fair to rate professors online because the online ratings effect teacher's careers and are uncontrolled.
Worker A: If “not fair to rate professors online” is assumed to be true, then “the online ratings effect teacher's careers and are uncontrolled” is already assumed to be true. (Begging the Question)
Worker B: There is a questionable cause in the argument because “online ratings” does/will not cause “teacher's careers to be effected”. (Questionable Cause)
Worker C: Rating professors online will make choosing for students easier. (None)

Figure 1: CAs in Riposte! produced by crowdworkers. The fallacy type selected by a worker is shown in parentheses.
arXiv:1910.03246v1 [cs.CL] 8 Oct 2019

1 Introduction

Critical thinking is a crucial skill necessary for valid reasoning, especially for students in a pedagogical context. Towards improving critical thinking skills for students, educators have evaluated the contents of a work and provided constructive feedback (i.e. criticism) to the student. Although such methods are effective, they require educators to articulately evaluate the contents of an essay, which can be time-consuming and varies depending on an educator's critical thinking skills.

In the field of educational research, the usefulness of identifying fallacies and counter-arguments, henceforth CAs, as constructive feedback has been emphasized (de Lima Alves, 2008; Oktavia et al., 2014; Indah and Kusuma, 2015; Song and Ferretti, 2013), as both can help writers produce high-quality arguments while simultaneously improving their critical thinking skills. Shown in Figure 1 is an example of an argument with a fallacy (i.e. errors in the logical reasoning of the argument) and its CAs (i.e. attacks on the argument).

In the field of NLP, previous works have addressed fallacy identification (Habernal et al., 2018a), CA retrieval (Wachsmuth et al., 2017), and CA generation for macro-level arguments (Hua and Wang, 2018), and essay criteria such as thesis clarity (Persing and Ng, 2013), argument strength (Persing and Ng, 2015), and stance (Persing and Ng, 2016) have been evaluated. However, in the pedagogical context, macro-level arguments (e.g., an essay) may consist of several micro-level arguments (i.e. one claim/premise pair) that can each contain multiple fallacies. To bridge this gap, we create CAs for micro-level arguments which can be useful for automatic constructive feedback generation.

Several challenges exist for creating a corpus of CAs for constructive feedback. First, the corpus must contain a variety of different topics and arguments to both train and evaluate a model for unseen topics.
Second, an argument can have many different fallacies which are not easily identifiable (Oktavia et al., 2014; Indah and Kusuma, 2015; El Khoiri and Widiati, 2017). Third, producing CAs is costly and time-consuming.

Fallacy Type | Definition | Template
Begging the Question | The truth of the premise is already assumed by the claim. | “If [something] is assumed to be true, then [something else] is already assumed to be true.”
Hasty Generalization | Someone assumes something is generally always the case based on a few instances. | “It's too hasty to assume that [text].”
Questionable Cause | The cause of an effect is questionable. | “There is a questionable cause in the argument because [questionable cause] does/will not cause [effect].”
Red Herring | Someone reverts attention away from the original claim by changing the topic. | “The topic being discussed is [first topic], but it is being changed to [second topic].”

Table 1: Definition and templates of fallacy types used in our experiments.

Criteria | Begging the Question | Hasty Generalization | Questionable Cause | Red Herring | Total
Unsure | 2,043 | 315 | 2,136 | 1,879 | 6,373
CAs (FS) | 3,365 | 3,818 | 2,121 | 1,772 | 11,076
CAs (O) | 907 | 2,182 | 2,058 | 2,664 | 7,811
CAs (total) | 4,272 | 6,000 | 4,179 | 4,436 | 18,887

Table 2: Full statistics of the Riposte! corpus, where FS represents fallacy-specific CAs and O represents other.
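The templates in Table 1 are fixed patterns with free slots, so they can be treated as ordinary format strings. A minimal Python sketch; the dictionary keys, slot names, and helper function are ours, not part of the corpus release:

```python
# Fill-in-the-blank CA templates from Table 1, written as format strings.
# Slot names in braces are our own labels for the workers' text boxes.
TEMPLATES = {
    "begging_the_question": 'If "{something}" is assumed to be true, then '
                            '"{something_else}" is already assumed to be true.',
    "hasty_generalization": "It's too hasty to assume that {text}.",
    "questionable_cause": 'There is a questionable cause in the argument because '
                          '"{cause}" does/will not cause "{effect}".',
    "red_herring": "The topic being discussed is {first_topic}, but it is "
                   "being changed to {second_topic}.",
}

def fill_template(fallacy_type: str, **slots) -> str:
    """Instantiate one CA template with worker-provided slot fillers."""
    return TEMPLATES[fallacy_type].format(**slots)

# Worker B's CA from Figure 1, reconstructed from its slot values:
ca = fill_template(
    "questionable_cause",
    cause="online ratings",
    effect="teacher's careers to be effected",
)
```

Each crowdsourcing task would expose one such pattern as text boxes, so the worker only supplies the slot values.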
In this work, we design a task for automatic constructive feedback and create Riposte!, a large-scale corpus of CAs via crowdsourcing. Workers are first instructed to identify common fallacy types (begging the question, hasty generalization, questionable cause, and red herring) in educational research (de Lima Alves, 2008; Oktavia et al., 2014; Indah and Kusuma, 2015; Song and Ferretti, 2013) and create a CA for micro-level arguments. In total, we collect 18,887 CAs (see Figure 1 for examples of CAs in Riposte!). We then cast automatic constructive feedback as a text generation task and create a baseline model.

2 The Riposte! corpus

In this section, we determine if training data can easily be created. To the best of our knowledge, this is the first research that addresses corpus construction for automatic constructive feedback.

2.1 Counter-arguments as an NLP task

When designing a task for automatic constructive feedback, one must take into account real-world situations. In the pedagogical context, educators can choose the same topic for students annually. With automatic constructive feedback, educators may choose to use a pretrained, supervised model for a single topic with editable background knowledge (i.e., educators can choose which knowledge is necessary to automatically construct feedback). On the other hand, educators may choose a new topic each year, and thus a conditioned model for multiple topics may also be considered. The input to a model should be a topic and several claim and premise argument pairs, and the output would be a set of CAs useful for improving the argument.

2.2 Existing corpus of arguments

When training a model for constructive feedback, the data should consist of many CAs for a wide variety of topics. We use the Argument Reasoning Comprehension (ARC) dataset (Habernal et al., 2018b), a corpus of 1,263 unique topic-claim-premise pairs (172 unique topics and 264 unique claims). We assume the arguments in ARC contain many fallacies because they were created by non-expert crowdworkers (i.e., workers are not experts in the field of argumentation).

2.3 Riposte! creation

For creating Riposte!, we use the crowdsourcing platform Amazon Mechanical Turk.¹

Data Collection  One challenge for collecting training data for automatic constructive feedback is that the CAs should be useful for improving an argument. To assist with collecting such CAs, we adopt Reisert et al. (2019)'s protocol for collecting CAs using crowdsourcing. We first make several modifications for our data collection (see Appendix). We create 4 separate crowdsourcing tasks (i.e., one for each fallacy type).
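The interface described in Section 2.1 (a topic with claim/premise pairs as input, a set of CAs as output) can be made concrete with a small data model. This is a minimal sketch; the class and field names are ours, not part of the ARC or Riposte! releases:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Argument:
    """One micro-level argument: a topic with a single claim/premise pair."""
    topic: str
    claim: str
    premise: str

@dataclass
class FeedbackInstance:
    """An argument paired with its crowdsourced counter-arguments (CAs)."""
    argument: Argument
    fallacy_type: str                      # e.g. "questionable_cause"
    counter_arguments: List[str] = field(default_factory=list)

# The argument from Figure 1, split into its claim and premise:
arg = Argument(
    topic="Is It Fair to Rate Professors Online?",
    claim="It is not fair to rate professors online",
    premise="the online ratings effect teacher's careers and are uncontrolled",
)
instance = FeedbackInstance(
    argument=arg,
    fallacy_type="questionable_cause",
    counter_arguments=[
        'There is a questionable cause in the argument because "online ratings" '
        'does/will not cause "teacher\'s careers to be effected".'
    ],
)
```

A generation model would consume some serialization of `Argument` and produce the strings in `counter_arguments`.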
For each of the 1,263 arguments in ARC, we ask 5 workers to produce a CA. For each fallacy type, we assist workers by providing them with a “fill-in-the-blank” template, where workers were instructed to fill in text boxes for a given pattern. The fallacy types and templates are shown in Table 1.

2.4 Riposte! statistics

The statistics of Riposte! are shown in Table 2. 11,076 of the CAs are fallacy-specific (i.e. workers first identified a fallacy and then created the CA), and 7,811 CAs were created when a worker did not believe the specified fallacy existed in the argument. 6,373 instances were labeled as unsure (i.e. the worker was unsure about the fallacy type).

4 Experiment

4.1 Experimental design

In Section 3, we observed that workers copied keywords from the argument when creating a CA. Based on this observation, we experiment with different input settings to the model to better understand which parts of the argument annotators used to create their argument (e.g., topic (T) only, premise (P) only, claim (C) only, and so forth). We cast the task of automatic constructive feedback as a generation task and experiment with such settings.

Criteria | Begging the Question | Hasty Generalization | Questionable Cause | Red Herring | Total
Score | 0.61 | 0.17 | 0.35 | 0.36 | 0.24

Table 3: The average Jaccard similarity scores between CAs for a single argument for each fallacy type.

Since both new and existing essay topics can be used and introduced by educators, we consider two possible settings: i) in-domain (i.e. topics are shared between splits) and ii) out-of-domain (i.e. topics are not shared).

For our generation model, we use gold fallacy type information.³ This allows us to understand how well the model can generate CAs when correct fallacy types are predicted.

¹ https://www.mturk.com/
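Table 3 reports average Jaccard similarity between the CAs written for the same argument. A minimal sketch of that computation, assuming whitespace tokenization (the paper does not state its exact tokenizer):

```python
from itertools import combinations

def jaccard(ca_a: str, ca_b: str) -> float:
    """Jaccard similarity between the token sets of two counter-arguments."""
    tokens_a = set(ca_a.lower().split())
    tokens_b = set(ca_b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return len(tokens_a & tokens_b) / len(union)

def mean_pairwise_jaccard(cas: list) -> float:
    """Average Jaccard similarity over all pairs of CAs for one argument."""
    pairs = list(combinations(cas, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

Averaging `mean_pairwise_jaccard` over all arguments of one fallacy type would yield one cell of Table 3.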