Critical Thinking for Language Models
Total Page:16
File Type:pdf, Size:1020Kb
Critical Thinking for Language Models Gregor Betz Christian Voigt Kyle Richardson KIT KIT Allen Institute for AI Karlsruhe, Germany Karlsruhe, Germany Seattle, WA, USA [email protected] [email protected] [email protected] Abstract quality of online debates (Hansson, 2004; Guia¸su and Tindale, 2018; Cheng et al., 2017): Neural lan- This paper takes a first step towards a critical guage models are known to pick up and reproduce thinking curriculum for neural auto-regressive normative biases (e.g., regarding gender or race) language models. We introduce a synthetic present in the dataset they are trained on (Gilburt corpus of deductively valid arguments, and generate artificial argumentative texts to train and Claydon, 2019; Blodgett et al., 2020; Nadeem CRiPT: a critical thinking intermediarily pre- et al., 2020), as well as other annotation artifacts trained transformer based on GPT-2. Signifi- (Gururangan et al., 2018); no wonder this happens cant transfer learning effects can be observed: with argumentative biases and reasoning flaws, too Trained on three simple core schemes, CRiPT (Kassner and Schütze, 2020; Talmor et al., 2020). accurately completes conclusions of different, This diagnosis suggests that there is an obvious and more complex types of arguments, too. remedy for LMs’ poor reasoning capability: make CRiPT generalizes the core argument schemes in a correct way. Moreover, we obtain con- sure that the training corpus contains a sufficient sistent and promising results for NLU bench- amount of exemplary episodes of sound reasoning. marks. In particular, CRiPT’s zero-shot accu- In this paper, we take a first step towards the cre- racy on the GLUE diagnostics exceeds GPT- ation of a “critical thinking curriculum” for neural 2’s performance by 15 percentage points. The language models. Critical thinking can be loosely findings suggest that intermediary pre-training defined as “reasonable reflective thinking that is on texts that exemplify basic reasoning abili- ties (such as typically covered in critical think- focused on deciding what to believe or do.” (Norris ing textbooks) might help language models to and Ennis, 1989) Generally speaking, our study acquire a broad range of reasoning skills. The exploits an analogy between teaching critical think- synthetic argumentative texts presented in this ing to students and training language models so paper are a promising starting point for build- as to improve their reasoning skill. More specifi- ing such a “critical thinking curriculum for lan- cally, we build on three key assumptions that are guage models.” typically made in critical thinking courses and text- books: First, there exist fundamental reasoning 1 Introduction skills that are required for, or highly conducive to, Pre-trained autoregressive language models (LM) a large variety of more specific and advanced criti- such as GPT-2 and GPT-3 achieve, remarkably, com- cal thinking skills (e.g., Fisher, 2001, p. 7). Second, petitive results in a variety of language modeling drawing deductive inferences is one such basic abil- benchmarks without task-specific fine-tuning (Rad- ity (e.g., Fisher, 2001, pp. 7–8). Third, reasoning ford et al., 2019; Brown et al., 2020). Yet, it is also skills are not (just) acquired by learning a theory of widely acknowledged that these models struggle correct reasoning, but by studying lots of examples with reasoning tasks, such as natural language in- and doing “lots of good-quality exercises” (Lau ference (NLI) or textual entailment (Askell, 2020). and Chan, 2020), typically moving from simple to Actually, that doesn’t come as a surprise, given the more difficult problems (e.g., Bowell and Kemp, tendency of humans to commit errors in reason- 2014). ing (Kahneman, 2011; Sunstein and Hastie, 2015), These insights from teaching critical thinking their limited critical thinking skills (Paglieri, 2017), translate, with respect to our study, as follows (see and the resulting omnipresence of fallacies and bi- Fig.1). First of all, we design and build ‘lots of ases in texts and the frequently low argumentative good-quality exercises’: a synthetic corpus of de- 63 Proceedings of the 14th International Conference on Computational Semantics, pages 63–75 June 17–18, 2021. ©2021 Association for Computational Linguistics synthetic GPT-2 mediary pre-training on high-quality texts that ex- argumentative texts Step 3: argument Here comes a valid assessment emplify a basic reasoning skill, namely simple de- Step 1: Step 2: schemes argument: To begin on reasoning multi- inter- with, Susan is a friend tasks: ductive argumentation. Obviously, drawing correct ∀x Fx→¬Gx step mediary of Chloe. Moreover, no conclusion Ga temp- pre- • inferences is just one of the elementary skills typ- sister of Lisa is a friend completion ——⇩—— lating training ¬Fa of Chloe. Therefore, it • GLUE AX, ically covered in critical thinking courses (Fisher, is false that Susan is a SNLI, etc. sister of Lisa. CRiPT 2001). Critical thinking involves more than deduc- tion. And it would hence, by analogy, be unreason- Figure 1: Training and testing of CRiPT language mod- able to expect that intermediary pre-training on the els (critical thinking intermediarily pre-trained trans- synthetic argument corpus suffices to turn language former) with synthetic argumentative texts. models into accomplished reasoners. However, we have shown that argumentative texts (with valid ductively valid arguments which instantiate a vari- syllogistic arguments) are certainly a good starting ety of (syllogistic) argument schemes, and which point when building a more comprehensive dataset are rendered as text paragraphs (Section3). Next, for initial or intermediary pre-training that might we use our synthetic argument text corpus to train help language models to acquire a broad range of and to evaluate GPT-2 (Section4). The training, reasoning skills. Or, to put it differently, the syn- which maximizes a causal language modeling ob- thetic argumentative texts might belong to the core jective, can be conceived of as a generic, intermedi- of a “critical thinking curriculum for language mod- ary pre-training in the spirit of STILTS (Phang els.” In the final section, we advance some ideas et al., 2018) and yields models we term CRiPT for complementing the artificial argument corpus (critical thinking intermediarily pre-trained trans- so as to further improve the performance of LMs former). with regard to different reasoning benchmarks. Evaluating CRiPT’s ability to correctly complete 2 Related Work conclusions of arguments, we observe strong trans- fer learning effects/generalization (Section5): Just To our knowledge, this paper is, together with Gon- training CRiPT on a few central core schemes (gen- tier et al.(2020), among the first to show that au- eralized modus ponens, contraposition and chain toregressive language models like GPT-2 can learn rule) allows it to accurately complete conclusions to reason by training on a text corpus of correct of different types of arguments, too (e.g., complex natural language arguments. By contrast, previ- argumentative forms that involve dilemma and de ous work in this field, described below, has typ- Morgan). The language models appear to connect ically modeled natural language reasoning prob- and generalize the core argument schemes in a cor- lems as classification tasks and trained neural sys- rect way. In addition, CRiPT is equally able to apply tems to accomplish them. For example, Schick learned argument patterns beyond the training cor- and Schütze(2021); Schick and Schütze(2020) pus’ domain. find that a masked language model with classifica- tion head achieves remarkable NLU performance Moreover, we test CRiPT on different reasoning benchmarks. Because we are particularly inter- by pre-structuring the training data. This paper ested in transfer learning effects, we do so in a explores the opposite route: We start with highly zero-shot set-up (i.e., evaluating our argumentation structured (synthetic) data, render it as unstruc- models on entirely unrelated NLU tasks, which fol- tured, plain text and train a uni-directional lan- lows recent work by Mitra et al.(2019); Shwartz guage model on the synthetic text corpus. et al.(2020); Ma et al.(2020)). We obtain consis- Over and above the methodological novelty of tent and promising results for the GLUE diagnos- our approach, we discuss, in the following, related tics (Wang et al., 2018) and SNLI (Bowman et al., reasoning benchmarks and explain what sets our 2015) benchmarks (Section5), finding that training synthetic argument corpus apart from this work. on core schemes clearly improves the NLU skills Rule reasoning in natural language Various of pre-trained models. datasets have been developed for (deductive) rule All these transfer learning effects observed reasoning in natural language. One-step rule appli- strengthen the analogy between teaching critical cation (cf. Weston et al., 2016; Richardson et al., thinking and training language models: A variety 2020; Tafjord et al., 2019; Lin et al., 2019) closely of reasoning skills are improved by generic, inter- resembles the conclusion completion task for gen- 64 eralized modus ponens and generalized modus tol- critical thinking and problem solving skills. Its lens schemes described below. However, we go scope is much broader than our highly specific and beyond previous work in investigating the ability of carefully designed argument corpus. LMs to infer conclusions that have a more complex logico-semantic structure (e.g., existential or uni- 3 An Artificial Argument Corpus versal statements). RuleTaker, arguably the most general system for rule reasoning in natural lan- This section describes the construction of a syn- guage so far, is a transformer model for multi-hop thetic corpus of natural language arguments used 1 inference (Clark et al., 2020). PRover (Saha et al., for training and evaluating CRiPT. 2020) extends RuleTaker by a component for proof The corpus is built around eight simple, deduc- generation and is able to construct valid proofs and tively valid syllogistic argument schemes (top row outperforms RuleTaker in terms answer accuracy in Fig.2).