
Curriculum Learning for Natural Language Understanding

Benfeng Xu1∗, Licheng Zhang1∗, Zhendong Mao1†, Quan Wang2, Hongtao Xie1 and Yongdong Zhang1

1School of Information Science and Technology, University of Science and Technology of China, Hefei, China
2Beijing Research Institute, University of Science and Technology of China, Beijing, China
{benfeng,zlczlc}@mail.ustc.edu.cn, [email protected], {zdmao,htxie,zhyd73}@ustc.edu.cn

∗Equal contribution. †Corresponding author.

Abstract

With the great success of pre-trained language models, the pretrain-finetune paradigm has become the dominant solution for natural language understanding (NLU) tasks. At the fine-tune stage, target task data is usually introduced in a completely random order and treated equally. However, examples in NLU tasks can vary greatly in difficulty, and, similar to the human learning procedure, language models can benefit from an easy-to-difficult curriculum. Based on this idea, we propose our Curriculum Learning approach. By reviewing the trainset in a crossed way, we are able to distinguish easy examples from difficult ones, and arrange a curriculum for language models. Without any manual model architecture design or use of external data, our Curriculum Learning approach obtains significant and universal performance improvements on a wide range of NLU tasks.

Table 1: Examples from the SST-2 sentiment classification task. Difficulty levels are determined by our review method (detailed later).

  Easy cases:
    easy, comfortable                            positive
    most purely enjoyable                        positive
    most plain, unimaginative                    negative
    badly edited                                 negative
  Hard cases:
    why didn't Hollywood think of this sooner    positive
    I simply can't recommend it enough           positive
    supposedly funny movie                       negative
    occasionally interesting                     negative

1 Introduction

Natural Language Understanding (NLU), which requires machines to understand and reason with human language, is a crucial yet challenging problem. Recently, language model (LM) pre-training has achieved remarkable success in NLU. Pre-trained LMs learn universal language representations from large-scale unlabeled data, and can be fine-tuned with only a few adjustments to adapt to various NLU tasks, showing consistent and significant improvements on these tasks (Radford et al., 2018; Devlin et al., 2018).

While much attention has been devoted to designing better pre-training strategies (Yang et al., 2019; Liu et al., 2019; Raffel et al., 2019), it is also valuable to explore how to solve downstream NLU tasks more effectively in the fine-tuning stage. Most current approaches perform fine-tuning in a straightforward manner, i.e., all training examples are treated equally and presented in a completely random order during training. However, even within the same NLU task, training examples can vary significantly in difficulty, with some easily solvable by simple lexical clues while others require sophisticated reasoning. Table 1 shows some examples from the SST-2 sentiment classification task (Socher et al., 2013), which identifies sentiment polarities (positive or negative) of movie reviews. The easy cases can be solved directly by identifying sentiment words such as "comfortable" and "unimaginative", while the hard ones further require reasoning over negations or qualifiers like "supposedly" and "occasionally". Extensive research suggests that presenting training examples in a meaningful order, starting from easy ones and gradually moving on to hard ones, benefits the learning process, not only for humans but also for machines (Skinner, 1958; Elman, 1993; Peterson, 2004; Krueger and Dayan, 2009).

Such an organization of learning materials in the human learning procedure is usually referred to as a curriculum. In this paper, we draw inspiration from similar ideas and propose our approach for arranging a curriculum when learning NLU tasks.

Curriculum Learning (CL) was first proposed by (Bengio et al., 2009), where the definition of easy examples is established ahead of time, and an easy-to-difficult curriculum is arranged accordingly for the learning procedure. Recent developments have successfully applied CL in several areas (Jiang et al., 2017; Guo et al., 2018; Hacohen and Weinshall, 2019). It is observed in these works that, by excluding the negative impact of difficult or even noisy examples in the early training stage, an appropriate CL strategy can guide learning towards a better local minimum in parameter space, especially for highly non-convex deep models. We argue that language models like the transformer, which are hard to train (Popel and Bojar, 2018), should also benefit from CL in the context of learning NLU tasks, and this idea still remains unexplored.

The key challenge in designing a successful CL strategy lies in how to define easy/difficult examples. One straightforward way is to simply pre-define the difficulty with devised rules, by observing the particular target task formation or training data structure (Guo et al., 2018; Platanios et al., 2019; Tay et al., 2019). For example, (Bengio et al., 2009) utilized an easier version of a shape recognition trainset comprising less varied shapes before training on the complex one started. More recently, (Tay et al., 2019) considered the paragraph length of a question answering example as a reflection of its difficulty. However, such strategies are highly dependent on the target dataset itself and often fail to generalize to different tasks.

To address this challenge, we propose our Cross Review method for evaluating difficulty. Specifically, we define easy examples as those well solved by the exact model that we are to employ in the task. For different tasks, we adopt their corresponding golden metrics to calculate a difficulty score for each example in the trainset. Then, based on these difficulty scores, we further design a re-arranging algorithm to construct the learning curriculum in an annealing style, which provides a soft transition from easy to difficult for the model. In general, our CL approach is not constrained to any particular task, and does not rely on human prior heuristics about the task or dataset.

Experimental results show that our CL approach can greatly help language models learn in their finetune stage. Without any task-tailored model architecture design or use of external data, we are able to obtain significant and universal improvements on a wide range of downstream NLU tasks. Our contributions can be summarized as follows:

• We explore and demonstrate the effectiveness of CL in the context of finetuning LMs on NLU tasks. To the best of our knowledge, this is one of the first times that a CL strategy has been shown to be broadly promising for learning NLU tasks.

• We propose a novel CL framework that consists of a Difficulty Review method and a Curriculum Arrangement algorithm, which requires no human pre-design and generalizes well across tasks.

• We obtain universal performance gains on a wide range of NLU tasks including Machine Reading Comprehension (MRC) and Natural Language Inference. The improvements are especially significant on tasks that are more challenging.

2 Preliminaries

We describe our CL approach using BERT (Devlin et al., 2018), the most influential pre-trained LM, which achieved state-of-the-art results on a wide range of NLP tasks. BERT is pretrained with the Masked Language Model and Next Sentence Prediction tasks on large-scale corpora. It consists of a hierarchical stack of l self-attention layers, which takes as input a sequence of no more than 512 tokens and outputs a contextual representation, an H-dimensional vector, for the token at each position i, which we denote as h_i^l ∈ R^H.

In natural language understanding tasks, the input sequence usually starts with the special token [CLS] and ends with [SEP]; for sequences consisting of two segments, as in pairwise sentence tasks, another [SEP] is added in between as a separator.
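As a concrete illustration of this input format, the following sketch builds single-segment and two-segment inputs with the tokenizer from the HuggingFace library that also provides the PyTorch BERT implementation used later in this paper (Wolf et al., 2019). The checkpoint name and the example strings are ours, not from the original experiments.

from transformers import BertTokenizer

# Any BERT checkpoint works here; "bert-base-uncased" is only an example.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Single-segment input: [CLS] tokens ... [SEP]
single = tokenizer("badly edited", truncation=True, max_length=512)

# Two-segment input (pairwise sentence tasks): [CLS] segment A [SEP] segment B [SEP]
pair = tokenizer("supposedly funny movie", "occasionally interesting",
                 truncation=True, max_length=512)

# The special tokens are inserted automatically by the tokenizer.
print(tokenizer.convert_ids_to_tokens(single["input_ids"]))
print(tokenizer.convert_ids_to_tokens(pair["input_ids"]))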

For target benchmarks, we employ a wide range of NLU tasks, including machine reading comprehension, sequence classification, and pairwise text similarity. Following (Devlin et al., 2018), we adapt BERT to NLU tasks in the most straightforward way: we simply add one necessary linear layer upon the final hidden outputs, then finetune the entire model altogether. Specifically, we brief the configurations and corresponding metrics for the different tasks employed in our algorithms as follows.

Machine Reading Comprehension  In this work we consider the extractive MRC task. Given a passage P and a corresponding question Q, the goal is to extract a continuous span ⟨p_start, p_end⟩ from P as the answer A, where p_start and p_end are its boundaries. We pass the concatenation of the question and paragraph, [[CLS], Q, [SEP], P, [SEP]], to the pre-trained LM and use a linear classifier on top of it to predict the answer span boundaries. For the i-th input token, the probabilities that it is the start or the end are calculated as:

  [logit_i^start, logit_i^end]^T = W_MRC h_i^l
  p_i^start = softmax({logit_i^start})
  p_i^end = softmax({logit_i^end})

where W_MRC ∈ R^{2×H} is a trainable matrix. The training objective is the negative log-likelihood of the true start and end positions y_start and y_end:

  loss = −(log p_{y_start}^start + log p_{y_end}^end)

For unanswerable questions, the probability is calculated as s_un = p_cls^start + p_cls^end using the [CLS] representation. We classify a question as unanswerable when s_un > s_{i,j} = max_{i≤j}(p_i^start + p_j^end). F1 is used as the golden metric.

Sequence Classification  We consider the final contextual embedding of the [CLS] token, h_0^l, as the pooled representation of the whole input sequence S. The probability that the input sequence belongs to label c is calculated by a linear output layer with parameter matrix W_SC ∈ R^{K×H} followed by a softmax:

  P(c|S) = softmax(h_0^l W_SC^T)

where K is the number of classes. The negative log-likelihood is likewise used as the training objective for this task. Accuracy is the golden metric.

Pairwise Text Similarity  Similar to the sequence classification task, the final embedding of the [CLS] token, h_0^l, is used to represent the input text pair (T1, T2). A parameter vector W_PTS ∈ R^H is introduced to compute the similarity score:

  Similarity(T1, T2) = h_0^l W_PTS^T

For this task, we use Mean Squared Error (MSE) as both the training objective and the golden metric:

  MSE = (y − Similarity(T1, T2))^2

where y is the similarity label given as a continuous score.
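Each of the three heads above is a single linear layer on top of the final hidden states. The following is a minimal PyTorch sketch of such heads; the module and attribute names are ours and not taken from the authors' released code, and the hidden states are assumed to come from any BERT-style encoder.

import torch.nn as nn

class TaskHeads(nn.Module):
    """Linear output layers described in Section 2 (hidden size H, K classes)."""

    def __init__(self, hidden_size, num_classes):
        super().__init__()
        self.w_mrc = nn.Linear(hidden_size, 2)           # W_MRC: start/end logits per token
        self.w_sc = nn.Linear(hidden_size, num_classes)  # W_SC: sequence classification
        self.w_pts = nn.Linear(hidden_size, 1)           # W_PTS: pairwise similarity score

    def mrc_loss(self, hidden, start_positions, end_positions):
        # hidden: (batch, seq_len, H) -> start/end logits: (batch, seq_len)
        start_logits, end_logits = self.w_mrc(hidden).split(1, dim=-1)
        ce = nn.CrossEntropyLoss()
        return (ce(start_logits.squeeze(-1), start_positions)
                + ce(end_logits.squeeze(-1), end_positions))

    def classification_loss(self, hidden, labels):
        # h_0^l: the final [CLS] representation is the first token.
        logits = self.w_sc(hidden[:, 0])
        return nn.CrossEntropyLoss()(logits, labels)

    def similarity_loss(self, hidden, scores):
        predictions = self.w_pts(hidden[:, 0]).squeeze(-1)
        return nn.MSELoss()(predictions, scores)

Cross-entropy over token positions is just the negative log-likelihood of the true start and end positions written above.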
Figure 1: Our Cross Review method: the target dataset is split into N meta-datasets; after the teachers are trained on them, each example is inferenced by all the other teachers (except the one whose meta-dataset it belongs to), and the scores are summed as the final evaluation result.

3 Our CL Approach

We decompose our CL framework into two stages: Difficulty Evaluation and Curriculum Arrangement. For any target task, let D be the example set used for training, and Θ be our language model, which is expected to fit D. In the first stage, the goal is to assign each example d_j in D a score c_j that reflects its difficulty with respect to the model. We denote by C the whole difficulty score set corresponding to trainset D. In the second stage, based on these scores, D is organized into a sequence of ordered learning stages {S_i : i = 1, 2, ..., N} in an easy-to-difficult fashion, resulting in the final curriculum on which the model will be trained. We elaborate on these two stages in Sections 3.1 and 3.2, respectively.

3.1 Difficulty Evaluation

The difficulty of a textual example reflects itself in many ways, e.g., the length of the context, the usage of rare words, or the scale of the learning target. Although such heuristics may seem reasonable to humans, the model itself may not see it the same way. So we argue that the difficulty score, as an intrinsic property of an example, should be decided by the model itself, and the best metric is the golden metric of the target task, which can be accuracy, F1 score, etc., as introduced in Section 2.

To perform difficulty evaluation, we first scatter our trainset D into N shares uniformly as {D̃_i : i = 1, 2, ..., N}, and train N corresponding models {Θ̃_i : i = 1, 2, ..., N} on them, all identical to Θ (note that each model Θ̃_i will only see 1/N of the entire trainset). We refer to these N models as teachers, and to {D̃_i} as meta-datasets, in that they serve only to collect information (i.e., the extent of difficulty) about the original trainset D. This preparation of teachers can be formulated as:

  Θ̃_i = argmin_{Θ̃_i} Σ_{d_j ∈ D̃_i} L(d_j, Θ̃_i),   i = 1, 2, ..., N

where L indicates the loss function. After every teacher has been trained on its respective meta-dataset, the evaluation of trainset D begins. Each example d_j is included in one and only one meta-dataset, say D̃_k; we then perform inference of d_j on all teachers except teacher k, because the inference from teacher k is supposed to be isolated from the meta-dataset D̃_k it has already seen during training. After all inferences finish, we calculate scores of d_j under the target task's metric, resulting in N − 1 scores from the N − 1 different teachers:

  c_ji = M(Θ̃_i(x_j), y_j)

where Θ̃_i(·) represents the inference function, x_j and y_j are the input and label of example d_j respectively, M is the metric calculation formula, which can be F1, Accuracy or MSE for different tasks as introduced in Section 2, and c_ji is the score of d_j from teacher Θ̃_i. Finally, we define the difficulty score of d_j as the integration of all N − 1 scores:

  c_j = Σ_{i ∈ {1,...,N}, i ≠ k} c_ji

With all scores calculated, we obtain the final difficulty score set C as desired. We refer to our difficulty evaluation method as Cross Review (see Figure 1).

In the proposed method, the teacher models perform their inferences in a crossed way, which prevents the meta-datasets from contaminating the inference set. Besides, each example gets its score from multiple teachers, so the fluctuation of evaluation results is greatly alleviated. In general, our Cross Review method addresses the difficulty evaluation problem with an elegant design.
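A compact sketch of the Cross Review procedure is given below. It is written in plain Python against placeholder train_fn and score_fn callables (training one teacher on a split, and computing the golden-metric score of one example, respectively), which stand in for whatever task-specific training and evaluation code is used; only the splitting and crossed scoring logic follows the description above.

import random

def cross_review(train_set, num_splits, train_fn, score_fn, seed=0):
    """Assign each training example a difficulty score c_j (Section 3.1).

    train_fn(meta_dataset) -> a teacher model trained only on that split.
    score_fn(teacher, example) -> the golden-metric score c_ji of one example.
    Returns a dict mapping example index (in train_set) to its summed score.
    """
    indexed = list(enumerate(train_set))
    random.Random(seed).shuffle(indexed)

    # Scatter the trainset uniformly into N meta-datasets, one teacher per split.
    metas = [indexed[i::num_splits] for i in range(num_splits)]
    teachers = [train_fn([example for _, example in meta]) for meta in metas]

    scores = {}
    for k, meta in enumerate(metas):
        for idx, example in meta:
            # Review each example with every teacher except the one trained on it.
            scores[idx] = sum(
                score_fn(teachers[i], example)
                for i in range(num_splits)
                if i != k
            )
    return scores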

3.2 Curriculum Arrangement

In this section we describe our method for arranging the training examples D into a learning curriculum according to their difficulty scores C. We design our curriculum in a multi-stage setting {S_i : i = 1, 2, ..., N}. Within each stage S_i, the examples are still shuffled to keep local stochasticity, and examples from different stages do not overlap in order to prevent overfitting. The sampling algorithm is built upon the following principle:

  The proportion of difficult examples in each stage should start at 0, and gradually increase until it reaches the proportion it accounts for in the original dataset distribution.

We first sort all examples by their difficulty score C and divide them into N buckets {C_i : i = 1, 2, ..., N}, so the examples are now collected into N different levels of difficulty, ranging from C_1 (the easiest) to C_N (the hardest), with the proportion distribution:

  num(C_1) : num(C_2) : ··· : num(C_N)

For tasks with discrete metrics, such a distribution is naturally formed by the difficulty score hierarchy, and directly reflects the intrinsic difficulty distribution of the dataset. For other tasks, we manually divide C uniformly.¹ Based on these buckets, we construct the learning curriculum one stage after another. For each learning stage S_i, we sample examples from all antecedent buckets {C_j : j = 1, 2, ..., i} in the following proportion:

  (1/N) num(C_1) : (1/N) num(C_2) : ··· : (1/N) num(C_i)

and the final curriculum {S_i : i = 1, 2, ..., N} is formed as such. We refer to this arrangement algorithm as the Annealing method, for it provides a soft transition through multiple learning stages.

At each stage, the model is trained for one epoch. When training reaches S_N, the model should be ready for the original distribution of trainset D, so we finally add another stage S_{N+1}, which covers the entire trainset, and the model is trained on it until convergence.

¹Please refer to our implementation details for the selected tasks in Section 4.2.
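The arrangement step can be sketched as follows. This is a simplified rendering under two assumptions spelled out in the comments: buckets are formed by an even split of the sorted examples (the paper uses the natural score levels when the metric is discrete), and every bucket is pre-partitioned into N shares so that stages never reuse an example. All function and variable names are ours.

import random

def annealing_curriculum(examples, scores, num_buckets, seed=0):
    """Arrange examples into stages S_1..S_N plus a final full-data stage (Section 3.2).

    scores[idx] is the Cross Review difficulty score of examples[idx];
    a higher score means more teachers solved the example, i.e. it is easier.
    """
    rng = random.Random(seed)
    order = sorted(range(len(examples)), key=lambda idx: -scores[idx])  # easiest first

    # Buckets C_1 (easiest) .. C_N (hardest); here simply an even split of the ranking.
    size = len(order) // num_buckets
    buckets = [order[b * size:(b + 1) * size] for b in range(num_buckets)]
    buckets[-1].extend(order[num_buckets * size:])  # leftovers join the hardest bucket

    # Pre-partition every bucket into N shares of roughly num(C_b)/N examples,
    # so that different stages draw disjoint examples.
    shares = [[bucket[j::num_buckets] for j in range(num_buckets)] for bucket in buckets]

    stages = []
    for i in range(num_buckets):
        # Stage S_{i+1} samples 1/N of every antecedent bucket C_1..C_{i+1}.
        stage = [idx for b in range(i + 1) for idx in shares[b][i]]
        rng.shuffle(stage)  # keep local stochasticity within a stage
        stages.append([examples[idx] for idx in stage])

    # S_{N+1}: the whole trainset with its original distribution.
    stages.append(list(examples))
    return stages

Training then runs one epoch on each of S_1..S_N and continues on the final stage until convergence.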
4 Experiments

4.1 Datasets

In this section we briefly describe the three popular NLU benchmarks on which we evaluate our CL approach: SQuAD 2.0 (Rajpurkar et al., 2018), NewsQA (Trischler et al., 2016) and GLUE (Wang et al., 2018). Their scales and metrics are detailed in Table 2.

           SQuAD2.0  NewsQA  MNLI-m  QNLI     QQP       RTE       SST-2     MRPC  CoLA     STS-B
  Train    130.3k    92.5k   392.7k  104.7k   363.8k    2.5k      67.3k     3.7k  8.6k     5.7k
  Dev      11.9k     5.2k    9.8k    5.5k     40.4k     277       872       408   1.0k     1.5k
  Test     8.9k      5.1k    9.8k    5.5k     39.1k     3.0k      1.8k      1.7k  1.0k     1.4k
  Metrics  F1/EM     F1/EM   Accuracy Accuracy Accuracy Accuracy  Accuracy  F1    Matthew  Pearson

Table 2: The number of training, development, test examples and metrics of tasks used in this work.

SQuAD  The Stanford Question Answering Dataset (SQuAD), constructed from Wikipedia articles, is a well-known extractive machine reading comprehension dataset with two tasks: SQuAD 1.1 (Rajpurkar et al., 2016) and SQuAD 2.0 (Rajpurkar et al., 2018). The latest 2.0 version also introduces unanswerable questions, making it a more challenging and practical task. In this paper, we take SQuAD 2.0 as our testbed.

  Method          SQuAD 2.0 EM   SQuAD 2.0 F1   NewsQA EM   NewsQA F1
  BERT Base       73.66          76.30          -           -
  BERT Base*      73.66          76.78          47.70       60.10
  BERT Base+CL    74.96          77.93          47.72       60.57
  BERT Large      78.98          81.77          -           -
  BERT Large*     79.12          82.09          50.40       64.12
  BERT Large+CL   79.43          82.66          50.50       64.42

Table 3: Results on SQuAD 2.0 and NewsQA, all on development sets. The baseline on SQuAD 2.0 is obtained from (Yang et al., 2019); * indicates our re-implementation.

NewsQA  NewsQA (Trischler et al., 2016) is also an MRC dataset in the extractive style, but it is much more challenging, with human performance at 0.694 F1. NewsQA is collected from CNN news articles with two sets of crowdworkers: the "questioners" are provided with the article's headline only, and the "answerers" are supposed to find the answer in the full article. We ignore examples flagged as lacking annotator agreement for better evaluation, following (Fisch et al., 2019).

GLUE  The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) is a collection of nine² diverse sentence or sentence-pair language understanding tasks, including sentiment analysis, textual entailment, sentence similarity, etc. It is considered a well-designed benchmark for evaluating the generalization and robustness of NLU algorithms. The labels for the GLUE test sets are hidden, and users must upload their predictions to obtain evaluation results; submissions are limited to protect the test sets from overfitting.

²The benchmark consists of: Multi-Genre NLI (MNLI) (Williams et al., 2018), Quora Question Pairs (QQP) (Shankar Iyer, 2016), Question NLI (QNLI) (Rajpurkar et al., 2016), Stanford Sentiment Treebank (SST) (Socher et al., 2013), Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2019), Semantic Textual Similarity Benchmark (STS-B) (Cer et al., 2017), Microsoft Research Paraphrase Corpus (MRPC) (Dolan and Brockett, 2005), Recognizing Textual Entailment (RTE) (Bentivogli et al., 2009), and Winograd NLI (WNLI) (Levesque et al., 2012).

4.2 Experimental Setups

We use BERT Large (Devlin et al., 2018) as our pre-trained language model to demonstrate the effectiveness of our CL approach. For MRC, we also test on the BERT Base model for more comprehensive results. Besides reported results from the literature, we also provide our re-implementation on all datasets, which forms a more competitive baseline for comparison. The only difference between our re-implementation and our CL approach is the arrangement of the curriculum, i.e., the order of training examples.

To obtain a more comparable and stable difficulty score, we binarize the review results before summing them together where possible. For accuracy as the metric, the score c_ji is already binary at the instance level; for F1 as the metric, we count any review result c_ji > 0 as correct. For other, continuous metrics (MSE in this paper), we sum c_ji directly. We empirically choose N = 10 as the number of meta-datasets for most tasks (also the number of difficulty levels and the number of stages); for three datasets with rather limited scale (RTE, MRPC, and STS-B), we change it to N = 3. The scale of all datasets employed in this work is provided in Table 2. Intuitively, we could get better results by searching for the best N; we leave this to future work due to limited computation resources.
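The binarization rule described above can be made explicit with a small helper; the function name and the metric labels are ours, and per_teacher_scores holds the N − 1 review results c_ji of a single example.

def combine_review_scores(per_teacher_scores, metric):
    """Turn the N-1 per-teacher review results of one example into its difficulty score."""
    if metric == "accuracy":
        # Already binary at the instance level: just count correct teachers.
        return sum(per_teacher_scores)
    if metric == "f1":
        # Any non-zero overlap with the gold answer counts as a correct review.
        return sum(1 for c in per_teacher_scores if c > 0)
    # Continuous metrics such as MSE are summed directly.
    return sum(per_teacher_scores)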

  results on dev    MNLI-m  QNLI  QQP   RTE   SST-2  MRPC  CoLA  STS-B  Avg
  BERT Large        86.6    92.3  91.3  70.4  93.2   88.0  60.6  90.0   84.1
  BERT Large*       86.6    92.5  91.5  74.4  93.8   91.7  63.5  90.2   85.5
  BERT Large+CL     86.6    92.8  91.8  76.2  94.2   91.9  66.8  90.6   86.4
  results on test
  BERT Large        86.7    91.1  89.3  70.1  94.9   89.3  60.5  87.6   83.7
  BERT Large*       86.3    92.2  89.5  70.2  94.4   89.3  60.5  87.3   83.7
  BERT Large+CL     86.7    92.5  89.5  70.7  94.6   89.6  61.5  87.8   84.1

Table 4: Results on the GLUE benchmark. * indicates our re-implementation; baselines on dev sets are obtained from (Liu et al., 2019); baselines on test sets are obtained from the leaderboard (https://gluebenchmark.com/leaderboard) as submitted by (Devlin et al., 2018), which may have used different hyperparameters. All results are produced with a single task and a single model.

We implement our approach based on the PyTorch implementation of BERT (Wolf et al., 2019). We use the Adam (Kingma and Ba, 2014) optimizer with epsilon equal to 1e-8. The learning rates warm up over the first 5% of steps and then decay linearly to 0 for all experiments. To construct our re-implementation, on both SQuAD 2.0 and NewsQA we perform a hyperparameter search with batch size in {16, 32} and learning rate in {1e-5, 2e-5, 3e-5, 4e-5} for the Base model, and {32, 48, 64} and {5e-5, 6e-5, 7e-5} for the Large model. We reuse the best parameter setting from SQuAD 2.0 on NewsQA. We set the maximum input sequence length to 512 for NewsQA because its paragraphs are much longer. On GLUE, we run the experiments on the Large model with batch size in {16, 32} and learning rate in {1e-5, 2e-5, 3e-5}.

4.3 MRC Results

The results for the MRC tasks are presented in Table 3. In all experiments, our CL approach outperforms its baseline by a considerable margin. On SQuAD 2.0, we obtain +1.30 EM / +1.15 F1 improvements using the Base model and +0.31 EM / +0.57 F1 using the Large model compared to our competitive re-implemented baseline. Note that the performance gain is more significant with the Base model. On NewsQA, we also get +0.02 EM / +0.47 F1 and +0.10 EM / +0.30 F1 improvements for the Base and Large models, respectively.

4.4 GLUE Results

We summarize our GLUE results in Table 4. Results on the dev sets show that our CL method consistently outperforms the competitive baseline on all 8 tasks, which shows that our CL approach is not only robustly effective but also generalizable to a wide range of NLU tasks. Because the model architecture and hyper-parameter settings are identical, all the performance gains can be attributed to our CL approach alone.

Specifically, we observe that our CL approach does better on more challenging tasks. For CoLA and RTE, the margin is up to +3.3 and +1.8 in the respective metrics, relatively larger than on less challenging tasks where model performance has already reached a plateau. Such results are understandable: when learning harder tasks, the model can be overwhelmed by very difficult examples in the early stages, and a well-arranged curriculum is thus more helpful. And for tasks where the baselines already approach human performance, like SST-2, our CL approach still provides another +0.4 improvement, which demonstrates the robustness of our approach. Overall, our CL approach obtains a +0.9 average score gain on the GLUE benchmark compared to our re-implemented baseline.

Results on the test sets further demonstrate the effectiveness of our approach. We obtain a +0.4 average score gain compared to both our re-implementation and the baseline on the leaderboard.

4.5 Ablation Study

In this section, we examine our approach through a series of questions: (i) what is the best CL design strategy for NLU tasks, (ii) can Cross Review really distinguish easy examples from difficult ones, and (iii) what is the best choice of N. We choose the SQuAD 2.0 task for most experiments for generality, and all experiments are performed with the BERT Base model.

Comparison with Heuristic CL Methods  To demonstrate our advantage over manually designed CL methods, we compare our approach with several heuristic curriculum designs in Table 5.

  Method                     SQuAD 2.0 EM   SQuAD 2.0 F1   ∆
  No Curriculum              -              76.30          -
  No Curriculum*             73.66          76.78          -
  Rarity+Annealing           73.75          76.90          +0.12
  Answer+Annealing           74.02          77.15          +0.37
  Question+Annealing         74.35          77.37          +0.59
  Paragraph+Annealing        74.45          77.54          +0.76
  Cross-Review+Naive order   74.31          77.29          +0.51
  Cross-Review+Annealing     74.96          77.93          +1.15

Table 5: Comparisons with heuristic CL designs (written in italic in the original layout). * indicates our re-implementation; ∆ indicates absolute improvement on F1.

For the Difficulty Review step, we adopt word rarity, answer length, question length, and paragraph length as difficulty metrics, similar to (Tay et al., 2019; Platanios et al., 2019). We calculate word rarity as the average word frequency of the question, where frequencies are counted over all questions in the trainset. We define difficult examples as those with lower word frequencies and longer answers, questions, and paragraphs. We first sort all examples using these metrics and divide them evenly to obtain 10 example buckets with corresponding levels of difficulty; the Curriculum Arrangement strategy remains unchanged as Annealing. For the Curriculum Arrangement step, we try a Naive order for comparison: we directly implement the curriculum as {C_i} (instead of {S_i}) without any sampling algorithm, except that S_{N+1} is still retained for fair comparison. In the meantime, the Difficulty Evaluation method remains unchanged as Cross Review. The results show that these intuitive designs indeed work well, with improvements ranging from +0.12 to +0.76 F1, but they are all outperformed by our Cross Review + Annealing approach.

Case study: Easy vs. Difficult  In our Cross Review method, the dataset is divided into N buckets {C_i} with different levels of difficulty. Here we further explore what these easy/difficult examples in various tasks actually look like. Earlier, in the introduction (see Table 1), we provided a straightforward illustration of easy cases versus hard cases in the SST-2 dataset. Among ten different levels of difficulty, these cases are sampled from the easiest bucket (C_1) and the most difficult bucket (C_10), respectively. The results are very clear and intuitive.

We further choose SQuAD 2.0 as a more complex task for in-depth analysis. Under the N = 10 setting, we reveal the statistical distinctions among all buckets {C_i} in Figure 2. With three monotonically increasing curves, it is very clear that difficult examples tend to involve longer paragraphs, longer questions, and longer answers. Such conclusions conform to our intuition that longer text usually involves more complex reasoning patterns and context dependency; these challenging examples are now successfully excluded from the early stages thanks to our CL approach. Another interesting result is that the percentage of unanswerable examples drops consistently from 40% to 20% along the difficulty axis. We assume that simply doing classification is easier than extracting the exact answer boundaries.

Figure 2: Statistical illustration of the different levels of difficulty in SQuAD 2.0. The four lines respectively indicate average answer length, question length, paragraph length, and the proportion of unanswerable examples with respect to the level of difficulty. Bars indicate the number of examples in each bucket. Best viewed in color.

On Different Settings of N  One argument that needs to be specified ahead of time in our approach is N, which decides the number of meta-datasets, the number of learning stages, and also the granularity of our difficulty score. Assume the metric is between 0 and 1, which fits almost all cases; then the difficulty score c_j ranges from 0 (when all teacher models fail) to N − 1 (when all teacher models succeed), so all examples can be distinguished into N different levels. With a larger N, the granularity is finer.

To examine the impact of different settings, we perform an ablation study on the SQuAD 2.0 task over a wide range of choices, from 2 to 20 (see Figure 3). Under all settings our approach outperforms the baseline by at least +0.5 F1 (even including N = 2, where the difficulty evaluation results may be affected by the fluctuation of single-teacher review). We also experiment with an extremely large N: for N = 100, the result is 74.10 F1 (2.68 below our baseline), which is as expected because each meta-dataset is then too small to prepare a decent teacher capable of evaluating. In general, our approach is very robust to the setting of N.

Figure 3: F1 score on SQuAD 2.0 with respect to N. The dotted line is the baseline, the solid line is our re-implementation. Best viewed in color.

5 Related Works

The idea of training a neural network in an easy-to-difficult fashion can be traced back to (Elman, 1993). (Krueger and Dayan, 2009) revisited the idea from a cognitive perspective with the shaping procedure, in which a teacher decomposes a complete task into sub-components. Based on these works, Curriculum Learning was first proposed in (Bengio et al., 2009). They designed several toy experiments to demonstrate the benefits of a curriculum strategy in both image classification and language modeling. They also proposed that a curriculum can be seen as a sequence of training criteria, and that at the end of it, the reweighting of examples should be uniform with the target distribution, which inspired the design of our Curriculum Arrangement algorithm.

Although CL has been successfully applied to many areas in computer vision (Supancic and Ramanan, 2013; Chen and Gupta, 2015; Jiang et al., 2017), it was not introduced to NLU tasks until (Sachan and Xing, 2016). By experimenting with several heuristics, they migrated the success of CL (Kumar et al., 2010) to machine reading comprehension tasks. (Sachan and Xing, 2018) further extended this work to question generation. More recently, (Tay et al., 2019) employed a CL strategy to solve reading comprehension over long narratives. Apart from these, there are not many works that discuss CL in the context of NLU, to the best of our knowledge.

On the methodology of designing CL algorithms, our approach is closely related to (Guo et al., 2018; Wang et al., 2019; Platanios et al., 2019; Tay et al., 2019), where a curriculum is formed in two steps: evaluating the difficulty first, then sampling the examples into batches accordingly. For different target tasks, the evaluation methods vary greatly. (Guo et al., 2018) examined the examples in their feature space and defined difficulty by distribution density, which successfully distinguished noisy images. (Wang et al., 2019) incorporated category information into the difficulty metric to address imbalanced data classification. In language tasks, (Platanios et al., 2019) and (Tay et al., 2019) proposed to consider the length of the context as the extent of difficulty. Another line of work treats curriculum construction as an optimization problem (Kumar et al., 2010; Graves et al., 2017; Fan et al., 2018), which usually involves sophisticated design and is quite different from our approach.

6 Conclusion

In this work we proposed a novel Curriculum Learning approach which does not rely on human heuristics and is simple to implement. With the help of such a curriculum, language models perform significantly and universally better on a wide range of downstream NLU tasks. In the future, we look forward to extending the CL strategy to the pretraining stage, and guiding deep models like the transformer from a language beginner to a language expert.

Acknowledgments

We thank all anonymous reviewers for their valuable comments. This work is supported by the National Natural Science Foundation of China, Grant No. U19A2057 and No. 61876223, the National Science Fund for Distinguished Young Scholars No. 61525206, the Fundamental Research Funds for the Central Universities, Grant No. WK3480000008, and the grant of Tianjin New Generation Artificial Intelligence Major Program No. 19ZXZNGX00110.

References

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48. ACM.

Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC.

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14.

Xinlei Chen and Abhinav Gupta. 2015. Webly supervised learning of convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1431–1439.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Jeffrey L Elman. 1993. Learning and development in neural networks: The importance of starting small. Cognition, 48(1):71–99.

Yang Fan, Fei Tian, Tao Qin, Xiang-Yang Li, and Tie-Yan Liu. 2018. Learning to teach. arXiv preprint arXiv:1805.03643.

Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of the 2nd Machine Reading for Question Answering (MRQA) Workshop at EMNLP.

Alex Graves, Marc G Bellemare, Jacob Menick, Remi Munos, and Koray Kavukcuoglu. 2017. Automated curriculum learning for neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1311–1320. JMLR.org.

Sheng Guo, Weilin Huang, Haozhi Zhang, Chenfan Zhuang, Dengke Dong, Matthew R Scott, and Dinglong Huang. 2018. CurriculumNet: Weakly supervised learning from large-scale web images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 135–150.

Guy Hacohen and Daphna Weinshall. 2019. On the power of curriculum learning in training deep networks. arXiv preprint arXiv:1904.03626.

Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. 2017. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. arXiv preprint arXiv:1712.05055.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kai A Krueger and Peter Dayan. 2009. Flexible shaping: How learning in small steps helps. Cognition, 110(3):380–394.

M Pawan Kumar, Benjamin Packer, and Daphne Koller. 2010. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pages 1189–1197.

Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Gail B Peterson. 2004. A day of great illumination: B. F. Skinner's discovery of shaping. Journal of the Experimental Analysis of Behavior, 82(3):317–328.

Emmanouil Antonios Platanios, Otilia Stretcu, Graham Neubig, Barnabas Poczos, and Tom M Mitchell. 2019. Competence-based curriculum learning for neural machine translation. arXiv preprint arXiv:1903.09848.

Martin Popel and Ondřej Bojar. 2018. Training tips for the transformer model. The Prague Bulletin of Mathematical Linguistics, 110(1):43–70.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language understanding paper.pdf.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Mrinmaya Sachan and Eric Xing. 2016. Easy questions first? A case study on curriculum learning for question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 453–463.

Mrinmaya Sachan and Eric Xing. 2018. Self-training for jointly learning to ask and answer questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 629–640.

Shankar Iyer, Nikhil Dandekar, and Kornél Csernai. 2016. First Quora dataset release: Question pairs.

Burrhus F Skinner. 1958. Reinforcement today. American Psychologist, 13(3):94.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

James S Supancic and Deva Ramanan. 2013. Self-paced learning for long-term tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2379–2386.

Yi Tay, Shuohang Wang, Luu Anh Tuan, Jie Fu, Minh C Phan, Xingdi Yuan, Jinfeng Rao, Siu Cheung Hui, and Aston Zhang. 2019. Simple and effective curriculum pointer-generator networks for reading comprehension over long narratives. arXiv preprint arXiv:1905.10847.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2016. NewsQA: A machine comprehension dataset. arXiv preprint arXiv:1611.09830.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Yiru Wang, Weihao Gan, Jie Yang, Wei Wu, and Junjie Yan. 2019. Dynamic curriculum learning for imbalanced data classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 5017–5026.

Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
