
Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning

Alexandre Tamborrino*, Nicola Pellicanò*, Baptiste Pannier*, Pascal Voitot, Louise Naudin
Samsung Strategy and Innovation Center
{a.tamborrino,n.pellicano,b.pannier,p.voitot,l.naudin}@samsung.com
* Equal contribution.

Abstract

Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks (Devlin et al., 2019). Most of the existing approaches rely on a randomly initialized classifier on top of such networks. We argue that this fine-tuning procedure is sub-optimal as the pre-trained model has no prior on the specific classifier labels, while it might have already learned an intrinsic textual representation of the task. In this paper, we introduce a new scoring method that casts a plausibility ranking task in a full-text format and leverages the masked language modeling head tuned during the pre-training phase. We study commonsense reasoning tasks where the model must rank a set of hypotheses given a premise, focusing on the COPA (Gordon et al., 2012), Swag (Zellers et al., 2018), HellaSwag (Zellers et al., 2019) and CommonsenseQA (Talmor et al., 2019) datasets. By exploiting our scoring method without fine-tuning, we are able to produce strong baselines (e.g. 80% test accuracy on COPA) that are comparable to supervised approaches. Moreover, when fine-tuning directly on the proposed scoring function, we show that our method provides a much more stable training phase across random restarts (e.g. ×10 standard deviation reduction on COPA test accuracy) and requires less annotated data than the standard classifier approach to reach equivalent performances.

1 Introduction

Recent advances in natural language processing have been made using sequential transfer learning over large pre-trained transformer models. From these models, most NLP tasks can be addressed by adding a classifier on top of the transformer embedding outputs (Devlin et al., 2019; Liu et al., 2019).

In this paper, we tackle a subset of NLP tasks consisting in plausibility ranking. Such tasks can be formalised as follows: given a unique premise p and a set of hypotheses H = {h_i}_{i=1...n}, the task consists in returning the appropriate hypothesis h* ∈ H that matches p (see Section 3 for more details). A natural task that fits into this problem formulation is commonsense reasoning; it will therefore be the main focus of the present paper.

Traditionally, this problem is solved by jointly classifying each pair (p, h_i)_{i=1...n}. For instance, assuming a Masked Language Modeling (MLM) model is used, an example from the COPA dataset (Gordon et al., 2012) is commonly cast into two distinct examples:

• [CLS] The man broke his toe. [SEP] He dropped a hammer on his foot. [SEP] → correct
• [CLS] The man broke his toe. [SEP] He got a hole in his sock. [SEP] → incorrect

The special token [CLS] (used for sentence-level tasks) is then provided to a classifier in order to predict the label of the given example; [SEP] is a special separator token. This format will be referred to as separated-sentence. For such a task, the use of a randomly initialized head can appear sub-optimal since the pre-trained model does not integrate any prior on the specific classifier labels. To validate this intuition, we cast the MLM model inputs into a full-text format: the separation token is dropped and potentially replaced by conjunction words that are fully specific to the task. The previously illustrated correct example is then turned into: [CLS] The man broke his toe because he dropped a hammer on his foot [SEP]. Using this input format, we apply a new bidirectional word-level scoring function that leverages the MLM head (Devlin et al., 2019) tuned during the pre-training phase (see Figure 1 for an overview of the proposed approach).
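To make the two input formats concrete, the following minimal sketch (our own illustration, not the authors' released code) casts the COPA example above into both formats; the helper names and the handling of casing and punctuation are simplifying assumptions.

```python
# Illustrative sketch: casting a COPA-style sample into the two input formats
# discussed above. Function names and string handling are ours, not the paper's.

def to_separated_sentence(premise: str, hypothesis: str) -> str:
    # Two segments joined by the [SEP] special token, as in standard
    # BERT-style sentence-pair classification.
    return f"[CLS] {premise} [SEP] {hypothesis} [SEP]"

def to_full_text(premise: str, hypothesis: str, question: str) -> str:
    # The separator is replaced by a task-specific conjunction:
    # "because" for COPA cause questions, "so" for effect questions.
    conjunction = "because" if question == "cause" else "so"
    premise = premise.rstrip(".")
    hypothesis = hypothesis[0].lower() + hypothesis[1:]
    hypothesis = hypothesis.rstrip(".")
    return f"[CLS] {premise} {conjunction} {hypothesis} [SEP]"

premise = "The man broke his toe."
hypothesis = "He dropped a hammer on his foot."
print(to_separated_sentence(premise, hypothesis))
print(to_full_text(premise, hypothesis, question="cause"))
# -> [CLS] The man broke his toe because he dropped a hammer on his foot [SEP]
```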

This method produces strong zero-shot¹ baselines on the COPA (Gordon et al., 2012), Swag (Zellers et al., 2018), HellaSwag (Zellers et al., 2019) and CommonsenseQA (Talmor et al., 2019) datasets. Then, we fine-tune this new scoring function with a margin-based loss as proposed in (Li et al., 2019). Using RoBERTa-LARGE, our results reveal that this new training procedure leads to better accuracy and much more stable training trajectories, which is an important feature since large MLM models are known to be unstable on several tasks (Devlin et al., 2019; Phang et al., 2018). Finally, we find that a progressive decrease of the training dataset size results in a progressive increase of the accuracy gap between our proposed method and the standard classifier ones. This makes our method advantageous in a small dataset context.

¹ In the rest of the paper, we refer to the use of the pre-trained model without fine-tuning as the zero-shot setting.

2 Related Work

In (Trinh and Le, 2018), researchers have shown that an RNN language model pre-trained on a large amount of data can be used to efficiently score sentences in a zero-shot setting. They used the Winograd Schema Challenge (WSC-273) dataset (Levesque et al., 2012), which mostly consists of a pronoun disambiguation task that requires commonsense reasoning. In their approach, the pronoun to disambiguate is replaced by the different candidates. Then, each version of the sentence is scored using the likelihood of the sequence under the forward autoregressive factorization. They showed that targeting the likelihood of the tokens placed after the candidate words performs better than a full-sentence likelihood estimation. This result highlights the fact that the choice of the targeted sub-sequence for the likelihood estimation has an important impact on the overall performance of the model.

More recently, the analysis of relational knowledge contained in pre-trained BERT models has been the subject of different studies (Petroni et al., 2019; Poerner et al., 2019). Results have shown evidence that BERT models memorize reasoning about entity names and commonsense knowledge, making MLM models appropriate candidates for commonsense-oriented tasks.

From a supervised learning perspective, (Li et al., 2019) proposed to replace the traditional cross-entropy loss with a margin-based one on the COPA dataset. The authors argued that cross-entropy based methods are not adapted for plausibility ranking tasks since they force the scores to adopt extreme values (near 0 or 1). In contrast, a margin-based objective function appeared to be a natural way to rank a set of hypotheses. Both approaches were compared using the [CLS] token of the BERT-base model and a separated-sentence input format. The margin-based objective function surpassed the cross-entropy one by increasing the Test set accuracy from 73.4% to 75.4%.

Adopting a token-level scoring approach, (Kocijan et al., 2019) used a BERT model with a mixture of a margin-based and an MLM loss on WSC-273 to score the different pronouns to disambiguate. This approach allowed the authors to improve the previous state of the art by 8.8%. Despite being the closest method to the one proposed in this paper, our approach differs on three points:

• We generalize the scoring method by targeting different contiguous sub-sequences for the likelihood estimation. To do so, the different datasets are recast in a full-text format.

• We also focus on targeting the premise, avoiding the inner statistical biases of the different hypotheses (e.g. word frequencies, punctuation, variable sequence lengths, etc.).

• The objective of the present paper is to propose a direct comparison, in terms of accuracy and training stability across random restarts, between the proposed method and standard classifiers.

3 Method

3.1 Problem Formulation

Given an input premise p = (p^{(1)}, p^{(2)}, ..., p^{(L_p)}) and a set of candidate hypotheses

H = \{ h_i = (h_i^{(1)}, h_i^{(2)}, \dots, h_i^{(L_i)}) \}_{i=1\dots n},

we aim to identify the fitting hypothesis h* ∈ H which correctly matches p. The values L_p and {L_i}_{i=1...n} are the sequence lengths of the premise and of the hypotheses respectively. In a commonsense setting, such a problem corresponds to finding premise-hypothesis implications by exploiting some prior commonsense knowledge. Since our scoring method consumes input sequences in a full-text format (see Section 3.2), our method is formulated on a commonsense task but not limited to it.
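For illustration, a plausibility-ranking instance can be held in a minimal container such as the following sketch (the class and field names are ours, not part of the paper):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RankingInstance:
    # One plausibility-ranking example: a premise p, n candidate hypotheses
    # h_1..h_n, and (for labelled data) the index of the correct hypothesis h*.
    premise: str
    hypotheses: List[str]
    label: Optional[int] = None   # index of h* in `hypotheses`, None at test time

example = RankingInstance(
    premise="The man broke his toe.",
    hypotheses=["He got a hole in his sock.", "He dropped a hammer on his foot."],
    label=1,
)
```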

[Figure 1 appears here. It shows two full-text sequences processed by the same (shared-weights) RoBERTa sequence scoring model (SSM); for the gold sequence, five masked inputs are built, each masking a different premise word:

1. [CLS] [MASK] man broke his toe because he dropped a hammer on his foot [SEP]
2. [CLS] The [MASK] broke his toe because he dropped a hammer on his foot [SEP]
3. [CLS] The man [MASK] his toe because he dropped a hammer on his foot [SEP]
4. [CLS] The man broke [MASK] toe because he dropped a hammer on his foot [SEP]
5. [CLS] The man broke his [MASK] because he dropped a hammer on his foot [SEP] ]
Figure 1: Overview of the proposed method for the task t = COPA. Two full-text sequences (Section 3.1), s_true and s_false, are given as input (gold and distractor premise/hypothesis pairs respectively). Circled numbers explicitly mark input and output of the five different versions of a given sentence, each having a different premise word masked. The output probabilities P_i^{(k)} = P(p^{(k)} | s_i^{\backslash p^{(k)}}) contribute to the score computation (target premise score S_i^p in this example, see Section 3.2). When fine-tuning on the task is performed, gold and distractor scores are used for the margin-based loss computation (Section 3.3).

3.2 Sequence Scoring Method

The proposed Sequence Scoring Method (SSM) takes as input a pair ⟨p, h_i⟩ and returns a score representing the likelihood of h_i being implied by p.

First, a transform operator T converts the ⟨p, h_i⟩ pair into a full-text input. Such an operator, in its simplest form, just concatenates the two sequences. However, in general T can be constrained on the task t:

s_i^t = T^t(p, h_i) = (c_l^t, p, c_m^t, h_i, c_r^t),    (1)

where s_i^t is the resulting full-text input, while c_l^t, c_m^t and c_r^t are the left, middle and right conjunction sequences of the task. For example, Swag will have no conjunction, since the correct hypothesis is the natural continuation of the premise, while COPA will have because/so middle conjunctions due to its cause/effect nature (see Section 4).

Given the full-text input, the scorer aims to exploit the pre-training task of word masking in order to compute its result. Let us consider the masking of a word w which contributes to making sense of the matching between p and h_i. The intuition is that the confidence of the network in recovering such a word is directly related to the score of ⟨p, h_i⟩. Let us define, inspired by the notation of (Song et al., 2019), s_i^{\backslash w} as the sentence s_i with the tokens of w replaced by the [MASK] token.

The target premise score is calculated as follows:

S_i^p = \sum_{k=1}^{L_p} \log P\big( p^{(k)} \mid s_i^{\backslash p^{(k)}} \big),    (2)

where premise words are masked one by one in order to compute their relevance with respect to the given hypothesis. The masked-word probability is estimated from direct inference on a model pre-trained on the MLM task. The computational complexity of such a method grows linearly with L_p (requiring L_p examples per forward pass). Alternatively, the target hypothesis score is computed as:

S_i^h = \frac{1}{L_i} \sum_{k=1}^{L_i} \log P\big( h_i^{(k)} \mid s_i^{\backslash h_i^{(k)}} \big).    (3)

The target hypothesis score needs normalization by L_i in order to allow comparison between variable candidate hypothesis lengths. The best hypothesis is taken as the one maximizing the target premise (or hypothesis) score:

h^* = h_j \in H \ \text{s.t.} \ \max_{i=1\dots n} S_i^p = S_j^p.    (4)

As demonstrated in Section 5.2, the target premise score allows for a fairer comparison between different hypotheses. In fact, hypotheses present inherent differences in terms of statistical frequency of words and sequence length, and may exhibit more or less strong inter-dependency between words (e.g. composite words reinforce each other's confidence). Such variance could introduce a bias in the relative significance of each hypothesis alone (independently from the premise). On the opposite, different probabilities on the same target premise word can only be affected by the change of hypothesis context.

N-grams sequence scoring

We can extend the proposed SSM by scoring the reconstruction not only of single words, but of entire n-grams. Adding n-gram probabilities to the logarithmic mean combination not only robustifies the scoring method, but also helps to better model the joint probability of (dependent) close words, especially in a zero-shot setting. Let us note p^{(u:v)} the sub-sequence of p spanning between indexes u and v (included). The partial target premise score for g-grams (i.e. mask windows of size g) can be expressed as:

S_i^{p,g} = \sum_{k=1}^{L_p - g + 1} \log P\big( p^{(k:k+g-1)} \mid s_i^{\backslash p^{(k:k+g-1)}} \big).

By definition, the target premise score in Equation 2 is equivalent to the 1-gram partial target premise score (i.e. S_i^p ≡ S_i^{p,1}). The n-gram sequence scoring accumulates masked language model probabilities for every gram size up to n:

S_i^{p,[1,n]} = \sum_{g=1}^{n} S_i^{p,g}.    (5)
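To make Equation 2 concrete, here is a minimal sketch of the target premise score using a RoBERTa masked LM from the HuggingFace Transformers library (the library the implementation in Section 5 relies on). The code is our own illustration: it assumes a recent Transformers version, masks sub-word tokens rather than whole words, and ignores the n-gram extension.

```python
# Minimal sketch of the target-premise score of Eq. (2) with a RoBERTa MLM.
# Illustrative only: tokenisation details are simplified compared to the paper.

import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
model = RobertaForMaskedLM.from_pretrained("roberta-large").eval()

def target_premise_score(premise: str, hypothesis: str, conjunction: str = "because") -> float:
    """Sum over premise tokens of log P(p_k | s_i with p_k masked), Eq. (2)."""
    # Full-text input: <s> premise conjunction hypothesis </s>
    premise_ids = tokenizer(premise, add_special_tokens=False)["input_ids"]
    rest_ids = tokenizer(f" {conjunction} {hypothesis}", add_special_tokens=False)["input_ids"]
    input_ids = [tokenizer.cls_token_id] + premise_ids + rest_ids + [tokenizer.sep_token_id]

    # Build one masked copy per premise token and batch them in a single forward pass.
    batch = torch.tensor([input_ids] * len(premise_ids))
    positions = torch.arange(1, 1 + len(premise_ids))           # premise token positions
    targets = batch[torch.arange(len(premise_ids)), positions]  # original premise tokens
    batch[torch.arange(len(premise_ids)), positions] = tokenizer.mask_token_id

    with torch.no_grad():
        logits = model(batch).logits                            # (L_p, seq_len, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    token_scores = log_probs[torch.arange(len(premise_ids)), positions, targets]
    return token_scores.sum().item()

# The predicted hypothesis is the one with the highest score (Eq. 4), e.g.:
# best = max(hypotheses, key=lambda h: target_premise_score(premise, h, "because"))
```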
3.3 SSM-based fine-tuning

The proposed score function, since it does not imply the addition of any head module, can be directly applied without any retraining (see Section 5.2). It can also be directly used when fine-tuning on the task. The different masked inputs needed to compute the target premise score, {s_i^{\backslash p^{(j)}}}_{j=1..L_p}, are batched together in order to compute the score S_i^p in one forward pass. The model acts as a siamese network that performs an independent computation of the target premise score for each hypothesis h_i.

Loss function

As already noted in (Li et al., 2019), multiple-choice tasks (e.g. COPA) are more naturally expressed as learning-to-rank problems. For this reason we adopt a margin-based loss as objective function, in contrast to a cross-entropy loss. Given the ground-truth sentence index i*, the loss is specified as:

\mathcal{L} = \frac{1}{n} \sum_{\substack{i=1 \\ i \neq i^*}}^{n} \max\big(0,\ \eta - S_{i^*}^p + S_i^p\big),    (6)

where η is a margin threshold hyperparameter.

According to our preliminary experiments, we do not add a second MLM component to the general loss (as in (Kocijan et al., 2019)), since it always leads to a decrease of the model performance for various weighted contributions of the MLM term.
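As a hedged illustration of Equation 6, the sketch below (ours, not the paper's released code) computes the margin-based loss from a tensor of per-hypothesis target premise scores:

```python
# Sketch of the margin-based ranking loss of Eq. (6). `scores` holds the target
# premise scores S^p_i of all candidate hypotheses and is assumed to keep the
# autograd graph so the loss can be back-propagated through the MLM head.

import torch

def margin_ranking_loss(scores: torch.Tensor, gold_index: int, eta: float = 0.5) -> torch.Tensor:
    """scores: tensor of shape (n,), one S^p_i per candidate hypothesis."""
    n = scores.numel()
    hinge = torch.clamp(eta - scores[gold_index] + scores, min=0.0)
    mask = torch.ones(n)
    mask[gold_index] = 0.0           # the sum in Eq. (6) runs over i != i*
    return (hinge * mask).sum() / n

# Dummy example for a two-choice COPA sample (gold hypothesis at index 1):
loss = margin_ranking_loss(torch.tensor([-12.3, -9.8], requires_grad=True), gold_index=1)
```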

4 Datasets

The commonsense reasoning datasets that we focus on are COPA (Gordon et al., 2012), Swag (Zellers et al., 2018), HellaSwag (Zellers et al., 2019) and CommonsenseQA (Talmor et al., 2019). All these datasets share the premise-hypothesis task format. Table 1 shows examples of the full-text format and the separated-sentence format for all datasets.

COPA (effect)
  Full-text: [CLS] I knocked on my neighbor's door so my neighbor invited me in. [SEP]
  Separated-sentence: [CLS] I knocked on my neighbor's door. [SEP] My neighbor invited me in. [SEP]
COPA (cause)
  Full-text: [CLS] The man broke his toe because he dropped a hammer on his foot. [SEP]
  Separated-sentence: [CLS] He dropped a hammer on his foot. [SEP] The man broke his toe. [SEP]
CommonsenseQA
  Full-text: [CLS] Q: Where on a river can you hold a cup upright to catch water on a sunny day? A: waterfall [SEP]
  Separated-sentence: [CLS] Q: Where on a river can you hold a cup upright to catch water on a sunny day? [SEP] A: waterfall [SEP]
Swag
  Full-text: [CLS] We notice a man in a kayak and a yellow helmet coming in from the left. As he approaches, his kayak flips upside-down. [SEP]
  Separated-sentence: [CLS] We notice a man in a kayak and a yellow helmet coming in from the left. [SEP] As he approaches, his kayak flips upside-down. [SEP]
HellaSwag
  Full-text: [CLS] A man is standing in front of a camera. He starts playing a harmonica for the camera. He rocks back and forth to the music as he goes. [SEP]
  Separated-sentence: [CLS] A man is standing in front of a camera. He starts playing a harmonica for the camera. [SEP] He rocks back and forth to the music as he goes. [SEP]

Table 1: Examples of full-text format and separated-sentence format for gold premise-hypothesis pairs. The left conjunction c_l^t (e.g. "Q:") and the middle conjunction c_m^t (e.g. "so", "because", "A:") are the task-specific conjunction sequences of Equation 1.
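For these datasets, the transform operator T of Equation 1 essentially reduces to the conjunction choices of Table 1. The following sketch is our own illustrative rendering of it; the exact handling of casing, punctuation and special tokens is an assumption (the [CLS]/[SEP] tokens would be added later by the tokenizer).

```python
# Illustrative sketch of the transform operator T of Eq. (1): the pair <p, h_i>
# is wrapped with task-specific left/middle/right conjunction sequences
# (c_l, c_m, c_r), following the choices shown in Table 1.

def transform(task: str, premise: str, hypothesis: str, copa_question: str = "effect") -> str:
    if task == "copa":
        # cause samples use "because", effect samples use "so"
        c_l, c_m, c_r = "", " because " if copa_question == "cause" else " so ", ""
        premise = premise.rstrip(".")
        hypothesis = hypothesis[0].lower() + hypothesis[1:]
    elif task == "commonsenseqa":
        c_l, c_m, c_r = "Q: ", " A: ", ""
    else:  # swag / hellaswag: no conjunction, the hypothesis continues the premise
        c_l, c_m, c_r = "", " ", ""
    return c_l + premise + c_m + hypothesis + c_r

print(transform("copa", "The man broke his toe.", "He dropped a hammer on his foot.", "cause"))
# -> The man broke his toe because he dropped a hammer on his foot.
```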

COPA

COPA (Choice of Plausible Alternatives) (Gordon et al., 2012) is a commonsense causal reasoning task where two candidate hypotheses are given. COPA itself is composed of two sub-tasks: effect samples and cause samples. The effect and cause samples have respectively an implies and an implied-by relation with the correct hypothesis. The full-text format of COPA is built by using the conjunction words because (resp. so) as middle conjunctions for cause (resp. effect) samples. Concerning the separated-sentence format, we reverse the premise and hypothesis order for cause samples in order to convert all cause samples into effect samples. This has the benefit of presenting a unique task to the model, and our experiments show that this gives better results than keeping cause samples and effect samples unmodified. We choose the SuperGLUE split (Wang et al., 2019).

CommonsenseQA

CommonsenseQA (Talmor et al., 2019) is a multiple-choice commonsense question answering dataset where each question has one correct answer and four distractor answers. To create the full-text format, we prepend "Q: " to the question, "A: " to the answer, and then concatenate the question and the answer (each prefix ends with a space character). For the separated-sentence format, we also use the "Q: " and "A: " prefixes, following the recommendation from the FairSeq repository on how to fine-tune RoBERTa on CommonsenseQA.² Since the benchmark Test set is private, for our zero-shot and fine-tuning stability studies we split the original validation set evenly, treating the last 611 samples as a Test set (Test*).

² https://github.com/pytorch/fairseq/tree/master/examples/roberta/commonsense_qa

Swag and HellaSwag

Swag (Situations With Adversarial Generations) (Zellers et al., 2018) is a multiple-choice commonsense dataset about grounded situations. Each premise is a video caption with four answer choices about what might happen next in the scene. The correct answer is the video caption for the next event in the video. The other, negative answers are created via Adversarial Filtering: generated by language modeling models and filtered by discriminator models. HellaSwag (Zellers et al., 2019) is an evolved version of Swag using better generator and discriminator models for Adversarial Filtering. Since the benchmark test set is private, we evaluate our zero-shot setting on the Val set (we do not perform a fine-tuning study on Swag and HellaSwag, as explained in Section 5.3).

5 Experiments

In this section we first apply our scoring method in a zero-shot setting on the four aforementioned datasets. Then we fine-tune our scoring method while varying the percentage of the training data used, and compare it to approaches that use a randomly initialized classifier head. We use RoBERTa-LARGE (Liu et al., 2019) as our pre-trained model, as RoBERTa-LARGE fine-tuned with a classification layer on top has very competitive results on those datasets. Our implementation uses PyTorch and the HuggingFace Transformers library (Wolf et al., 2019).

5.1 Task probing

Before assessing our zero-shot and fine-tuning results, we perform a task probing by evaluating the zero-shot score we obtain by removing the premise from the input and only scoring the hypotheses. If this score is significantly better than a random baseline, it means that the task is not actually solved by commonsense reasoning, but by using statistical biases in the hypotheses. This probing method has already been used on several datasets to show that the underlying task was not really solved by the top-performing models (Niven and Kao, 2019; Zellers et al., 2019).

The results of the task probing evaluation are reported in Table 2. While COPA and CommonsenseQA have a hypothesis-only score close to the random baseline, the scores of both Swag and HellaSwag are significantly higher than their random baseline (more than twice). This confirms the study from (Zellers et al., 2019), which shows that Swag's false hypotheses were generated using a weak generator; the authors argue that fine-tuning a BERT model on Swag therefore learns to pick up the statistical cues left by the weak generator. Our results show that RoBERTa-LARGE can leverage these distributional biases without the fine-tuning phase. We argue that the human-written pre-training corpora of RoBERTa bias it to give better scores to human-written language than to model-generated sentences. As shown in (Holtzman et al., 2019), there are indeed still strong distributional differences between human text and machine text. Furthermore, our result also highlights that HellaSwag still exhibits a strong bias due to its generation scheme when evaluated with RoBERTa-LARGE.

  Dataset          Mode       Acc (%)
  COPA             hyp-only   54.6
                   random     50.0
  CommonsenseQA    hyp-only   22.0
                   random     20.0
  Swag             hyp-only   60.6
                   random     25.0
  HellaSwag        hyp-only   50.8
                   random     25.0

Table 2: Commonsense reasoning task probing. "hyp-only" stands for hypothesis only, "random" for the random baseline. COPA is evaluated on Test, CommonsenseQA on Test*, Swag and HellaSwag on Val (see Section 4).

5.2 Zero-shot Results

For both COPA and CommonsenseQA, the best performing scoring method uses the target premise and 4-gram settings, as shown in Tables 3 and 4. Targeting the premise gives better results than targeting the hypothesis, which reinforces our argument that targeting the hypothesis may be harder, as the differences between the hypotheses make the score comparison noisier. Also, more grams give increasingly better results, but the trend inverts after 4-grams, which may be due to the fact that masked models are not trained to mask large chunks of text. It is interesting to note that our zero-shot result is significantly better than a BERT-LARGE cross-entropy model fine-tuned on the COPA training set (80.0% vs. 70.6% accuracy) (Wang et al., 2019), while being comparable for CommonsenseQA.³ Moreover, when we intentionally switch the so and because conjunction words on COPA to make the samples erroneous, the accuracy drops significantly (64.4%). We reckon this is an indicator that our scoring method effectively reuses the pre-learned representation of the full-text format of the task.

³ https://www.tau-nlp.org/csqa-leaderboard

  Target       Grams   Test Acc (%)
  premise      1       74.0
  hypothesis   1       69.8
  premise      2       76.2
  premise      3       79.0
  premise      4       80.0
  premise      5       79.4

Table 3: COPA zero-shot results.

  Target       Grams   Test* Acc (%)
  premise      1       47.8
  hypothesis   1       37.4
  premise      2       53.2
  premise      3       53.7
  premise      4       56.1
  premise      5       55.2

Table 4: CommonsenseQA zero-shot results.

Concerning Swag and HellaSwag, the target hypothesis mode is significantly better than the target premise mode (see Table 5), as expected from our task probing work in Section 5.1. For example, on HellaSwag, the target hypothesis mode is only 8% better than the hypothesis-only mode (58.8% versus 50.8%), which confirms that in this setting our zero-shot method is mainly taking advantage of the bias in the hypotheses. Therefore we refrain from doing more zero-shot experiments on both datasets.
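As an illustration of the hypothesis-only probe used above, the following sketch (ours) computes the probe accuracy against the 1/n random baseline; `score_hypothesis_only` is a hypothetical stand-in for the length-normalised score of Equation 3 evaluated on an input that contains no premise.

```python
# Hedged sketch of the hypothesis-only probe of Section 5.1: score every
# candidate with the premise removed and compare accuracy to the 1/n baseline.

from typing import Callable, List, Tuple

def probe_accuracy(dataset: List[Tuple[str, List[str], int]],
                   score_hypothesis_only: Callable[[str], float]) -> Tuple[float, float]:
    correct = 0
    for _premise, hypotheses, label in dataset:       # the premise is deliberately ignored
        prediction = max(range(len(hypotheses)),
                         key=lambda i: score_hypothesis_only(hypotheses[i]))
        correct += int(prediction == label)
    n_choices = len(dataset[0][1])                    # assumes a fixed number of candidates
    return correct / len(dataset), 1.0 / n_choices    # (probe accuracy, random baseline)
```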
5.3 Fine-tuning Results

Following the strong bias of Swag and HellaSwag that was shown in Section 5.1 using our scoring method with RoBERTa-LARGE, we decide not to include them in our fine-tuning study, so as to compare results only on tasks for which models learn the actual premise-hypothesis commonsense reasoning.

[Figure 2 appears here: COPA fine-tuning results on the Test set. Panel (a) shows the accuracy distribution across random seeds and panel (b) the best accuracy, both as a function of the percentage of the training set used (10%, 50%, 100%) for the Head CE, Head Margin and Ours settings; the zero-shot 1-gram and 4-gram scores are shown as baselines in panel (b). The whole training set corresponds to 400 examples.]

[Figure 3 appears here: CommonsenseQA fine-tuning results on the Test* set, with the same layout as Figure 2 for training percentages 1%, 5%, 10% and 100%. The whole training set corresponds to 9741 examples.]

  Dataset     Target       Val Acc (%)
  Swag        premise      48.3
  Swag        hypothesis   72.5
  HellaSwag   premise      37.1
  HellaSwag   hypothesis   58.8

Table 5: Swag/HellaSwag zero-shot results (1-gram).

Comparison settings

In order to make fair comparisons, we train and compare three different model settings:

• Our scoring method with target premise mode, 1-gram, margin-based loss and full-text format (ours).

• A randomly initialized classifier with cross-entropy loss and separated-sentence format (head CE). The cross-entropy loss is computed on the probability of the correct candidate, normalized over all candidates in the set (see Equation (1) in (Li et al., 2019)).

• A randomly initialized classifier with margin-based loss and full-text format (head margin).

The head margin setting is an ablated version of our scoring method, used to verify that our reuse of the MLM head actually provides a significant advantage over a randomly initialized head. For our method, we report results only for the best performing scoring method, which is the target premise mode. Experiments showed us that varying the number of grams produces comparable results, so we use the 1-gram setting for computational efficiency. We reckon that the enriched bidirectional context granted by the n-gram score can be directly learned when fine-tuning on the task.

For each dataset, we train the three model settings with 20 random seeds each. For each seed, we pick the best performing model on the validation set and report its accuracy on the Test set. We then compute the max accuracy, mean accuracy and standard deviation of each model setting on the Test set. For all model settings, following the recommended hyper-parameters for fine-tuning RoBERTa-LARGE (Liu et al., 2019), we set a learning rate of 1e-5, a warm-up ratio of 6% of the total number of training steps, a linear learning rate decay and a weight decay of 0.01. We use a batch size of 8 for COPA (4 for the 10% training percentage setting) and 16 for CommonsenseQA. For the margin-based loss (ours and head margin), we set η = 0.5 after a few trials.
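As an illustration of the head CE objective described in the settings above, the following sketch (ours, with placeholder inputs) computes a cross-entropy normalised over the candidate set from one logit per (premise, hypothesis) pair:

```python
# Minimal sketch of the "head CE" baseline objective: a randomly initialised
# linear head maps the pooled [CLS] vector of each (premise, hypothesis) pair to
# a single logit, and the cross-entropy is taken over the candidates of the same
# set. The pooled vectors are placeholders here; in practice they would come
# from the separated-sentence inputs of the transformer.

import torch
import torch.nn.functional as F

def head_ce_loss(cls_vectors: torch.Tensor, head: torch.nn.Linear, gold_index: int) -> torch.Tensor:
    """cls_vectors: (n_candidates, hidden_size) pooled [CLS] representations."""
    logits = head(cls_vectors).squeeze(-1)            # one scalar logit per candidate
    return F.cross_entropy(logits.unsqueeze(0),       # softmax over the candidate set
                           torch.tensor([gold_index]))

# Dummy example (hidden size 1024 as in RoBERTa-large, two COPA candidates):
head = torch.nn.Linear(1024, 1)
loss = head_ce_loss(torch.randn(2, 1024), head, gold_index=1)
```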

COPA and CommonsenseQA results

On both COPA and CommonsenseQA, our method outperforms both the head CE and head margin methods in terms of mean accuracy and max/best accuracy (see Figure 2 and Figure 3). Moreover, we find that a progressive decrease of the training dataset size results in a progressive increase of the best-accuracy gap between our method and the other ones. This confirms our intuition that our method is the most advantageous when little training data is available.

For example, when using 1% of the training data of CommonsenseQA, our method achieves an accuracy of 56.7% on the Test* set (vs. 40.2% for the head CE approach). Using the whole training data, our approach still outperforms the other methods, but by a lower margin (76.4% accuracy versus 75.4% for head CE). In addition, when evaluated on the CommonsenseQA private Test set, our approach gets 71.6% accuracy, which is close to RoBERTa-LARGE cross-entropy (Liu et al., 2019) under an important hyper-parameter grid search⁴ (72.1% accuracy).

When using 100% of the COPA training set (400 train samples), our method outperforms the head CE setting by 5 points and the head margin setting by 3 points, achieving an accuracy of 92.4% on the Test set. This result allows our approach to reach the second place on the SuperGLUE leaderboard⁵ (Wang et al., 2019), between RoBERTa-LARGE (Liu et al., 2019) and the T5 model composed of 11 billion parameters (Raffel et al., 2019) (respectively 90.6% and 94.8% accuracy on the Test set).

We also notice that our method provides a much more stable training relative to the random seed, as shown by the box plots in Figures 2a and 3a. When training on the full COPA dataset, our method exhibits a ×10 standard deviation reduction on the test accuracy compared to the head CE setting (1.35% versus 12.8%). Our intuition is that the improved stability is due to the better reuse of the pre-trained model priors and the absence of new randomly initialized weights. This is an important result towards easier experiment comparisons, as fine-tuning BERT-like architectures is known to be unstable across random restarts, as shown in (Phang et al., 2018).

6 Conclusions

In this work, we presented a new method for plausibility ranking tasks, specifically targeting commonsense ranking problems. We define a scoring function that leverages the MLM head of large pre-trained bidirectional transformer models. We establish strong results in a zero-shot setting on four commonsense reasoning datasets, comparable to supervised approaches. We then fine-tune such a model using a margin-based loss on the proposed scoring function, and provide a comparative study with state-of-the-art randomly initialized head methods. Our study demonstrates that the direct use of the MLM head over a custom head yields an increasingly superior performance gain as the training data size decreases. The proposed approach outperforms state-of-the-art training methods in terms of both test accuracy and training stability.

Future work includes applying such a scoring method to broader classification tasks like Natural Language Inference and Sentiment Analysis. We also think that our token-level scoring method could be used during the self-supervised pre-training phase to extend traditional next sentence prediction and sequence ordering tasks, bringing more commonsense knowledge into the model.

⁴ https://github.com/pytorch/fairseq/tree/master/examples/roberta/commonsense_qa
⁵ https://super.gluebenchmark.com/leaderboard

References

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Andrew Gordon, Zornitsa Kozareva, and Melissa Roemmele. 2012. SemEval-2012 task 7: Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 394–398, Montréal, Canada. Association for Computational Linguistics.

Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. ArXiv, abs/1904.09751.

Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz. 2019. A surprisingly robust trick for the Winograd schema challenge. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4837–4842, Florence, Italy. Association for Computational Linguistics.

Hector J. Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning, KR'12, pages 552–561. AAAI Press.

Zhongyang Li, Tongfei Chen, and Benjamin Van Durme. 2019. Learning to rank for plausible plausibility. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4818–4823, Florence, Italy. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke S. Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692.

Timothy Niven and Hung-Yu Kao. 2019. Probing neural network comprehension of natural language arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4658–4664, Florence, Italy. Association for Computational Linguistics.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.

Jason Phang, Thibault Févry, and Samuel R. Bowman. 2018. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088.

Nina Poerner, Ulli Waltinger, and Hinrich Schütze. 2019. BERT is not a knowledge base (yet): Factual knowledge vs. name-based reasoning in unsupervised QA. arXiv preprint arXiv:1911.03681.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5926–5936, Long Beach, California, USA. PMLR.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota. Association for Computational Linguistics.

Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. ArXiv, abs/1806.02847.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 3261–3275. Curran Associates, Inc.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 93–104, Brussels, Belgium. Association for Computational Linguistics.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy. Association for Computational Linguistics.
