
Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning

Alexandre Tamborrino*, Nicola Pellicanò*, Baptiste Pannier*, Pascal Voitot, Louise Naudin
Samsung Strategy and Innovation Center
{a.tamborrino,n.pellicano,b.pannier,p.voitot,l.naudin}@samsung.com
* Equal contribution.

Abstract

Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks (Devlin et al., 2019). Most of the existing approaches rely on a randomly initialized classifier on top of such networks. We argue that this fine-tuning procedure is sub-optimal as the pre-trained model has no prior on the specific classifier labels, while it might have already learned an intrinsic textual representation of the task. In this paper, we introduce a new scoring method that casts a plausibility ranking task in a full-text format and leverages the masked language modeling head tuned during the pre-training phase. We study commonsense reasoning tasks where the model must rank a set of hypotheses given a premise, focusing on the COPA (Gordon et al., 2012), Swag (Zellers et al., 2018), HellaSwag (Zellers et al., 2019) and CommonsenseQA (Talmor et al., 2019) datasets. By exploiting our scoring method without fine-tuning, we are able to produce strong baselines (e.g. 80% test accuracy on COPA) that are comparable to supervised approaches. Moreover, when fine-tuning directly on the proposed scoring function, we show that our method provides a much more stable training phase across random restarts (e.g. ×10 standard deviation reduction on COPA test accuracy) and requires less annotated data than the standard classifier approach to reach equivalent performances.

1 Introduction

Recent advances in natural language processing have been made using sequential transfer learning over large pre-trained transformer models. From these models, most NLP tasks can be addressed by adding a classifier on top of the transformer embedding outputs (Devlin et al., 2019; Liu et al., 2019).

In this paper, we tackle a subset of NLP tasks consisting in plausibility ranking. Such tasks can be formalised as follows: given a unique premise p and a set of hypotheses H = {h_i}_{i=1...n}, the task consists in returning the appropriate hypothesis h* ∈ H that matches p (see Section 3 for more details). A natural task that fits into this problem formulation is commonsense reasoning; it will therefore be the main focus of the present paper.

Traditionally, this problem is solved by jointly classifying each pair (p, h_i)_{i=1...n}. For instance, assuming a Masked Language Modeling (MLM) model is used, an example from the COPA dataset (Gordon et al., 2012) is commonly cast into two distinct examples:

• [CLS] The man broke his toe. [SEP] He dropped a hammer on his foot. [SEP] → correct
• [CLS] The man broke his toe. [SEP] He got a hole in his sock. [SEP] → incorrect

The special token [CLS] (used for sentence-level tasks) is then provided to a classifier in order to predict the label of the given example; [SEP] is a special separator token. This format will be referred to as separated-sentence. For such a task, the use of a randomly initialized head can appear sub-optimal since the pre-trained model does not integrate any prior on the specific classifier labels. To validate this intuition, we cast the MLM model inputs into a full-text format: the separation token is dropped and potentially replaced by conjunction words that are fully specific to the task. The previously illustrated correct example is then turned into: [CLS] The man broke his toe because he dropped a hammer on his foot [SEP]. Using this input format, we apply a new bidirectional word-level scoring function that leverages the MLM head (Devlin et al., 2019) tuned during the pre-training phase (see Figure 1 for an overview of the proposed approach).
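To make the two input formats concrete, the following minimal sketch (our own illustration, not the authors' released code) casts the COPA example above into both formats; the helper names and the handling of casing and punctuation are simplifying assumptions.

```python
# Illustrative sketch: casting a COPA-style sample into the two input formats
# discussed above. Function names and string handling are ours, not the paper's.

def to_separated_sentence(premise: str, hypothesis: str) -> str:
    # Two segments joined by the [SEP] special token, as in standard
    # BERT-style sentence-pair classification.
    return f"[CLS] {premise} [SEP] {hypothesis} [SEP]"

def to_full_text(premise: str, hypothesis: str, question: str) -> str:
    # The separator is replaced by a task-specific conjunction:
    # "because" for COPA cause questions, "so" for effect questions.
    conjunction = "because" if question == "cause" else "so"
    premise = premise.rstrip(".")
    hypothesis = hypothesis[0].lower() + hypothesis[1:]
    hypothesis = hypothesis.rstrip(".")
    return f"[CLS] {premise} {conjunction} {hypothesis} [SEP]"

premise = "The man broke his toe."
hypothesis = "He dropped a hammer on his foot."
print(to_separated_sentence(premise, hypothesis))
print(to_full_text(premise, hypothesis, question="cause"))
# -> [CLS] The man broke his toe because he dropped a hammer on his foot [SEP]
```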

This method produces strong zero-shot¹ baselines on the COPA (Gordon et al., 2012), Swag (Zellers et al., 2018), HellaSwag (Zellers et al., 2019) and CommonsenseQA (Talmor et al., 2019) datasets. Then, we fine-tune this new scoring function with a margin-based loss as proposed in (Li et al., 2019). Using RoBERTa-LARGE, our results reveal that this new training procedure leads to better accuracy and much more stable training trajectories, which is an important feature since large MLM models are known to be unstable on several tasks (Devlin et al., 2019; Phang et al., 2018). Finally, we find that a progressive decrease of the training dataset size results in a progressive increase of the accuracy gap between our proposed method and the standard classifier ones. This makes our method advantageous in a small dataset context.

¹ In the rest of the paper, we refer to the use of the pre-trained model without fine-tuning as the zero-shot setting.

2 Related Work

In (Trinh and Le, 2018), researchers have shown that an RNN language model pre-trained on a large amount of data can be used to efficiently score sentences in a zero-shot setting. They used the Winograd Schema Challenge (WSC-273) dataset (Levesque et al., 2012), which mostly consists of a pronoun disambiguation task that requires commonsense reasoning. In their approach, the pronoun to disambiguate is replaced by the different candidates. Then, each version of the sentence is scored using the likelihood of the sequence under the forward autoregressive factorization. They showed that targeting the likelihood of the tokens placed after the candidate words performs better than a full-sentence likelihood estimation. This result highlights the fact that the choice of the targeted sub-sequence for the likelihood estimation has an important impact on the overall performance of the model.

More recently, the analysis of relational knowledge contained in pre-trained BERT models has been the subject of different studies (Petroni et al., 2019; Poerner et al., 2019). Results have shown evidence that BERT models memorize reasoning about entity names and commonsense knowledge, making MLM models appropriate candidates for commonsense-oriented tasks.

From a supervised learning perspective, (Li et al., 2019) proposed to replace the traditional cross-entropy loss with a margin-based one on the COPA dataset. The authors argued that cross-entropy based methods are not adapted for plausibility ranking tasks since they force the scores to adopt extreme values (near 0 or 1). In contrast, a margin-based objective function appeared to be a natural way to rank a set of hypotheses. Both approaches were compared using the [CLS] token of the BERT-base model and a separated-sentence input format. The margin-based objective function surpassed the cross-entropy one by increasing the Test set accuracy from 73.4% to 75.4%.

Adopting a token-level scoring approach, (Kocijan et al., 2019) used a BERT model with a mixture of a margin-based and an MLM loss on WSC-273 to score the different pronouns to disambiguate. This approach allowed the authors to improve the previous state of the art by 8.8%. Despite being the closest method to the one proposed in this paper, our approach differs on three points:

• We generalize the scoring method by targeting different contiguous sub-sequences for the likelihood estimation. To do so, the different datasets are recast in a full-text format.

• We also focus on targeting the premise, avoiding the inner statistical biases of the different hypotheses (e.g. word frequencies, punctuation, variable sequence lengths, etc.).

• The objective of the present paper is to propose a direct comparison, in terms of accuracy and training stability across random restarts, between the proposed method and standard classifiers.

3 Method

3.1 Problem Formulation

Given an input premise p = (p^{(1)}, p^{(2)}, ..., p^{(L_p)}) and a set of candidate hypotheses

H = \{ h_i = (h_i^{(1)}, h_i^{(2)}, \dots, h_i^{(L_i)}) \}_{i=1\dots n},

we aim to identify the fitting hypothesis h* ∈ H which correctly matches p. The values L_p and {L_i}_{i=1...n} are the sequence lengths of the premise and of the hypotheses respectively. In a commonsense setting, such a problem corresponds to finding premise-hypothesis implications by exploiting some prior commonsense knowledge. Since our scoring method consumes input sequences in a full-text format (see Section 3.2), our method is formulated on a commonsense task but not limited to it.
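For illustration, a plausibility-ranking instance can be held in a minimal container such as the following sketch (the class and field names are ours, not part of the paper):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RankingInstance:
    # One plausibility-ranking example: a premise p, n candidate hypotheses
    # h_1..h_n, and (for labelled data) the index of the correct hypothesis h*.
    premise: str
    hypotheses: List[str]
    label: Optional[int] = None   # index of h* in `hypotheses`, None at test time

example = RankingInstance(
    premise="The man broke his toe.",
    hypotheses=["He got a hole in his sock.", "He dropped a hammer on his foot."],
    label=1,
)
```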

[Figure 1 appears here. It shows two full-text sequences processed by the same (shared-weights) RoBERTa sequence scoring model (SSM); for the gold sequence, five masked inputs are built, each masking a different premise word:

1. [CLS] [MASK] man broke his toe because he dropped a hammer on his foot [SEP]
2. [CLS] The [MASK] broke his toe because he dropped a hammer on his foot [SEP]
3. [CLS] The man [MASK] his toe because he dropped a hammer on his foot [SEP]
4. [CLS] The man broke [MASK] toe because he dropped a hammer on his foot [SEP]
5. [CLS] The man broke his [MASK] because he dropped a hammer on his foot [SEP] ]
Figure 1: Overview of the proposed method for the task t = COPA. Two full-text sequences (Section 3.1), s_true and s_false, are given as input (gold and distractor premise/hypothesis pairs respectively). Circled numbers explicitly mark input and output of the five different versions of a given sentence, each having a different premise word masked. The output probabilities P_i^{(k)} = P(p^{(k)} | s_i^{\backslash p^{(k)}}) contribute to the score computation (target premise score S_i^p in this example, see Section 3.2). When fine-tuning on the task is performed, gold and distractor scores are used for the margin-based loss computation (Section 3.3).

3.2 Sequence Scoring Method

The proposed Sequence Scoring Method (SSM) takes as input a pair ⟨p, h_i⟩ and returns a score representing the likelihood of h_i being implied by p.

First, a transform operator T converts the ⟨p, h_i⟩ pair into a full-text input. Such an operator, in its simplest form, just concatenates the two sequences. However, in general T can be constrained on the task t:

s_i^t = T^t(p, h_i) = (c_l^t, p, c_m^t, h_i, c_r^t),    (1)

where s_i^t is the resulting full-text input, while c_l^t, c_m^t and c_r^t are the left, middle and right conjunction sequences of the task. For example, Swag will have no conjunction, since the correct hypothesis is the natural continuation of the premise, while COPA will have because/so middle conjunctions due to its cause/effect nature (see Section 4).

Given the full-text input, the scorer aims to exploit the pre-training task of word masking in order to compute its result. Let us consider the masking of a word w which contributes to making sense of the matching between p and h_i. The intuition is that the confidence of the network in recovering such a word is directly related to the score of ⟨p, h_i⟩. Let us define, inspired by the notation of (Song et al., 2019), s_i^{\backslash w} as the sentence s_i with the tokens of w replaced by the [MASK] token.

The target premise score is calculated as follows:

S_i^p = \sum_{k=1}^{L_p} \log P\big( p^{(k)} \mid s_i^{\backslash p^{(k)}} \big),    (2)

where premise words are masked one by one in order to compute their relevance with respect to the given hypothesis. The masked-word probability is estimated from direct inference on a model pre-trained on the MLM task. The computational complexity of such a method grows linearly with L_p (requiring L_p examples per forward pass). Alternatively, the target hypothesis score is computed as:

S_i^h = \frac{1}{L_i} \sum_{k=1}^{L_i} \log P\big( h_i^{(k)} \mid s_i^{\backslash h_i^{(k)}} \big).    (3)

The target hypothesis score needs normalization by L_i in order to allow comparison between variable candidate hypothesis lengths. The best hypothesis is taken as the one maximizing the target premise (or hypothesis) score:

h^* = h_j \in H \ \text{s.t.} \ \max_{i=1\dots n} S_i^p = S_j^p.    (4)

As demonstrated in Section 5.2, the target premise score allows for a fairer comparison between different hypotheses. In fact, hypotheses present inherent differences in terms of statistical frequency of words and sequence length, and may exhibit more or less strong inter-dependency between words (e.g. composite words reinforce each other's confidence). Such variance could introduce a bias in the relative significance of each hypothesis alone (independently from the premise). On the opposite, different probabilities on the same target premise word can only be affected by the change of hypothesis context.

N-grams sequence scoring

We can extend the proposed SSM by scoring the reconstruction not only of single words, but of entire n-grams. Adding n-gram probabilities to the logarithmic mean combination not only robustifies the scoring method, but also helps to better model the joint probability of (dependent) close words, especially in a zero-shot setting. Let us note p^{(u:v)} the sub-sequence of p spanning between indexes u and v (included). The partial target premise score for g-grams (i.e. mask windows of size g) can be expressed as:

S_i^{p,g} = \sum_{k=1}^{L_p - g + 1} \log P\big( p^{(k:k+g-1)} \mid s_i^{\backslash p^{(k:k+g-1)}} \big).

By definition, the target premise score in Equation 2 is equivalent to the 1-gram partial target premise score (i.e. S_i^p ≡ S_i^{p,1}). The n-gram sequence scoring accumulates masked language model probabilities for every gram size up to n:

S_i^{p,[1,n]} = \sum_{g=1}^{n} S_i^{p,g}.    (5)
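To make Equation 2 concrete, here is a minimal sketch of the target premise score using a RoBERTa masked LM from the HuggingFace Transformers library (the library the implementation in Section 5 relies on). The code is our own illustration: it assumes a recent Transformers version, masks sub-word tokens rather than whole words, and ignores the n-gram extension.

```python
# Minimal sketch of the target-premise score of Eq. (2) with a RoBERTa MLM.
# Illustrative only: tokenisation details are simplified compared to the paper.

import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
model = RobertaForMaskedLM.from_pretrained("roberta-large").eval()

def target_premise_score(premise: str, hypothesis: str, conjunction: str = "because") -> float:
    """Sum over premise tokens of log P(p_k | s_i with p_k masked), Eq. (2)."""
    # Full-text input: <s> premise conjunction hypothesis </s>
    premise_ids = tokenizer(premise, add_special_tokens=False)["input_ids"]
    rest_ids = tokenizer(f" {conjunction} {hypothesis}", add_special_tokens=False)["input_ids"]
    input_ids = [tokenizer.cls_token_id] + premise_ids + rest_ids + [tokenizer.sep_token_id]

    # Build one masked copy per premise token and batch them in a single forward pass.
    batch = torch.tensor([input_ids] * len(premise_ids))
    positions = torch.arange(1, 1 + len(premise_ids))           # premise token positions
    targets = batch[torch.arange(len(premise_ids)), positions]  # original premise tokens
    batch[torch.arange(len(premise_ids)), positions] = tokenizer.mask_token_id

    with torch.no_grad():
        logits = model(batch).logits                            # (L_p, seq_len, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    token_scores = log_probs[torch.arange(len(premise_ids)), positions, targets]
    return token_scores.sum().item()

# The predicted hypothesis is the one with the highest score (Eq. 4), e.g.:
# best = max(hypotheses, key=lambda h: target_premise_score(premise, h, "because"))
```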
3.3 SSM-based fine-tuning

The proposed score function, since it does not imply the addition of any head module, can be directly applied without any retraining (see Section 5.2). It can also be directly used when fine-tuning on the task. The different masked inputs needed to compute the target premise score, {s_i^{\backslash p^{(j)}}}_{j=1..L_p}, are batched together in order to compute the score S_i^p in one forward pass. The model acts as a siamese network that performs an independent computation of the target premise score for each hypothesis h_i.

Loss function

As already noted in (Li et al., 2019), multiple-choice tasks (e.g. COPA) are more naturally expressed as learning-to-rank problems. For this reason we adopt a margin-based loss as objective function, in contrast to a cross-entropy loss. Given the ground-truth sentence index i*, the loss is specified as:

\mathcal{L} = \frac{1}{n} \sum_{\substack{i=1 \\ i \neq i^*}}^{n} \max\big(0,\ \eta - S_{i^*}^p + S_i^p\big),    (6)

where η is a margin threshold hyperparameter.

According to our preliminary experiments, we do not add a second MLM component to the general loss (as in (Kocijan et al., 2019)), since it always leads to a decrease of the model performance for various weighted contributions of the MLM term.
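As a hedged illustration of Equation 6, the sketch below (ours, not the paper's released code) computes the margin-based loss from a tensor of per-hypothesis target premise scores:

```python
# Sketch of the margin-based ranking loss of Eq. (6). `scores` holds the target
# premise scores S^p_i of all candidate hypotheses and is assumed to keep the
# autograd graph so the loss can be back-propagated through the MLM head.

import torch

def margin_ranking_loss(scores: torch.Tensor, gold_index: int, eta: float = 0.5) -> torch.Tensor:
    """scores: tensor of shape (n,), one S^p_i per candidate hypothesis."""
    n = scores.numel()
    hinge = torch.clamp(eta - scores[gold_index] + scores, min=0.0)
    mask = torch.ones(n)
    mask[gold_index] = 0.0           # the sum in Eq. (6) runs over i != i*
    return (hinge * mask).sum() / n

# Dummy example for a two-choice COPA sample (gold hypothesis at index 1):
loss = margin_ranking_loss(torch.tensor([-12.3, -9.8], requires_grad=True), gold_index=1)
```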

4 Datasets

The commonsense reasoning datasets that we focus on are COPA (Gordon et al., 2012), Swag (Zellers et al., 2018), HellaSwag (Zellers et al., 2019) and CommonsenseQA (Talmor et al., 2019). All these datasets share the premise-hypothesis task format. Table 1 shows examples of the full-text format and the separated-sentence format for all datasets.

COPA (effect)
  Full-text: [CLS] I knocked on my neighbor's door so my neighbor invited me in. [SEP]
  Separated-sentence: [CLS] I knocked on my neighbor's door. [SEP] My neighbor invited me in. [SEP]
COPA (cause)
  Full-text: [CLS] The man broke his toe because he dropped a hammer on his foot. [SEP]
  Separated-sentence: [CLS] He dropped a hammer on his foot. [SEP] The man broke his toe. [SEP]
CommonsenseQA
  Full-text: [CLS] Q: Where on a river can you hold a cup upright to catch water on a sunny day? A: waterfall [SEP]
  Separated-sentence: [CLS] Q: Where on a river can you hold a cup upright to catch water on a sunny day? [SEP] A: waterfall [SEP]
Swag
  Full-text: [CLS] We notice a man in a kayak and a yellow helmet coming in from the left. As he approaches, his kayak flips upside-down. [SEP]
  Separated-sentence: [CLS] We notice a man in a kayak and a yellow helmet coming in from the left. [SEP] As he approaches, his kayak flips upside-down. [SEP]
HellaSwag
  Full-text: [CLS] A man is standing in front of a camera. He starts playing a harmonica for the camera. He rocks back and forth to the music as he goes. [SEP]
  Separated-sentence: [CLS] A man is standing in front of a camera. He starts playing a harmonica for the camera. [SEP] He rocks back and forth to the music as he goes. [SEP]

Table 1: Examples of full-text format and separated-sentence format for gold premise-hypothesis pairs. The left conjunction c_l^t (e.g. "Q:") and the middle conjunction c_m^t (e.g. "so", "because", "A:") are the task-specific conjunction sequences of Equation 1.
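For these datasets, the transform operator T of Equation 1 essentially reduces to the conjunction choices of Table 1. The following sketch is our own illustrative rendering of it; the exact handling of casing, punctuation and special tokens is an assumption (the [CLS]/[SEP] tokens would be added later by the tokenizer).

```python
# Illustrative sketch of the transform operator T of Eq. (1): the pair <p, h_i>
# is wrapped with task-specific left/middle/right conjunction sequences
# (c_l, c_m, c_r), following the choices shown in Table 1.

def transform(task: str, premise: str, hypothesis: str, copa_question: str = "effect") -> str:
    if task == "copa":
        # cause samples use "because", effect samples use "so"
        c_l, c_m, c_r = "", " because " if copa_question == "cause" else " so ", ""
        premise = premise.rstrip(".")
        hypothesis = hypothesis[0].lower() + hypothesis[1:]
    elif task == "commonsenseqa":
        c_l, c_m, c_r = "Q: ", " A: ", ""
    else:  # swag / hellaswag: no conjunction, the hypothesis continues the premise
        c_l, c_m, c_r = "", " ", ""
    return c_l + premise + c_m + hypothesis + c_r

print(transform("copa", "The man broke his toe.", "He dropped a hammer on his foot.", "cause"))
# -> The man broke his toe because he dropped a hammer on his foot.
```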

COPA

COPA (Choice of Plausible Alternatives) (Gordon et al., 2012) is a commonsense causal reasoning task where two candidate hypotheses are given. COPA itself is composed of two sub-tasks: effect samples and cause samples. The effect and cause samples have respectively an implies and an implied-by relation with the correct hypothesis. The full-text format of COPA is built by using the conjunction words because (resp. so) as middle conjunctions for cause (resp. effect) samples. Concerning the separated-sentence format, we reverse the premise and hypothesis order for cause samples in order to convert all cause samples into effect samples. This has the benefit of presenting a unique task to the model, and our experiments show that this gives better results than keeping cause samples and effect samples unmodified. We choose the SuperGLUE split (Wang et al., 2019).

CommonsenseQA

CommonsenseQA (Talmor et al., 2019) is a multiple-choice commonsense question answering dataset where each question has one correct answer and four distractor answers. To create the full-text format, we prepend "Q: " to the question, "A: " to the answer, and then concatenate the question and the answer (each prefix ends with a space character). For the separated-sentence format, we also use the "Q: " and "A: " prefixes, following the recommendation from the FairSeq repository on how to fine-tune RoBERTa on CommonsenseQA.² Since the benchmark Test set is private, for our zero-shot and fine-tuning stability studies we split the original validation set evenly, treating the last 611 samples as a Test set (Test*).

² https://github.com/pytorch/fairseq/tree/master/examples/roberta/commonsense_qa

Swag and HellaSwag

Swag (Situations With Adversarial Generations) (Zellers et al., 2018) is a multiple-choice commonsense dataset about grounded situations. Each premise is a video caption with four answer choices about what might happen next in the scene. The correct answer is the video caption for the next event in the video. The other, negative answers are created via Adversarial Filtering: generated by language modeling models and filtered by discriminator models. HellaSwag (Zellers et al., 2019) is an evolved version of Swag using better generator and discriminator models for Adversarial Filtering. Since the benchmark test set is private, we evaluate our zero-shot setting on the Val set (we do not perform a fine-tuning study on Swag and HellaSwag, as explained in Section 5.3).

5 Experiments

In this section we first apply our scoring method in a zero-shot setting on the four aforementioned datasets. Then we fine-tune our scoring method while varying the percentage of the training data used, and compare it to approaches that use a randomly initialized classifier head. We use RoBERTa-LARGE (Liu et al., 2019) as our pre-trained model, as RoBERTa-LARGE fine-tuned with a classification layer on top has very competitive results on those datasets. Our implementation uses PyTorch and the HuggingFace Transformers library (Wolf et al., 2019).

5.1 Task probing

Before assessing our zero-shot and fine-tuning results, we perform a task probing by evaluating the zero-shot score we obtain by removing the premise from the input and only scoring the hypotheses. If this score is significantly better than a random baseline, it means that the task is not actually solved by commonsense reasoning, but by using statistical biases in the hypotheses. This probing method has already been used on several datasets to show that the underlying task was not really solved by the top-performing models (Niven and Kao, 2019; Zellers et al., 2019).

The results of the task probing evaluation are reported in Table 2. While COPA and CommonsenseQA have a hypothesis-only score close to the random baseline, the scores of both Swag and HellaSwag are significantly higher than their random baseline (more than twice). This confirms the study from (Zellers et al., 2019), which shows that Swag's false hypotheses were generated using a weak generator; the authors argue that fine-tuning a BERT model on Swag therefore learns to pick up the statistical cues left by the weak generator. Our results show that RoBERTa-LARGE can leverage these distributional biases without the fine-tuning phase. We argue that the human-written pre-training corpora of RoBERTa bias it to give better scores to human-written language than to model-generated sentences. As shown in (Holtzman et al., 2019), there are indeed still strong distributional differences between human text and machine text. Furthermore, our result also highlights that HellaSwag still exhibits a strong bias due to its generation scheme when evaluated with RoBERTa-LARGE.

  Dataset          Mode       Acc (%)
  COPA             hyp-only   54.6
                   random     50.0
  CommonsenseQA    hyp-only   22.0
                   random     20.0
  Swag             hyp-only   60.6
                   random     25.0
  HellaSwag        hyp-only   50.8
                   random     25.0

Table 2: Commonsense reasoning task probing. "hyp-only" stands for hypothesis only, "random" for the random baseline. COPA is evaluated on Test, CommonsenseQA on Test*, Swag and HellaSwag on Val (see Section 4).

5.2 Zero-shot Results

For both COPA and CommonsenseQA, the best performing scoring method uses the target premise and 4-gram settings, as shown in Tables 3 and 4. Targeting the premise gives better results than targeting the hypothesis, which reinforces our argument that targeting the hypothesis may be harder, as the differences between the hypotheses make the score comparison noisier. Also, more grams give increasingly better results, but the trend inverts after 4-grams, which may be due to the fact that masked models are not trained to mask large chunks of text. It is interesting to note that our zero-shot result is significantly better than a BERT-LARGE cross-entropy model fine-tuned on the COPA training set (80.0% vs. 70.6% accuracy) (Wang et al., 2019), while being comparable for CommonsenseQA.³ Moreover, when we intentionally switch the so and because conjunction words on COPA to make the samples erroneous, the accuracy drops significantly (64.4%). We reckon this is an indicator that our scoring method effectively reuses the pre-learned representation of the full-text format of the task.

³ https://www.tau-nlp.org/csqa-leaderboard

  Target       Grams   Test Acc (%)
  premise      1       74.0
  hypothesis   1       69.8
  premise      2       76.2
  premise      3       79.0
  premise      4       80.0
  premise      5       79.4

Table 3: COPA zero-shot results.

  Target       Grams   Test* Acc (%)
  premise      1       47.8
  hypothesis   1       37.4
  premise      2       53.2
  premise      3       53.7
  premise      4       56.1
  premise      5       55.2

Table 4: CommonsenseQA zero-shot results.

Concerning Swag and HellaSwag, the target hypothesis mode is significantly better than the target premise mode (see Table 5), as expected from our task probing work in Section 5.1. For example, on HellaSwag, the target hypothesis mode is only 8% better than the hypothesis-only mode (58.8% versus 50.8%), which confirms that in this setting our zero-shot method is mainly taking advantage of the bias in the hypotheses. Therefore we refrain from doing more zero-shot experiments on both datasets.
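As an illustration of the hypothesis-only probe used above, the following sketch (ours) computes the probe accuracy against the 1/n random baseline; `score_hypothesis_only` is a hypothetical stand-in for the length-normalised score of Equation 3 evaluated on an input that contains no premise.

```python
# Hedged sketch of the hypothesis-only probe of Section 5.1: score every
# candidate with the premise removed and compare accuracy to the 1/n baseline.

from typing import Callable, List, Tuple

def probe_accuracy(dataset: List[Tuple[str, List[str], int]],
                   score_hypothesis_only: Callable[[str], float]) -> Tuple[float, float]:
    correct = 0
    for _premise, hypotheses, label in dataset:       # the premise is deliberately ignored
        prediction = max(range(len(hypotheses)),
                         key=lambda i: score_hypothesis_only(hypotheses[i]))
        correct += int(prediction == label)
    n_choices = len(dataset[0][1])                    # assumes a fixed number of candidates
    return correct / len(dataset), 1.0 / n_choices    # (probe accuracy, random baseline)
```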
5.3 Fine-tuning Results

Following the strong bias of Swag and HellaSwag that was shown in Section 5.1 using our scoring method with RoBERTa-LARGE, we decide not to include them in our fine-tuning study, so as to compare results only on tasks for which models learn the actual premise-hypothesis commonsense reasoning.

[Figure 2 appears here: COPA fine-tuning results on the Test set. Panel (a) shows the accuracy distribution across random seeds and panel (b) the best accuracy, both as a function of the percentage of the training set used (10%, 50%, 100%) for the Head CE, Head Margin and Ours settings; the zero-shot 1-gram and 4-gram scores are shown as baselines in panel (b). The whole training set corresponds to 400 examples.]

[Figure 3 appears here: CommonsenseQA fine-tuning results on the Test* set, with the same layout as Figure 2 for training percentages 1%, 5%, 10% and 100%. The whole training set corresponds to 9741 examples.]

  Dataset     Target       Val Acc (%)
  Swag        premise      48.3
  Swag        hypothesis   72.5
  HellaSwag   premise      37.1
  HellaSwag   hypothesis   58.8

Table 5: Swag/HellaSwag zero-shot results (1-gram).

Comparison settings

In order to make fair comparisons, we train and compare three different model settings:

• Our scoring method with target premise mode, 1-gram, margin-based loss and full-text format (ours).

• A randomly initialized classifier with cross-entropy loss and separated-sentence format (head CE). The cross-entropy loss is computed on the probability of the correct candidate, normalized over all candidates in the set (see Equation (1) in (Li et al., 2019)).

• A randomly initialized classifier with margin-based loss and full-text format (head margin).

The head margin setting is an ablated version of our scoring method, used to verify that our reuse of the MLM head actually provides a significant advantage over a randomly initialized head. For our method, we report results only for the best performing scoring method, which is the target premise mode. Experiments showed us that varying the number of grams produces comparable results, so we use the 1-gram setting for computational efficiency. We reckon that the enriched bidirectional context granted by the n-gram score can be directly learned when fine-tuning on the task.

For each dataset, we train the three model settings with 20 random seeds each. For each seed, we pick the best performing model on the validation set and report its accuracy on the Test set. We then compute the max accuracy, mean accuracy and standard deviation of each model setting on the Test set. For all model settings, following the recommended hyper-parameters for fine-tuning RoBERTa-LARGE (Liu et al., 2019), we set a learning rate of 1e-5, a warm-up ratio of 6% of the total number of training steps, a linear learning rate decay and a weight decay of 0.01. We use a batch size of 8 for COPA (4 for the 10% training percentage setting) and 16 for CommonsenseQA. For the margin-based loss (ours and head margin), we set η = 0.5 after a few trials.
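As an illustration of the head CE objective described in the settings above, the following sketch (ours, with placeholder inputs) computes a cross-entropy normalised over the candidate set from one logit per (premise, hypothesis) pair:

```python
# Minimal sketch of the "head CE" baseline objective: a randomly initialised
# linear head maps the pooled [CLS] vector of each (premise, hypothesis) pair to
# a single logit, and the cross-entropy is taken over the candidates of the same
# set. The pooled vectors are placeholders here; in practice they would come
# from the separated-sentence inputs of the transformer.

import torch
import torch.nn.functional as F

def head_ce_loss(cls_vectors: torch.Tensor, head: torch.nn.Linear, gold_index: int) -> torch.Tensor:
    """cls_vectors: (n_candidates, hidden_size) pooled [CLS] representations."""
    logits = head(cls_vectors).squeeze(-1)            # one scalar logit per candidate
    return F.cross_entropy(logits.unsqueeze(0),       # softmax over the candidate set
                           torch.tensor([gold_index]))

# Dummy example (hidden size 1024 as in RoBERTa-large, two COPA candidates):
head = torch.nn.Linear(1024, 1)
loss = head_ce_loss(torch.randn(2, 1024), head, gold_index=1)
```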

COPA and CommonsenseQA results

On both COPA and CommonsenseQA, our method outperforms both the head CE and head margin methods in terms of mean accuracy and max/best accuracy (see Figure 2 and Figure 3). Moreover, we find that a progressive decrease of the training dataset size results in a progressive increase of the best-accuracy gap between our method and the other ones. This confirms our intuition that our method is the most advantageous when little training data is available.

For example, when using 1% of the training data of CommonsenseQA, our method achieves an accuracy of 56.7% on the Test* set (vs. 40.2% for the head CE approach). Using the whole training data, our approach still outperforms the other methods, but by a lower margin (76.4% accuracy versus 75.4% for head CE). In addition, when evaluated on the CommonsenseQA private Test set, our approach gets 71.6% accuracy, which is close to RoBERTa-LARGE cross-entropy (Liu et al., 2019) under an important hyper-parameter grid search⁴ (72.1% accuracy).

When using 100% of the COPA training set (400 train samples), our method outperforms the head CE setting by 5 points and the head margin setting by 3 points, achieving an accuracy of 92.4% on the Test set. This result allows our approach to reach the second place on the SuperGLUE leaderboard⁵ (Wang et al., 2019), between RoBERTa-LARGE (Liu et al., 2019) and the T5 model composed of 11 billion parameters (Raffel et al., 2019) (respectively 90.6% and 94.8% accuracy on the Test set).

We also notice that our method provides a much more stable training relative to the random seed, as shown by the box plots in Figures 2a and 3a. When training on the full COPA dataset, our method exhibits a ×10 standard deviation reduction on the test accuracy compared to the head CE setting (1.35% versus 12.8%). Our intuition is that the improved stability is due to the better reuse of the pre-trained model priors and the absence of new randomly initialized weights. This is an important result towards easier experiment comparisons, as fine-tuning BERT-like architectures is known to be unstable across random restarts, as shown in (Phang et al., 2018).

6 Conclusions

In this work, we presented a new method for plausibility ranking tasks, specifically targeting commonsense ranking problems. We define a scoring function that leverages the MLM head of large pre-trained bidirectional transformer models. We establish strong results in a zero-shot setting on four commonsense reasoning datasets, comparable to supervised approaches. We then fine-tune such a model using a margin-based loss on the proposed scoring function, and provide a comparative study with state-of-the-art randomly initialized head methods. Our study demonstrates that the direct use of the MLM head over a custom head yields an increasingly superior performance gain as the training data size decreases. The proposed approach outperforms state-of-the-art training methods in terms of both test accuracy and training stability.

Future work includes applying such a scoring method to broader classification tasks like Natural Language Inference and Sentiment Analysis. We also think that our token-level scoring method could be used during the self-supervised pre-training phase to extend traditional next sentence prediction and sequence ordering tasks, bringing more commonsense knowledge into the model.

⁴ https://github.com/pytorch/fairseq/tree/master/examples/roberta/commonsense_qa
⁵ https://super.gluebenchmark.com/leaderboard

References

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Andrew Gordon, Zornitsa Kozareva, and Melissa Roemmele. 2012. SemEval-2012 task 7: Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 394–398, Montréal, Canada. Association for Computational Linguistics.

Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. ArXiv, abs/1904.09751.

Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz. 2019. A surprisingly robust trick for the Winograd schema challenge. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4837–4842, Florence, Italy. Association for Computational Linguistics.

Hector J. Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning, KR'12, pages 552–561. AAAI Press.

Zhongyang Li, Tongfei Chen, and Benjamin Van Durme. 2019. Learning to rank for plausible plausibility. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4818–4823, Florence, Italy. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke S. Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692.

Timothy Niven and Hung-Yu Kao. 2019. Probing neural network comprehension of natural language arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4658–4664, Florence, Italy. Association for Computational Linguistics.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.

Jason Phang, Thibault Févry, and Samuel R. Bowman. 2018. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088.

Nina Poerner, Ulli Waltinger, and Hinrich Schütze. 2019. BERT is not a knowledge base (yet): Factual knowledge vs. name-based reasoning in unsupervised QA. arXiv preprint arXiv:1911.03681.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5926–5936, Long Beach, California, USA. PMLR.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota. Association for Computational Linguistics.

Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. ArXiv, abs/1806.02847.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 3261–3275. Curran Associates, Inc.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 93–104, Brussels, Belgium. Association for Computational Linguistics.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy. Association for Computational Linguistics.
