Visualizing and Understanding the Effectiveness of BERT
Yaru Hao†∗, Li Dong‡, Furu Wei‡, Ke Xu†
†Beihang University  ‡Microsoft Research
{haoyaru@, [email protected]}  {lidong1, [email protected]}
∗ Contribution during internship at Microsoft Research.

Abstract

Language model pre-training, such as BERT, has achieved remarkable results in many NLP tasks. However, it is unclear why the pre-training-then-fine-tuning paradigm can improve performance and generalization capability across different tasks. In this paper, we propose to visualize loss landscapes and optimization trajectories of fine-tuning BERT on specific datasets. First, we find that pre-training reaches a good initial point across downstream tasks, which leads to wider optima and easier optimization compared with training from scratch. We also demonstrate that the fine-tuning procedure is robust to overfitting, even though BERT is highly over-parameterized for downstream tasks. Second, the visualization results indicate that fine-tuning BERT tends to generalize better because of the flat and wide optima, and the consistency between the training loss surface and the generalization error surface. Third, the lower layers of BERT are more invariant during fine-tuning, which suggests that the layers that are close to the input learn more transferable representations of language.

1 Introduction

Language model pre-training has achieved strong performance in many NLP tasks (Peters et al., 2018; Howard and Ruder, 2018a; Radford et al., 2018; Devlin et al., 2018; Baevski et al., 2019; Dong et al., 2019). A neural encoder is trained on a large text corpus by using language modeling objectives. Then the pre-trained model is either used to extract vector representations for the input or fine-tuned on specific datasets.

Recent work (Tenney et al., 2019b; Liu et al., 2019a; Goldberg, 2019; Tenney et al., 2019a) has shown that the pre-trained models can encode syntactic and semantic information of language. However, it is unclear why pre-training is effective on downstream tasks in terms of both trainability and generalization capability. In this work, we take BERT (Devlin et al., 2018) as an example to understand the effectiveness of pre-training. We visualize the loss landscapes and the optimization procedure of fine-tuning on specific datasets in three ways. First, we compute the one-dimensional (1D) loss curve, so that we can inspect the difference between fine-tuning BERT and training from scratch. Second, we visualize the two-dimensional (2D) loss surface, which provides more information about loss landscapes than 1D curves. Third, we project the high-dimensional optimization trajectory of fine-tuning onto the obtained 2D loss surface, which demonstrates the learning properties in an intuitive way.

The main findings are summarized as follows. First, the visualization results indicate that BERT pre-training reaches a good initial point across downstream tasks, which leads to wider optima on the 2D loss landscape compared with random initialization. Moreover, the visualization of optimization trajectories shows that pre-training results in easier optimization and faster convergence. We also demonstrate that the fine-tuning procedure is robust to overfitting. Second, the loss landscapes of fine-tuning partially explain the good generalization capability of BERT. Specifically, pre-training obtains flatter and wider optima, which indicates that the pre-trained model tends to generalize better on unseen data (Chaudhari et al., 2017; Li et al., 2018; Izmailov et al., 2018). Additionally, we find that the training loss surface correlates well with the generalization error. Third, we demonstrate that the lower (i.e., close to the input) layers of BERT are more invariant across tasks than the higher layers, which suggests that the lower layers learn transferable representations of language. We verify this point by visualizing the loss landscape with respect to different groups of layers.

2 Background: BERT

We use BERT (Bidirectional Encoder Representations from Transformers; Devlin et al. 2018) as an example of pre-trained language models in our experiments. BERT is pre-trained on a large corpus by using the masked language modeling and next-sentence prediction objectives. Then we can add task-specific layers to the BERT model and fine-tune all the parameters according to the downstream tasks.

BERT employs a Transformer (Vaswani et al., 2017) network to encode contextual information, which contains multi-layer self-attention blocks. Given the embeddings {x_i}_{i=1}^{|x|} of the input text, we concatenate them into H^0 = [x_1, ..., x_{|x|}]. Then an L-layer Transformer encodes the input: H^l = Transformer_block_l(H^{l-1}), where l = 1, ..., L, and H^L = [h^L_1, ..., h^L_{|x|}]. We use the hidden vector h^L_i as the contextualized representation of the input token x_i. For more implementation details, we refer readers to Vaswani et al. (2017).
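For readers who want to inspect these contextualized representations directly, the following minimal sketch extracts H^L from a pre-trained BERT model. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, neither of which is prescribed by this excerpt.

import torch
from transformers import AutoModel, AutoTokenizer

# Load a BERT encoder pre-trained with the masked LM and next-sentence objectives.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("Language model pre-training is effective.", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.last_hidden_state corresponds to H^L = [h^L_1, ..., h^L_|x|]:
# one contextualized vector per input token.
hidden_states = outputs.last_hidden_state  # shape: (1, |x|, hidden_size)
print(hidden_states.shape)

During fine-tuning, a task-specific layer would be added on top of these hidden vectors and all parameters updated on the downstream dataset, as described above.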
3 Methodology

We employ three visualization methods to understand why fine-tuning the pre-trained BERT model can achieve better performance on downstream tasks compared with training from scratch. We plot both one-dimensional and two-dimensional loss landscapes of BERT on the specific datasets. Besides, we project the optimization trajectories of the fine-tuning procedure onto the loss surface. The visualization algorithms can also be used for models that are trained from random initialization, so that we can compare the differences between the two learning paradigms.

3.1 One-dimensional Loss Curve

Let θ0 denote the initialized parameters. For fine-tuning BERT, θ0 represents the pre-trained parameters. For training from scratch, θ0 represents the randomly initialized parameters. After fine-tuning, the model parameters are updated to θ1. The one-dimensional (1D) loss curve aims to quantify the loss values along the optimization direction (i.e., from θ0 to θ1).

The loss curve is plotted by linear interpolation between θ0 and θ1 (Goodfellow and Vinyals, 2015). The curve function f(α) is defined as:

    f(α) = J(θ0 + αδ1)    (1)

where α is a scalar parameter, δ1 = θ1 − θ0 is the optimization direction, and J(θ) is the loss function under the model parameters θ. In our experiments, we set the range of α to [−4, 4] and sample 40 points for each axis. Note that we only consider the parameters of BERT in θ0 and θ1, so δ1 only indicates the updates of the original BERT parameters. The effect of the added task-specific layers is eliminated by keeping them fixed to their learned values.
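As a concrete illustration of Equation (1), the sketch below linearly interpolates the BERT encoder parameters between θ0 and θ1 and evaluates the loss at each sampled α. It is a minimal PyTorch sketch under stated assumptions: the evaluate_loss helper and the dictionary representation of θ0 and θ1 are illustrative, not the paper's implementation.

import copy
import numpy as np
import torch

def loss_curve_1d(model, theta0, theta1, data_loader, evaluate_loss,
                  alpha_range=(-4.0, 4.0), num_points=40):
    """Compute f(alpha) = J(theta0 + alpha * delta1) with delta1 = theta1 - theta0.

    theta0, theta1: dicts mapping parameter names to tensors, restricted to the
    BERT encoder; parameters not listed (the task-specific head) are left at the
    values already stored in the model.
    evaluate_loss: callable returning the loss J of a model on data_loader.
    """
    alphas = np.linspace(alpha_range[0], alpha_range[1], num_points)
    losses = []
    probe = copy.deepcopy(model)  # scratch copy so the original model stays intact
    for alpha in alphas:
        state = {}
        for name in theta0:
            delta1 = theta1[name] - theta0[name]   # optimization direction
            state[name] = theta0[name] + alpha * delta1
        # strict=False: keys absent from `state` (task-specific layers) keep
        # the learned values already present in `probe`.
        probe.load_state_dict(state, strict=False)
        with torch.no_grad():
            losses.append(evaluate_loss(probe, data_loader))
    return alphas, losses

The same routine can be applied to a randomly initialized model by passing its initial parameters as θ0, which makes the comparison between the two learning paradigms straightforward.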
3.2 Two-dimensional Loss Surface

The one-dimensional loss curve can be extended to a two-dimensional (2D) loss surface (Li et al., 2018). Similar to Equation (1), we need to define two directions (δ1 and δ2) as axes to plot the loss surface:

    f(α, β) = J(θ0 + αδ1 + βδ2)    (2)

where α and β are scalar values, J(·) is the loss function, and θ0 represents the initialized parameters. Similar to Section 3.1, we are only interested in the parameter space of the BERT encoder, without taking the task-specific layers into consideration. One of the axes is the optimization direction δ1 = θ1 − θ0 on the target dataset, which is defined in the same way as in Equation (1). We compute the other axis direction via δ2 = θ2 − θ0, where θ2 represents the parameters fine-tuned on another dataset. So the other axis is the optimization direction of fine-tuning on another dataset. Even though the other dataset is randomly chosen, experimental results confirm that the optimization directions δ1 and δ2 are divergent and orthogonal to each other because of the high-dimensional parameter space.

The direction vectors δ1 and δ2 are projected onto a two-dimensional plane. It is beneficial to ensure the scale equivalence of the two axes for visualization purposes. Similar to the filter normalization approach introduced in Li et al. (2018), we address this issue by normalizing the two direction vectors to the same norm. We re-scale δ2 to (‖δ1‖/‖δ2‖)·δ2, where ‖·‖ denotes the Euclidean norm. We set the range of both α and β to [−4, 4] and sample 40 points for each axis.

3.3 Optimization Trajectory

Our goal is to project the optimization trajectory of the fine-tuning procedure onto the 2D loss surface obtained in Section 3.2. Let {(d^α_i, d^β_i)}_{i=1}^T denote the projected optimization trajectory, where (d^α_i, d^β_i) is a projected point in the loss surface, and i = 1, ..., T represents the i-th epoch of fine-tuning. As shown in Equation (2), we have already obtained the optimization direction δ1 = θ1 − θ0 on the target dataset. We can compute the deviation degrees between the optimization direction and the trajectory to visualize the projection results. Let θ^i denote the BERT parameters at the i-th epoch, and δ^i = θ^i − θ0 denote the optimization direction at the i-th epoch.

… 1e-5. For MNLI and SST-2, the batch size is 64, and the learning rate is 3e-5.

For the setting of training from scratch, we use the same network architecture as BERT and randomly initialize the model parameters. Most hyper-parameters are kept the same. The number of training epochs is larger than for fine-tuning BERT, because training from scratch requires more epochs to converge. The number of epochs is set to 8 for SST-2, and 16 for the other datasets,
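Returning to the visualization procedure, the sketch below makes Equation (2) and the trajectory projection of Sections 3.2 and 3.3 concrete. It is a minimal PyTorch sketch: the flatten helper, the loss_at callback, and the dot-product projection onto the (δ1, δ2) plane are illustrative assumptions, since the paper's exact projection formula is not included in this excerpt.

import numpy as np
import torch

def flatten(theta):
    """Concatenate a dict of parameter tensors into a single 1-D vector."""
    return torch.cat([theta[name].reshape(-1) for name in sorted(theta)])

def loss_surface_2d(theta0, theta1, theta2, loss_at,
                    value_range=(-4.0, 4.0), num_points=40):
    """f(alpha, beta) = J(theta0 + alpha * delta1 + beta * delta2)  (Equation 2).

    delta1: fine-tuning direction on the target dataset.
    delta2: fine-tuning direction on another dataset, re-scaled to ||delta1||.
    loss_at: callable that loads a flat parameter vector into the BERT encoder
             (task-specific layers kept fixed) and returns the training loss.
    """
    t0, t1, t2 = flatten(theta0), flatten(theta1), flatten(theta2)
    delta1 = t1 - t0
    delta2 = t2 - t0
    delta2 = delta2 * (delta1.norm() / delta2.norm())  # scale equivalence of the axes

    alphas = np.linspace(value_range[0], value_range[1], num_points)
    betas = np.linspace(value_range[0], value_range[1], num_points)
    surface = np.empty((num_points, num_points))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            surface[i, j] = float(loss_at(t0 + a * delta1 + b * delta2))
    return alphas, betas, surface, delta1, delta2

def project_trajectory(epoch_params, theta0_flat, delta1, delta2):
    """Project per-epoch parameters onto the (delta1, delta2) plane.

    One reasonable projection (an assumption, not the paper's stated formula):
    the coordinates are normalized dot products of delta_i = theta_i - theta0
    with the two axis directions.
    """
    points = []
    for theta_i in epoch_params:
        d = flatten(theta_i) - theta0_flat
        d_alpha = torch.dot(d, delta1) / delta1.norm() ** 2
        d_beta = torch.dot(d, delta2) / delta2.norm() ** 2
        points.append((d_alpha.item(), d_beta.item()))
    return points

Plotting the projected points on top of the grid returned by loss_surface_2d gives the trajectory visualization described in Section 3.3; the same code applies to the model trained from scratch by swapping in its own θ0 and per-epoch checkpoints.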