IsoBN: Fine-Tuning BERT with Isotropic Batch Normalization

Wenxuan Zhou, Bill Yuchen Lin, Xiang Ren
Department of Computer Science, University of Southern California, Los Angeles, CA
{zhouwenx, yuchen.lin, xiangren}@usc.edu

arXiv:2005.02178v2 [cs.CL] 4 Feb 2021

Abstract

Fine-tuning pre-trained language models (PTLMs), such as BERT and its better variant RoBERTa, has been a common practice for advancing performance in natural language understanding (NLU) tasks. Recent advances in representation learning show that isotropic (i.e., unit-variance and uncorrelated) embeddings can significantly improve performance on downstream tasks with faster convergence and better generalization. The isotropy of the pre-trained embeddings in PTLMs, however, is relatively under-explored. In this paper, we analyze the isotropy of the pre-trained [CLS] embeddings of PTLMs with straightforward visualization, and point out two major issues: high variance in their standard deviation, and high correlation between different dimensions. We also propose a new network regularization method, isotropic batch normalization (IsoBN), to address these issues, towards learning more isotropic representations in fine-tuning by dynamically penalizing dominating principal components. This simple yet effective fine-tuning method yields about 1.0 absolute increment on the average of seven NLU tasks.

Figure 1: Illustration of the isotropic batch normalization (IsoBN). The [CLS] embedding is normalized by standard deviation and pairwise correlation coefficient to get a more isotropic representation.
[Figure 1 diagram labels: PTLM, Classifier.]

1 Introduction

Pre-trained language models (PTLMs), such as BERT (Devlin et al. 2019) and RoBERTa (Liu et al. 2019b), have revolutionized the area of natural language understanding (NLU). Fine-tuning PTLMs has advanced performance on many benchmark NLU datasets such as GLUE (Wang et al. 2018a). The most common fine-tuning method is to continue training pre-trained model parameters together with a few additional task-specific layers. The PTLMs and task-specific layers are usually connected by the embeddings of [CLS] tokens, which are regarded as sentence representations.

Recent works on text representation (Arora, Liang, and Ma 2016; Mu, Bhat, and Viswanath 2018; Gao et al. 2019; Wang et al. 2020) have shown that regularizing word embeddings to be more isotropic (i.e., rotationally invariant) can significantly improve their performance on downstream tasks. An ideally isotropic embedding space has two major merits: a) all dimensions have the same variance and b) all dimensions are uncorrelated with each other. These findings align with conventional feature normalization techniques (Cogswell et al. 2015; Ioffe and Szegedy 2015; Huang et al. 2018), which aim to transform input features into normalized¹, uncorrelated representations for faster convergence and better generalization ability.

It remains an open question, however, how isotropic the representations of PTLMs are. In particular, we want to understand the isotropy of pre-trained [CLS] embeddings in PTLMs, and how we can improve it towards better fine-tuning for downstream tasks. In this paper, we first argue why we want more isotropic embeddings for the [CLS] tokens (Section 2). Our analysis reveals that, due to the lack of isotropy, the dominating principal components largely prevent the fine-tuning process from using knowledge in other components. Then, we analyze the isotropy of the pre-trained [CLS] embeddings. There are two essential aspects of an isotropic embedding space: unit-variance and uncorrelatedness. Thus, we start our analysis by visualizing the standard deviation and Pearson correlation coefficient of pre-trained [CLS] embeddings in BERT and RoBERTa on several NLU datasets.

Our visualization and quantitative analysis in Section 3 finds that: 1) the dimensions of the [CLS] embeddings have very different variances (Sec. 3.1); 2) the [CLS] embeddings form a few large clusters of dimensions that are highly correlated with each other (Sec. 3.2). Both findings indicate that pre-trained contextualized word embeddings are far from being isotropic, i.e., normalized and uncorrelated. Therefore, this undesired prior bias from PTLMs may result in sub-optimal performance in fine-tuning for target tasks.

Given that pre-trained [CLS] embeddings are very anisotropic, a natural research question is then: how can we regularize the fine-tuning process towards more isotropic embeddings? There are two common methods for improving the isotropy of feature representations: whitening transformation and batch normalization (Ioffe and Szegedy 2015). However, neither is practically suitable in the scenario of fine-tuning PTLMs. Whitening transformation requires calculating the inverse of the covariance matrix, which is ill-conditioned for PTLMs' embeddings. Calculating the inverse is thus numerically unstable, computationally expensive, and incompatible with half-precision training. Batch normalization is proposed to alleviate the inverse-computation issue by assuming that the covariance matrix is diagonal, which in turn completely ignores the influence of correlation between dimensions.

Motivated by the research question and limitations of existing works, we propose a new network regularization method, isotropic batch normalization (IsoBN), in Section 4. As shown in Figure 1, the proposed method is based on our observation that the embedding dimensions can be seen as several groups of highly-correlated dimensions. Our intuition is thus to assume that the absolute correlation coefficient matrix is a block-diagonal binary matrix, instead of only a diagonal matrix. Dimensions of the same group have an absolute correlation coefficient of 1 (duplicates of each other), and dimensions in different groups have a coefficient of 0 (uncorrelated).

This method greatly reduces the computation effort in calculating the inverse, and better models the characteristics of the pre-trained [CLS] embeddings. Our experiments (Sec. 5) show that IsoBN indeed improves both BERT and RoBERTa in fine-tuning, yielding about 1.0 absolute increment on average over 7 GLUE benchmark tasks. We also empirically analyze the isotropy increment brought by IsoBN via explained variance, which clearly shows that IsoBN produces much more isotropic embeddings than the conventional batch normalization method.

To the best of our knowledge, this work is the first to study the isotropy of the pre-trained [CLS] embeddings. We believe our findings and the proposed IsoBN method will inspire interesting future research directions in improving pre-training language models as well as better fine-tuning towards more isotropy of PTLMs.

¹ We use 'normalized' to refer to unit-variance in this paper.

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

2 Why isotropic embeddings?

We formally show our analysis on the principal components of the [CLS] embeddings. Our findings reveal that with random weight initialization, the first few principal components usually account for the majority of the contribution to the prediction results.

Background knowledge. For text classification, the input text x is first encoded by the PTLM to a feature h (i.e., its [CLS] embedding), and then classified by a randomly initialized softmax layer:

$$h = \mathrm{PTLM}(x); \qquad p_i = \frac{\exp(W_i^T h)}{\sum_{j=1}^{c} \exp(W_j^T h)},$$

where $W \in \mathbb{R}^{d \times c}$ is a randomly initialized learnable parameter. It learns to map the underlying features extracted by PTLMs into target classes for input examples.

Previous work (Dodge et al. 2020) has shown that the initialized weight $W_{\text{init}}$ of the classifier has a large impact on the model performance. We further find that the final converged weights remain nearly the same as the initialization after fine-tuning. We visualize this surprising phenomenon in Figure 2. We first project both $W_{\text{init}}$ and $W$ onto the subspace spanned by the top 10 eigenvectors of $\mathrm{Cov}(h)$ to remove the unimportant components, then use two similarity metrics (cosine similarity and L2 distance) to measure the difference between the initialized weight $W_{\text{init}}$ and the fine-tuned weight $W$. We observe that the cosine similarity between $W_{\text{init}}$ and $W$ is extremely close to 1, and their L2 distance is close to 0. This suggests that the weight $W$ of the classifier in the fine-tuned model is almost determined at initialization. For example, there is a 0.9997 cosine similarity between initialized and fine-tuned weights on CoLA with RoBERTa.

Figure 2: Average cosine similarity and L2 distance of fine-tuned weight $W$ to the initialized weight $W_{\text{init}}$ during the entire training process. Both measures suggest that the change of weight is very subtle.
[Figure 2 panels: $\cos(W_{\text{init}}, W)$ and $\|W_{\text{init}} - W\|_2 / \|W_{\text{init}}\|_2$ versus training step, for CoLA, MRPC, and RTE.]

Dominating Principal Components. Knowing that the weight of the classifier is nearly fixed, we infer that the classifier may not capture the discriminative information for classification during fine-tuning. We measure the informativeness of each principal component by comparing the variance of logits produced by it (between the fine-tuned classifier and the optimal classifier). Specifically, we fit a logistic regression model on the entire training data using the scikit-learn (Pedregosa et al. 2011) framework to get the optimal classifier. The variance $\mathrm{Var}_i$ of logits by the $i$-th principal component $(w_i, v_i)$ is calculated by $\mathrm{Var}_i = w_i \cdot (W^T v_i)^2$, where $w_i$ and $v_i$ are the $i$-th eigenvalue and eigenvector. Intuitively, a decent classifier should maximize the variance along informative principal components and minimize it along irrelevant ones. We show the average proportion of variance in Figure 3.

As shown in our experiments (Sec. 5), the top principal components are constantly exaggerated by the fine-tuned weights, especially at the beginning of the training process. The first principal component accounts for over 90% of the variance of logits throughout training, which hinders learning from other useful components. This motivates us to penalize the top principal components dynamically during fine-tuning, to avoid losing knowledge of PTLMs due to dominating components.

Figure 3: The average percentage of variance of logit produced by the top 1, 5, and 30 principal components. The top principal components are consistently exaggerated.
[Figure 3 panels: rows RTE, MRPC, CoLA; columns Top 1, Top 5, Top 30; curves Var_model and Var_opt versus training step.]

3 How isotropic are [CLS] embeddings?

The [CLS] embeddings of PTLMs (e.g., BERT and RoBERTa), regarded as sentence representations, are directly used for fine-tuning towards a wide range of downstream tasks. Given their impact on fine-tuning, we want to understand their isotropy. As unit-variance and uncorrelatedness are two essential aspects of an isotropic space, we start our investigation by analyzing the standard deviation and Pearson correlation coefficient.

Specifically, we take the corpora of four popular NLU tasks (MRPC, RTE, CoLA, and STS-b) from the GLUE benchmark datasets (Wang et al. 2018b) and then analyze their pre-trained [CLS] embeddings in terms of standard deviation (Section 3.1) and correlation (Section 3.2) respectively.

3.1 Analysis of Standard Deviation

To visualize the standard deviation of embeddings, we feed the input sentences of the whole training dataset to the PTLMs, then calculate the standard deviation of each dimension of their [CLS] embeddings, and finally obtain the distribution.

The standard deviations of a nearly isotropic embedding space should concentrate on a very small range of values. Simply put, an ideally isotropic embedding space should have a small variance in the distribution of its standard deviations, i.e., all dimensions of [CLS] embeddings should have almost the same standard deviation.

As shown in Figure 4, neither BERT nor RoBERTa has this desired property for pre-trained [CLS] embeddings. The standard deviations of the embeddings vary over a very wide range of values (e.g., $[10^{-10}, 1]$ in BERT for MRPC). Interestingly, RoBERTa is evidently better than BERT from this perspective (e.g., usually ranging in [0.01, 1]). However, the [CLS] embeddings of RoBERTa are still far from being isotropic, as there is no significantly dominant central standard deviation value.

Figure 4: The distribution of the standard deviation (std) of pre-trained [CLS] embeddings. We show the results of BERT-base and RoBERTa-Large on 4 NLU datasets. Note that a (nearly) isotropic embedding space should have (almost) zero variance on the std (i.e., 100% of dimensions have the same std).
[Figure 4 panels: BERT: MRPC, BERT: RTE, BERT: CoLA, RoBERTa: MRPC, RoBERTa: RTE, RoBERTa: CoLA; x-axis: Std of the Coordinates of [CLS] Embedding (log scale); y-axis: % of Dimensions.]

3.2 Analysis of Correlation Coefficient

Correlation between different dimensions of [CLS] embeddings is an essential aspect of isotropy. Embeddings with low correlation between dimensions usually show better generalization on downstream applications (Cogswell et al. 2015). It is, however, largely ignored by many neural network regularization methods, such as batch normalization (Ioffe and Szegedy 2015).

In order to better visualize the Pearson correlation coefficient of [CLS] embeddings, we cluster the dimensions by their pairwise coefficient, and then re-arrange the dimension index, such that highly correlated dimensions are located near each other. The absolute values of the correlations are shown in Figure 5, where darker cells mean higher correlation.

We can see that both BERT and RoBERTa usually have very high correlations between different dimensions (i.e., most cells are in dark blue), although the situation is less severe in a few cases such as BERT on CoLA and RoBERTa on RTE. We find that BERT's embeddings have several large clusters of correlated features, while RoBERTa tends to have a single extremely large cluster.

In either case, such high correlation between embedding dimensions is harmful to subsequent fine-tuning. Recall that the [CLS] embeddings are usually connected to a linear classifier which is uniformly initialized. At the beginning of the fine-tuning process, the classifier will be biased towards these features since they gain more importance in back-propagation. This undesired prior prevents models from exploiting other potentially valuable features, and thus requires more training data or epochs to converge and generalize in downstream tasks.
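The two statistics examined in Sections 3.1 and 3.2 can be computed in a few lines. The following is a minimal sketch rather than the authors' analysis code: it uses NumPy, the function name `std_and_abs_correlation` and the `eps` guard are assumptions introduced here, and the synthetic matrix only stands in for real [CLS] embeddings. The clustering and re-ordering of dimensions used for Figure 5 is not reproduced.

```python
import numpy as np

def std_and_abs_correlation(h, eps=1e-12):
    """h: (N, d) array of [CLS] embeddings, one row per sentence.

    Returns the per-dimension standard deviation, shape (d,), and the
    absolute Pearson correlation matrix, shape (d, d).
    """
    h = np.asarray(h, dtype=np.float64)
    centered = h - h.mean(axis=0, keepdims=True)
    std = centered.std(axis=0)
    cov = centered.T @ centered / h.shape[0]
    # eps (hypothetical) guards against division by zero for constant dimensions.
    corr = cov / (np.outer(std, std) + eps)
    return std, np.abs(corr)

# Synthetic stand-in for real embeddings (an assumption for illustration):
# dimension 1 is a scaled copy of dimension 0, dimension 2 is independent
# and has a much smaller scale.
rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 2))
h = np.stack([base[:, 0], 3.0 * base[:, 0], 0.01 * base[:, 1]], axis=1)
std, abs_corr = std_and_abs_correlation(h)
```

On this toy input, `std` spreads over two orders of magnitude and `abs_corr[0, 1]` is close to 1, which mirrors in miniature the two symptoms the section describes: widely varying standard deviations and groups of near-duplicate dimensions.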

Figure 6: Illustration of the assumption by batch normalization and our IsoBN with the reference of real correlation. IsoBN assumes that the absolute correlation matrix is block-diagonal while batch normalization completely ignores the correlation.
[Figure 6 panels: Real Correlation, BN Assumption, IsoBN Assumption; color scale from 0.0 to 1.0.]

Figure 5: Absolute Pearson correlation coefficients between dimensions of pre-trained [CLS] embeddings. We show the results of BERT-base-cased and RoBERTa-Large on four NLU datasets. Note that the dimension indexes of the matrices are re-arranged by the clustering results. Ideally, an isotropic embedding space should be 1 (darkest blue) on the diagonal and 0 (white) on other cells. A dark block in a matrix means a cluster of features highly correlated with each other.

We argue that these two findings together indicate that pre-trained language models are far from being isotropic (i.e., normalized and uncorrelated), and thus the undesired prior bias may result in sub-optimal model performance for fine-tuning.

4 Approach

Based on our analysis in Section 3, we propose a new regularization method, isotropic batch normalization, towards learning more isotropic representations of the [CLS] tokens and thus better fine-tuning of PTLMs. We first introduce some background knowledge about whitening and conventional batch normalization methods (Section 4.1), then formally introduce the proposed IsoBN (Section 4.2), and finally show the implementation details.

4.1 Whitening and Batch Normalization

To improve the isotropy of feature representations, there are two widely-used methods: 1) whitening transformation and 2) batch normalization (Ioffe and Szegedy 2015).

Whitening transformation changes the input vector into a white noise vector, and can be defined as a transformation function as follows:

$$\hat{h} = \Sigma^{-\frac{1}{2}} (h - \mu \cdot 1^T), \tag{1}$$

where $\Sigma \in \mathbb{R}^{d \times d}$ is the covariance matrix of the input $h \in \mathbb{R}^{d \times N}$, and $\mu \in \mathbb{R}^d$ is the mean of $h$. Thus, the transformation is a mapping $\mathbb{R}^{d \times N} \to \mathbb{R}^{d \times N}$. This transformation produces a perfectly isotropic embedding space, where the dimensions are uncorrelated and have the same variance. It can be applied in either feature pre-processing (Rosipal et al. 2001) or neural network training (Huang et al. 2018). A similar method is to remove a few top principal components from the embedding space (Arora, Liang, and Ma 2016; Mu, Bhat, and Viswanath 2018). However, these methods are hard to apply in fine-tuning PTLMs, as they require calculating the inverse of the covariance matrix. As shown in Section 3.2, the embeddings in PTLMs contain groups of highly-correlated dimensions. Therefore, the covariance matrices are ill-conditioned, and calculating the inverse is numerically unstable. It is also computationally expensive and incompatible with half-precision training.

Batch normalization (BN) aims to simplify the inverse-computation problem by assuming that the covariance matrix is diagonal, so that the whitening function becomes:

$$\hat{h} = \Lambda^{-1} (h - \mu \cdot 1^T), \tag{2}$$

where $\Lambda = \mathrm{diag}(\sigma_1, \ldots, \sigma_d)$ is a diagonal matrix consisting of the standard deviation of each input dimension. Batch normalization greatly improves the stability and model performance in training deep neural networks.

However, it completely ignores the influence of correlation in the embeddings, and is thus not suitable for the [CLS] embeddings of interest, where high correlation is a critical issue that needs to be addressed. We seek to design a novel normalization method specifically for fine-tuning PTLMs, which can be efficiently computed yet still moves representations towards isotropy.

4.2 Isotropic Batch Normalization

Recalling Figure 5, we observe that on most datasets the correlation matrix of pre-trained embeddings is nearly block-diagonal². That is, the embedding dimensions form several clusters of highly-correlated dimensions. Dimensions within the same cluster have an absolute correlation coefficient of nearly 1, while dimensions from different clusters are almost uncorrelated. Inspired by this, we propose an enhanced simplification of the covariance matrix.

We assume that the absolute correlation coefficient matrix is a block-diagonal binary matrix. That is, the embedding dimensions can be clustered into $m$ groups $G_1, \ldots, G_m$, where dimensions of the same group have an absolute correlation coefficient of 1 (duplicates of each other), and dimensions in different groups have a correlation coefficient of 0 (uncorrelated). This assumption is illustrated in Figure 6 as a conceptual comparison. Compared with conventional batch normalization, our assumption takes correlations into account and is thus a more accurate approximation of the realistic correlation coefficient matrices. Thereby, instead of whitening the correlation matrix, we want the influence of each group of dimensions to be similar in the fine-tuning process.

We first normalize each dimension to unit-variance, similar to batch normalization, for convenience of further derivation. This makes the dimensions in the same group exactly the same as each other. Then, dimension $i \in G_{g(i)}$ is repeated in the embeddings $|G_{g(i)}|$ times. Therefore, the normalization transformation becomes:

$$\hat{h}^{(i)} = \frac{1}{\sigma_i \cdot |G_{g(i)}|} (h^{(i)} - \mu_i \cdot 1^T). \tag{3}$$

The dimensions of embeddings, however, are not naturally separable into hard group divisions. Thus, we create a soft version of computing the size of a feature-group $|G_{g(i)}|$ via the correlation coefficient matrix $\rho$:

$$|G_{g(i)}| \longrightarrow \gamma_i = \sum_{j=1}^{d} \rho_{ij}^2. \tag{4}$$

This equation produces the same result as $|G_{g(i)}|$ when our assumption holds for the real correlation matrix. Finally, our transformation can be written as:

$$\hat{h}^{(i)} = \frac{1}{\sigma_i \cdot \gamma_i} (h^{(i)} - \mu_i \cdot 1^T). \tag{5}$$

The major difference between our method and conventional batch normalization is the introduction of the $\gamma$ term, as a way to explicitly consider correlation between feature dimensions. As shown in our experiments (Section 5), $\gamma$ can greatly improve the isotropy of embeddings. We name our proposed normalization method isotropic batch normalization (IsoBN), as it moves towards more isotropic representations during fine-tuning.

IsoBN is applied right before the final classifier. In experiments, we find that a modified version of IsoBN achieves better performance. The mean $\mu$ is highly unstable during training and does not affect the principal components of representations, so we remove it in our implementation. The scaling term $(\sigma \cdot \gamma)^{-1}$ has a small magnitude and damages the optimization of the training loss, so we re-normalize it to keep the sum of variance in the transformed embeddings the same as in the original ones. We further introduce a hyper-parameter $\beta$, which controls the normalization strength, since it is shown in Section 2 that the dominating eigenvalue problem varies across datasets. The modified IsoBN is written as:

$$\theta_i = (\sigma_i \cdot \gamma_i + \epsilon)^{-\beta}, \tag{6}$$

$$\bar{\theta} = \frac{\sum_{i=1}^{d} \sigma_i^2}{\sum_{i=1}^{d} \sigma_i^2 \theta_i^2} \cdot \theta, \tag{7}$$

$$\hat{h} = \bar{\theta} \odot h. \tag{8}$$

In IsoBN, the calculation of the scaling factor relies on the covariance and standard deviation statistics of the embedding. We keep two moving average caches and update them in training because the estimated statistics from a single batch are not accurate. The whole algorithm of IsoBN is shown in Algorithm 1.

² A block diagonal matrix is a block matrix that is a square matrix such that the main-diagonal blocks are square matrices and all off-diagonal blocks are zero matrices.

Algorithm 1: IsoBN Transformation
  Input: embedding $h$ over a mini-batch $B = \{h_{1 \ldots m}\}$; moving covariance $\Sigma$; moving standard deviation $\sigma$; momentum $\alpha$.
  Output: transformed embedding $\hat{h}$; updated $\Sigma$, $\sigma$.
  if training then
    $\mu_B = \frac{1}{m} \sum_{i=1}^{m} h_i$
    $\sigma_B = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (h_i - \mu_B)^2}$
    $\Sigma_B = \frac{1}{m} (h - \mu_B)^T (h - \mu_B)$
    $\sigma = \sigma + \alpha (\sigma_B - \sigma)$
    $\Sigma = \Sigma + \alpha (\Sigma_B - \Sigma)$
  $\rho = \Sigma / (\sigma \sigma^T)$
  Compute $\gamma$ by Eq. 4
  Compute scaling factor $\theta$ by Eq. 6 and Eq. 7
  $\hat{h} = \theta \odot h$

5 Evaluation

In this section, we first present the setup of our experiments (i.e., the datasets, frameworks, and hyper-parameters), then discuss the empirical results, and finally evaluate the isotropy gain through the lens of explained variance.

5.1 Experiment Setup

Our implementation of PTLMs is based on HuggingFace Transformer (Wolf et al. 2019). The model is fine-tuned with the AdamW (Loshchilov and Hutter 2017) optimizer using a learning rate in the range of $\{1 \times 10^{-5}, 2 \times 10^{-5}, 5 \times 10^{-5}\}$ and batch size in $\{16, 32\}$. The learning rate is scheduled by a linear warm-up (Goyal et al. 2017) for the first 6% of steps followed by a linear decay to 0. The maximum number of training epochs is set to 10. For IsoBN, the momentum $\alpha$ is set to 0.95, $\epsilon$ is set to 0.1, and the normalization strength $\beta$ is chosen in the range of $\{0.25, 0.5, 1\}$.

We apply early stopping according to task-specific metrics on the dev set. We select the best combination of hyper-parameters on the dev set. We fine-tune the PTLMs with 5 different random seeds and report the median and standard deviation of metrics on the dev set.

5.2 Experimental Results

We evaluate IsoBN on two PTLMs (BERT-base-cased and RoBERTa-large) and seven NLU tasks from the GLUE benchmark (Wang et al. 2018b). The experimental results are shown in Table 1. Using IsoBN improves the evaluation metrics on all datasets. The average score increases by about 1.0 point for BERT-base and 0.8 points for RoBERTa-large. For small datasets (MRPC, RTE, CoLA, and STS-B (Cer et al. 2017)), IsoBN obtains an average performance improvement of 1.6 points on BERT and 1.3 points on RoBERTa. For large datasets (MNLI (Williams, Nangia, and Bowman 2018), QNLI (Rajpurkar et al. 2016), and SST-2 (Socher et al. 2013)), IsoBN obtains an average performance improvement of 0.15 points on BERT and 0.25 points on RoBERTa. This experiment shows that by improving the isotropy of embeddings, IsoBN results in better fine-tuning performance.
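The modified IsoBN transformation of Section 4.2 (Algorithm 1 together with Eqs. 4, 6, 7, and 8) can be sketched in NumPy as follows. This is an illustrative reading of the equations as printed, not the authors' released implementation: the class name `IsoBNSketch`, the initial values of the moving statistics, and the toy batch are assumptions made here, and no gradient handling is included.

```python
import numpy as np

class IsoBNSketch:
    """Illustrative NumPy reading of Algorithm 1 with Eqs. 4, 6, 7, 8."""

    def __init__(self, dim, momentum=0.95, eps=0.1, beta=1.0):
        # Initial moving statistics are an assumption of this sketch.
        self.sigma = np.ones(dim)   # moving standard deviation
        self.cov = np.eye(dim)      # moving covariance
        self.momentum, self.eps, self.beta = momentum, eps, beta

    def __call__(self, h, training=True):
        # h: (m, d) mini-batch of [CLS] embeddings.
        if training:
            centered = h - h.mean(axis=0, keepdims=True)
            sigma_b = np.sqrt((centered ** 2).mean(axis=0))
            cov_b = centered.T @ centered / h.shape[0]
            self.sigma += self.momentum * (sigma_b - self.sigma)
            self.cov += self.momentum * (cov_b - self.cov)
        rho = self.cov / np.outer(self.sigma, self.sigma)
        gamma = (rho ** 2).sum(axis=1)                           # Eq. 4
        theta = (self.sigma * gamma + self.eps) ** (-self.beta)  # Eq. 6
        var = self.sigma ** 2
        # Eq. 7, implemented as printed (ratio of summed variances).
        theta_bar = var.sum() / (var * theta ** 2).sum() * theta
        return theta_bar * h                                     # Eq. 8, mean not subtracted

# Toy batch (assumption): dimensions 0-2 are duplicates, dimension 3 is independent.
rng = np.random.default_rng(0)
z = rng.normal(size=(256, 2))
h = np.stack([z[:, 0], z[:, 0], z[:, 0], z[:, 1]], axis=1)
layer = IsoBNSketch(dim=4)
out = layer(h)
```

In the toy batch the three duplicated dimensions receive a soft group size $\gamma_i$ well above 1, so they are scaled down more strongly than the independent dimension; this is the down-weighting of correlated groups that the block-diagonal assumption is meant to capture.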
Method               Avg    MNLI         QNLI        RTE          SST-2       MRPC        CoLA         STS-B
BERT-base (ReImp)    81.37  83.83 (.07)  90.82 (.1)  67.87 (1.1)  92.43 (.7)  85.29 (.9)  60.72 (1.4)  88.64 (.7)
BERT-base-IsoBN      82.36  83.91 (.1)   91.04 (.1)  70.75 (1.6)  92.54 (.1)  87.50 (.6)  61.59 (1.6)  89.19 (.7)
RoBERTa-L (ReImp)    88.16  90.48 (.07)  94.70 (.1)  84.47 (1.0)  96.33 (.3)  90.68 (.9)  68.25 (1.1)  92.24 (.2)
RoBERTa-L-IsoBN      88.98  90.69 (.05)  94.91 (.1)  87.00 (1.3)  96.67 (.3)  91.42 (.8)  69.70 (.8)   92.51 (.2)

Table 1: Empirical results on the dev sets of seven GLUE tasks. We run 5 times with different random seeds and report the median, with the standard deviation in parentheses. IsoBN outperforms the conventional fine-tuning method by around 1.0 absolute point on average.

EV1 / EV2 / EV3     MRPC                RTE                 CoLA                STS-b
BERT-base           0.76 / 0.87 / 0.89  0.88 / 0.93 / 0.95  0.49 / 0.58 / 0.64  0.89 / 0.94 / 0.96
BERT-base+BN        0.74 / 0.84 / 0.86  0.70 / 0.89 / 0.93  0.37 / 0.59 / 0.63  0.69 / 0.88 / 0.92
BERT-base+IsoBN     0.37 / 0.68 / 0.77  0.49 / 0.72 / 0.85  0.25 / 0.37 / 0.48  0.41 / 0.69 / 0.85
RoBERTa-L           0.86 / 0.90 / 0.91  0.53 / 0.66 / 0.70  0.83 / 0.88 / 0.90  0.87 / 0.90 / 0.92
RoBERTa-L+BN        0.64 / 0.73 / 0.76  0.36 / 0.50 / 0.57  0.61 / 0.70 / 0.75  0.65 / 0.72 / 0.77
RoBERTa-L+IsoBN     0.18 / 0.36 / 0.43  0.15 / 0.29 / 0.37  0.21 / 0.38 / 0.49  0.17 / 0.32 / 0.45

Table 2: The explained variance on BERT-base and RoBERTa-large. Compared to batch normalization, our method can greatly reduce the explained variance and thus improve the isotropy of embeddings.
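The values in Table 2 follow the explained-variance definition of Section 5.3 (Eq. 9): the share of the squared singular values of the embedding matrix carried by the top k. A minimal NumPy sketch of that formula is given below. The function name `explained_variance` and the optional `center` flag are assumptions introduced here (Eq. 9 is stated directly on the matrix h, so centering is off by default), and the random matrices are stand-ins for real [CLS] embeddings.

```python
import numpy as np

def explained_variance(h, k, center=False):
    """EV_k of an (N, d) embedding matrix h, following Eq. 9.

    center is a hypothetical option of this sketch: if True, the
    per-dimension mean is subtracted before the decomposition.
    """
    h = np.asarray(h, dtype=np.float64)
    if center:
        h = h - h.mean(axis=0, keepdims=True)
    # Singular values are returned in descending order.
    singular_values = np.linalg.svd(h, compute_uv=False)
    squared = singular_values ** 2
    return float(squared[:k].sum() / squared.sum())

# Stand-in data (assumption): an isotropic cloud, and the same cloud with
# one direction stretched so that it dominates the spectrum.
rng = np.random.default_rng(0)
isotropic = rng.normal(size=(2000, 8))
collapsed = isotropic * np.array([10.0, 1, 1, 1, 1, 1, 1, 1])
ev1_iso = explained_variance(isotropic, 1)
ev1_col = explained_variance(collapsed, 1)
```

Here `ev1_col` is far larger than `ev1_iso`, matching the reading of the metric used in the paper: a large EV_k means most variation sits in the first few directions, and a small EV_k indicates a more isotropic space. Multiplying the input by a constant leaves the value unchanged, since the constant cancels in the ratio.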

5.3 Experiments of EVk Metric

To quantitatively measure the isotropy of embeddings, we propose to use explained variance (EV) as the metric for isotropy, which is defined as:

$$\mathrm{EV}_k(h) = \frac{\sum_{i=1}^{k} \lambda_i^2}{\sum_{j=1}^{d} \lambda_j^2}, \tag{9}$$

where $h \in \mathbb{R}^{N \times d}$ is the matrix of [CLS] embeddings and $\lambda_i$ is the $i$-th largest singular value of $h$. Note that $N$ is the number of sentences in a certain corpus, and $d$ is the dimension of hidden states in the last layer of a pre-trained language model.

This metric measures the difference of variance in different directions of the embedding space. Intuitively, if the EVk value is small, the variations of the embedding tend to distribute equally in all directions and lead to a more angularly symmetric representation. If the EVk value is large, most of the variations will concentrate on the first few directions, and the embedding space will degrade to a narrow cone. Thus, EVk is a good metric of the isotropy of the embedding space.

We use EVk as the isotropy metric because it enjoys two beneficial properties:
• It is invariant to the magnitude of the embeddings, and thus comparisons between different models and datasets are fairer.
• It is also invariant to the mean value of the embeddings, aligning with the sentence classification/regression tasks of our interest.

We compute the EVk metric on two PTLMs (BERT-base and RoBERTa-large) and 4 tasks (MRPC, RTE, CoLA, STS-B). For IsoBN, the normalization strength β is 1. We show the first three EVk values (EV1, EV2, and EV3) in Table 2.

We observe that before normalization, the pre-trained [CLS] embeddings have very high EVk values. The average EV3 value is around 0.86 for both BERT-base and RoBERTa-large. For some datasets (e.g., STS-b), the top three principal components already explain over 90% of the variance. Batch normalization can only reduce the EVk value by a small margin (0.025 for BERT and 0.148 for RoBERTa on EV3), because it ignores the correlations among embedding dimensions. Our proposed IsoBN greatly reduces the EVk value (0.123 for BERT and 0.425 for RoBERTa on EV3).

We also visualize the distribution of EVk values in Figure 7. We choose the first 50 EVk values for BERT-base and the first 200 EVk values for RoBERTa-large. We observe that with IsoBN, we can decrease the EVk value of pre-trained embeddings. This experiment shows that compared to batch normalization, IsoBN can further improve the isotropy of the [CLS] embeddings of PTLMs.

Figure 7: The EVk value on BERT-base and RoBERTa-large with no normalization, batch normalization, and IsoBN. We choose the max K to be 50 for BERT and 200 for RoBERTa. Compared to batch normalization, IsoBN can greatly reduce the EVk value and thus improve the isotropy of [CLS] embeddings.
[Figure 7 panels: BERT and RoBERTa on MRPC, RTE, CoLA, STS-b; curves NoNorm, BN, IsoBN; x-axis: K; y-axis: EV_K(H).]

6 Related Work

Input Normalization. Normalizing inputs (Montavon and Müller 2012; He et al. 2016; Szegedy et al. 2017) and gradients (Schraudolph 1998; Bjorck, Gomes, and Selman 2018) has been known to be beneficial for training deep neural networks. Batch normalization (Ioffe and Szegedy 2015) normalizes neural activations to zero-mean and unit-variance using batch statistics (mean, standard deviation). It has been shown both empirically and theoretically that batch normalization can greatly smooth the loss landscape (Santurkar et al. 2018; Bjorck, Gomes, and Selman 2018; Ghorbani, Krishnan, and Xiao 2019), which leads to faster convergence and better generalization. One drawback of batch normalization is that it ignores the correlations between input dimensions. Some methods (Huang et al. 2018, 2019) seek to calculate the full whitening transformation, while IsoBN greatly simplifies the process by using the block-diagonal assumption, and succeeds in reducing the effect of dominating principal components by simple scaling of input dimensions.

Fine-tuning Language Models. Pre-trained language models (Devlin et al. 2019; Liu et al. 2019b) achieve state-of-the-art performance on various natural language understanding tasks. To adapt the pre-trained language model to target tasks, the common practice, in which a task-specific classifier is added to the network and jointly trained with PTLMs, already produces good results (Peters, Ruder, and Smith 2019). Some works show that with well-designed fine-tuning strategies, the model performance can be further improved, by adversarial training (Zhu et al. 2020; Jiang et al. 2019), gradual unfreezing (Howard and Ruder 2018; Peters, Ruder, and Smith 2019), or multi-tasking (Clark et al. 2019; Liu et al. 2019a). To the best of our knowledge, this is the first work to study the isotropy of text representation for fine-tuning.

Spectral Control. The spectrum of representation has been studied in many subareas. On pre-trained word embeddings like Glove (Pennington, Socher, and Manning 2014), some work (Mu, Bhat, and Viswanath 2018) shows that removing the top one principal component leads to better performance on text similarity tasks. On contextual embeddings like Elmo (Peters et al. 2018) and GPT (Radford 2018), the spectrum is used as a measure of the ability to capture contextual knowledge (Ethayarajh 2019). On text generation, it is shown that large eigenvalues hurt the expressiveness of representation and cause the degradation problem (Gao et al. 2019; Wang et al. 2020). In contrast, we study the spectrum of the [CLS] embedding in pre-trained language models, and show that controlling the top principal components can improve fine-tuning performance.

7 Conclusion

Our major contributions in this paper are two-fold:
• We study the isotropy of the pre-trained [CLS] embeddings. Our analysis is based on straightforward visualization of standard deviation and correlation coefficient.
• The proposed regularization method, IsoBN, stably improves the fine-tuning of BERT and RoBERTa towards more isotropic representations, yielding an absolute increment of around 1.0 point on 7 popular NLU tasks.
We hope our work points to interesting future research directions in improving pre-training language models as well as better fine-tuning towards more isotropy of PTLMs. We will release our code at https://github.com/INK-USC/IsoBN/.

References

Arora, S.; Liang, Y.; and Ma, T. 2016. A simple but tough-to-beat baseline for sentence embeddings. In ICLR.

Bjorck, J.; Gomes, C. P.; and Selman, B. 2018. Understanding Batch Normalization. In NeurIPS.

Cer, D. M.; Diab, M. T.; Agirre, E.; Lopez-Gazpio, I.; and Specia, L. 2017. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. ArXiv abs/1708.00055.

Clark, K.; Luong, M.-T.; Khandelwal, U.; Manning, C. D.; and Le, Q. V. 2019. BAM! Born-Again Multi-Task Networks for Natural Language Understanding. ArXiv abs/1907.04829.

Cogswell, M.; Ahmed, F.; Girshick, R. B.; Zitnick, C. L.; and Batra, D. 2015. Reducing Overfitting in Deep Networks by Decorrelating Representations. CoRR abs/1511.06068.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT.

Dodge, J.; Ilharco, G.; Schwartz, R.; Farhadi, A.; Hajishirzi, H.; and Smith, N. A. 2020. Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping. ArXiv abs/2002.06305.

Ethayarajh, K. 2019. How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings. ArXiv abs/1909.00512.

Gao, J.; He, D.; Tan, X.; Qin, T.; Wang, L.; and Liu, T.-Y. 2019. Representation Degeneration Problem in Training Natural Language Generation Models. ArXiv abs/1907.12009.

Ghorbani, B.; Krishnan, S.; and Xiao, Y. 2019. An Investigation into Neural Net Optimization via Hessian Eigenvalue Density. In ICML.

Goyal, P.; Dollár, P.; Girshick, R. B.; Noordhuis, P.; Wesolowski, L.; Kyrola, A.; Tulloch, A.; Jia, Y.; and He, K. 2017. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. ArXiv abs/1706.02677.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.

Howard, J.; and Ruder, S. 2018. Universal Language Model Fine-tuning for Text Classification. In ACL.

Huang, L.; Yang, D.; Lang, B.; and Deng, J. 2018. Decorrelated Batch Normalization. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition 791–800.

Huang, L.; Zhou, Y.; Zhu, F.; Liu, L.; and Shao, L. 2019. Iterative Normalization: Beyond Standardization Towards Efficient Whitening. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 4869–4878.

Ioffe, S.; and Szegedy, C. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ArXiv abs/1502.03167.

Jiang, H.; cheng He, P.; Chen, W.; Liu, X.; Gao, J.; and Zhao, T. 2019. SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization.

Loshchilov, I.; and Hutter, F. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

Montavon, G.; and Müller, K.-R. 2012. Deep Boltzmann Machines and the Centering Trick. In Neural Networks: Tricks of the Trade.

Mu, J.; Bhat, S.; and Viswanath, P. 2018. All-but-the-Top: Simple and Effective Postprocessing for Word Representations. ArXiv abs/1702.01417.

Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12: 2825–2830.

Pennington, J.; Socher, R.; and Manning, C. D. 2014. Glove: Global Vectors for Word Representation. In EMNLP.

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. ArXiv abs/1802.05365.

Peters, M. E.; Ruder, S.; and Smith, N. A. 2019. To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks. In RepL4NLP@ACL.

Radford, A. 2018. Improving Language Understanding by Generative Pre-Training.

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In EMNLP.

Rosipal, R.; Girolami, M.; Trejo, L. J.; and Cichocki, A. 2001. Kernel PCA for feature extraction and de-noising in nonlinear regression. Neural Computing & Applications 10(3): 231–243.

Santurkar, S.; Tsipras, D.; Ilyas, A.; and Madry, A. 2018. How Does Batch Normalization Help Optimization? In NeurIPS.

Schraudolph, N. N. 1998. Accelerated Gradient Descent by Factor-Centering Decomposition.

Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C. D.; Ng, A. Y.; and Potts, C. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In EMNLP.

Szegedy, C.; Ioffe, S.; Vanhoucke, V.; and Alemi, A. 2017. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In AAAI.

Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. R. 2018a. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In
ArXiv abs/1911.03437. BlackboxNLP@EMNLP. Liu, X.; He, P.; Chen, W.; and Gao, J. 2019a. Multi-Task Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Deep Neural Networks for Natural Language Understanding. Bowman, S. R. 2018b. Glue: A multi-task benchmark and In ACL. analysis platform for natural language understanding. arXiv Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M. S.; Chen, D.; preprint arXiv:1804.07461 . Levy, O.; Lewis, M.; Zettlemoyer, L. S.; and Stoyanov, V. Wang, L.; Huang, J.; Huang, K. Y.; Hu, Z.; Wang, G.; and 2019b. RoBERTa: A Robustly Optimized BERT Pretraining Gu, Q. 2020. Improving Neural Language Generation with Approach. ArXiv abs/1907.11692. Spectrum Control. In ICLR 2020. Williams, A.; Nangia, N.; and Bowman, S. R. 2018. A Broad-Coverage Challenge Corpus for Sentence Understand- ing through Inference. ArXiv abs/1704.05426. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; and Brew, J. 2019. HuggingFace’s Transformers: State-of-the-art Natural Language Processing. ArXiv abs/1910.03771. Zhu, C.; Cheng, Y.; Gan, Z.; Sun, S.; Goldstein, T.; and Liu, J. 2020. FreeLB: Enhanced Adversarial Training for Natural Language Understanding. In International Conference on Learning Representations.