The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

IsoBN: Fine-Tuning BERT with Isotropic Batch Normalization

Wenxuan Zhou, Bill Yuchen Lin, Xiang Ren Department of Computer Science, University of Southern California, Los Angeles, CA {zhouwenx, yuchen.lin, xiangren}@usc.edu

Abstract

Fine-tuning pre-trained language models (PTLMs), such as BERT and its better variant RoBERTa, has been a common practice for advancing performance in natural language understanding (NLU) tasks. Recent advances in representation learning show that isotropic (i.e., unit-variance and uncorrelated) embeddings can significantly improve performance on downstream tasks with faster convergence and better generalization. The isotropy of the pre-trained embeddings in PTLMs, however, is relatively under-explored. In this paper, we analyze the isotropy of the pre-trained [CLS] embeddings of PTLMs with straightforward visualization, and point out two major issues: high variance in their standard deviation, and high correlation between different dimensions. We also propose a new network regularization method, isotropic batch normalization (IsoBN), to address these issues, towards learning more isotropic representations in fine-tuning by dynamically penalizing dominating principal components. This simple yet effective fine-tuning method yields about 1.0 absolute increment on the average of seven NLU tasks.

Figure 1: Illustration of the isotropic batch normalization (IsoBN). The [CLS] embedding produced by the PTLM is normalized by its standard deviation and pairwise correlation coefficients to obtain a more isotropic representation before it is passed to the classifier.

Introduction

Pre-trained language models (PTLMs), such as BERT (Devlin et al. 2019) and RoBERTa (Liu et al. 2019b), have revolutionized the area of natural language understanding (NLU). Fine-tuning PTLMs has advanced performance on many benchmark NLU datasets such as GLUE (Wang et al. 2019a). The most common fine-tuning method is to continue training the pre-trained model parameters together with a few additional task-specific layers. The PTLM and the task-specific layers are usually connected by the embeddings of [CLS] tokens, which are regarded as sentence representations.

Recent works on text representation (Arora, Liang, and Ma 2017; Mu and Viswanath 2018; Gao et al. 2019; Wang et al. 2020) have shown that regularizing word embeddings to be more isotropic (i.e., rotationally invariant) can significantly improve their performance on downstream tasks. An ideally isotropic embedding space has two major merits: a) all dimensions have the same variance, and b) all dimensions are uncorrelated with each other. These findings align with conventional feature normalization techniques (Cogswell et al. 2016; Ioffe and Szegedy 2015; Huang et al. 2018), which aim to transform input features into normalized¹, uncorrelated representations for faster convergence and better generalization ability.

It remains an open question, however, how isotropic the representations of PTLMs are. In particular, we want to understand the isotropy of pre-trained [CLS] embeddings in PTLMs, and how we can improve it towards better fine-tuning for downstream tasks. In this paper, we first argue why we want more isotropic embeddings for the [CLS] tokens. Our analysis reveals that the dominating principal components largely hinder the fine-tuning process from using the knowledge in other components, due to the lack of isotropy. Then, we analyze the isotropy of the pre-trained [CLS] embeddings. There are two essential aspects of an isotropic embedding space: unit variance and uncorrelatedness. Thus, we start our analysis by visualizing the standard deviation and Pearson correlation coefficient of pre-trained [CLS] embeddings in BERT and RoBERTa on several NLU datasets.

Our visualization and quantitative analysis find that: 1) the [CLS] embeddings have very different variance across dimensions, and 2) the [CLS] embeddings contain a few large clusters of dimensions that are highly correlated with each other. Both findings indicate that pre-trained contextualized word embeddings are far from being isotropic, i.e., normalized and uncorrelated. Therefore, this undesired prior bias from PTLMs may result in sub-optimal performance in fine-tuning for target tasks.

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
¹We use 'normalized' to refer to unit variance in this paper.

Given that pre-trained [CLS] embeddings are very anisotropic, a natural research question is then: how can we regularize the fine-tuning process towards more isotropic embeddings? There are two common methods for improving the isotropy of feature representations: whitening transformation and batch normalization (Ioffe and Szegedy 2015). However, neither is practically suitable in the scenario of fine-tuning PTLMs. Whitening transformation requires calculating the inverse of the covariance matrix, which is ill-conditioned for PTLMs' embeddings; calculating the inverse is thus numerically unstable, computationally expensive, and incompatible with half-precision training. Batch normalization alleviates the inverse-computation issue by assuming that the covariance matrix is diagonal, which in turn completely ignores the influence of correlation between dimensions.

Motivated by this research question and the limitations of existing methods, we propose a new network regularization method, isotropic batch normalization (IsoBN). As shown in Figure 1, the proposed method is based on our observation that the embedding dimensions can be seen as several groups of highly-correlated dimensions. Our intuition is thus to assume that the absolute correlation coefficient matrix is a block-diagonal binary matrix, instead of only a diagonal matrix: the dimensions of the same group have an absolute correlation coefficient of 1 (duplicates of each other), and dimensions in different groups have a correlation coefficient of 0 (uncorrelated).

This method greatly reduces the computational effort of calculating the inverse, and better models the characteristics of the pre-trained [CLS] embeddings. Our experiments show that IsoBN indeed improves both BERT and RoBERTa in fine-tuning, yielding about 1.0 absolute increment on the average of a wide range of 7 GLUE benchmark tasks. We also empirically analyze the isotropy increment brought by IsoBN via explained variance, which clearly shows that IsoBN produces much more isotropic embeddings than the conventional batch normalization method.

To the best of our knowledge, this work is the first to study the isotropy of the pre-trained [CLS] embeddings. We believe our findings and the proposed IsoBN method will inspire interesting future research directions in improving the pre-training of language models as well as better fine-tuning towards more isotropy of PTLMs.

Why Isotropic [CLS] Embeddings?

We formally show our analysis on the principal components of the [CLS] embeddings. Our findings reveal that with random weight initialization, the first few principal components usually take the majority of contributions for the prediction results.

Background knowledge. For text classification, the input text x is first encoded by the PTLM to the feature h (i.e., its [CLS] embedding), and then classified by a randomly-initialized softmax layer:

h = \mathrm{PTLM}(x); \quad p_i = \frac{\exp(W_i^T h)}{\sum_{j=1}^{c} \exp(W_j^T h)},

where W \in R^{d \times c} is a randomly-initialized learnable parameter. It learns to map the underlying features extracted by the PTLM into target classes for input examples.

Previous work (Dodge et al. 2020) has shown that the initialized weight W_init of the classifier has a large impact on model performance. We further find that the final converged weights nearly remain the same as the initialization after fine-tuning. We visualize this surprising phenomenon in Figure 2. We first project both W_init and W onto the subspace spanned by the top 10 eigenvectors of Cov(h) to remove the unimportant components, then use two similarity metrics (cosine similarity and L2 distance) to measure the difference between the initialized weight W_init and the fine-tuned weight W. We observe that the cosine similarity between W_init and W is extremely close to 1, and their L2 distance is close to 0. This suggests that the weight W of the classifier in the fine-tuned model is almost determined at initialization. For example, there is a 0.9997 cosine similarity between initialized and fine-tuned weights on CoLA with RoBERTa.

Figure 2: Average cosine similarity and L2 distance of the fine-tuned weight W to the initialized weight W_init during the entire training process. Both measures suggest that the change of the weight is very subtle.

Dominating Principal Components. Knowing that the weight of the classifier is nearly fixed, we infer that the classifier may not capture the discriminative information for classification during fine-tuning. We measure the informativeness of each principal component by comparing the variance of logits produced by it (between the fine-tuned classifier and the optimal classifier). Specifically, we fit a logistic regression model on the entire training data using the scikit-learn (Pedregosa et al. 2011) framework to obtain the optimal classifier. The variance Var_i of logits along the i-th principal component (w_i, v_i) is calculated by:

\mathrm{Var}_i = w_i \cdot (W^T v_i)^2,

where w_i and v_i are the i-th eigenvalue and eigenvector. Intuitively, a decent classifier should maximize the variance along informative principal components and minimize it along irrelevant ones. We show the average proportion of variance in Figure 3.

As shown in our experiments, the top principal components are constantly exaggerated by the fine-tuned weights, especially at the beginning of the training process. The first principal component accounts for over 90% of the variance of logits throughout training, which thus hinders learning from other useful components. This motivates us to penalize the top principal components dynamically during fine-tuning, to avoid losing knowledge of PTLMs due to dominating components.
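To make the measurement above concrete, the following is a minimal sketch (our illustration, not the authors' released code) of how the per-component logit variance Var_i can be computed with NumPy and scikit-learn; the arrays cls_embeddings and labels are illustrative placeholders for features extracted from a PTLM.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# cls_embeddings: (N, d) matrix of [CLS] features, labels: (N,) class ids.
# Random placeholders stand in for real PTLM features here.
cls_embeddings = np.random.randn(1000, 768)
labels = np.random.randint(0, 2, size=1000)

# Fit the "optimal" linear classifier on the full training features,
# as done in the paper with scikit-learn.
clf = LogisticRegression(max_iter=1000).fit(cls_embeddings, labels)
W = clf.coef_.T                                   # (d, c) weight matrix

# Eigen-decomposition of the feature covariance gives the principal
# components (w_i: eigenvalue, v_i: eigenvector).
cov = np.cov(cls_embeddings, rowvar=False)        # (d, d)
eigvals, eigvecs = np.linalg.eigh(cov)            # ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Var_i = w_i * (W^T v_i)^2, summed over output columns, then turned into
# a proportion so top components can be compared as in Figure 3.
var_i = eigvals * np.sum((W.T @ eigvecs) ** 2, axis=0)
proportion = var_i / var_i.sum()
print("variance share of top-1/5/30 components:",
      proportion[:1].sum(), proportion[:5].sum(), proportion[:30].sum())
```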

Figure 3: The average percentage of the variance of logits produced by the top 1, 5, and 30 principal components, for BERT and RoBERTa on MRPC, RTE, and CoLA. The top principal components are consistently exaggerated.

Figure 4: The distribution of the standard deviation (std) of the coordinates of pre-trained [CLS] embeddings. We show the results of BERT-base and RoBERTa-Large on 4 NLU datasets. Note that a (nearly) isotropic embedding space should have (almost) zero variance of the std (i.e., 100% of the dimensions have the same std).

How Isotropic Are [CLS] Embeddings?

The [CLS] embeddings of PTLMs, regarded as sentence representations, are directly used for fine-tuning (e.g., BERT and RoBERTa) towards a wide range of downstream tasks. Given their impact in fine-tuning, we want to understand their isotropy. As unit variance and uncorrelatedness are two essential aspects of an isotropic space, we start our investigation by analyzing the standard deviation and the Pearson correlation coefficient.

Specifically, we take the corpora of four popular NLU tasks (MRPC, RTE, CoLA, and STS-B) from the GLUE benchmark datasets (Wang et al. 2019b), and then analyze their pre-trained [CLS] embeddings in terms of standard deviation and correlation respectively.

Analysis of Standard Deviation

To visualize the standard deviation of the embeddings, we feed the input sentences of the whole training dataset to the PTLMs, then calculate the standard deviation of their [CLS] embeddings, and finally obtain the distribution.

The standard deviation of a nearly isotropic embedding space should concentrate on a very small range of values. Simply put, an ideally isotropic embedding space should have a small variance in the distribution of its standard deviations, i.e., all dimensions of the [CLS] embeddings should have almost the same standard deviation.

As shown in Figure 4, neither BERT nor RoBERTa has this desired property for pre-trained [CLS] embeddings. The standard deviations of the embeddings vary over a very wide range of values (e.g., [10^-10, 1] in BERT for MRPC). Interestingly, RoBERTa is evidently better than BERT from this perspective (e.g., usually ranging in [0.01, 1]). However, the [CLS] embeddings of RoBERTa are still far from being isotropic, as there is no significantly dominant standard deviation value around which the distribution centers.

Analysis of Correlation Coefficient

Correlation between different dimensions of the [CLS] embeddings is an essential aspect of isotropy. Embeddings with low correlation between dimensions usually show better generalization on downstream applications (Cogswell et al. 2016). It is, however, relatively ignored by many neural network regularization methods, such as batch normalization (Ioffe and Szegedy 2015).

In order to better visualize the Pearson correlation coefficients of the [CLS] embeddings, we cluster the dimensions by their pairwise coefficients, and then re-arrange the dimension indexes, such that highly correlated dimensions are located near each other. The absolute values of the correlations are shown in Figure 5, where darker cells mean higher correlation.

We can see that both BERT and RoBERTa usually have very high correlations between different dimensions (i.e., most cells are in dark blue), although the situation is less severe in a few cases such as BERT on CoLA and RoBERTa on RTE. We find that BERT's embeddings have several large clusters of correlated features, while RoBERTa tends to have a single extremely large cluster.

In either case, such high correlation between embedding dimensions is harmful to subsequent fine-tuning. Recall that the [CLS] embeddings are usually connected to a linear classifier which is uniformly initialized. In the beginning of the fine-tuning process, the classifier will be biased towards these correlated features since they gain more importance in back-propagation. This undesired prior prevents models from exploiting other potentially valuable features, and thus requires more training data or epochs to converge and generalize in downstream tasks.
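For readers who want to reproduce this kind of inspection, a minimal sketch (our illustration, not the authors' code) of collecting pre-trained [CLS] embeddings with HuggingFace Transformers and computing their per-dimension standard deviation and absolute correlation matrix could look as follows; the sentence list is a placeholder, whereas the paper uses full GLUE training sets.

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder sentences; in practice, use all sentences of a GLUE training set.
sentences = ["The quick brown fox jumps over the lazy dog.",
             "A man is playing a guitar on stage."]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased").eval()

embeddings = []
with torch.no_grad():
    for text in sentences:
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        outputs = model(**inputs)
        # The [CLS] embedding is the first token of the last hidden layer.
        embeddings.append(outputs.last_hidden_state[0, 0].numpy())
embeddings = np.stack(embeddings)            # (N, d)

# Per-dimension standard deviation (the Figure 4-style statistic).
std = embeddings.std(axis=0)
print("std range:", std.min(), std.max())

# Absolute Pearson correlation between dimensions (the Figure 5-style statistic).
abs_corr = np.abs(np.corrcoef(embeddings, rowvar=False))
print("mean absolute correlation:", abs_corr.mean())
```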

Figure 5: Absolute Pearson correlation coefficients between dimensions of pre-trained [CLS] embeddings. We show the results of BERT-base-cased and RoBERTa-Large on four NLU datasets. Note that the dimension indexes of the matrices are rearranged by the clustering results. Ideally, an isotropic embedding space should be 1 (darkest blue) on the diagonal and 0 (white) on all other cells. A dark block in a matrix means a cluster of features that are highly correlated with each other.

We argue that these two findings together indicate that pre-trained language models are far from being isotropic (i.e., normalized and uncorrelated), and that this undesired prior bias may result in sub-optimal model performance in fine-tuning.

Approach

Based on our analysis above, we propose a new regularization method, isotropic batch normalization, towards learning more isotropic representations of the [CLS] tokens and thus better fine-tuning of PTLMs. We first introduce some background knowledge about whitening and conventional batch normalization methods, then formally introduce the proposed IsoBN, and finally show the implementation details.

Whitening and Batch Normalization

To improve the isotropy of feature representations, there are two widely-used methods: 1) whitening transformation and 2) batch normalization (Ioffe and Szegedy 2015).

Whitening transformation changes the input vector into a white-noise vector, and can be defined as the transformation function:

\hat{h} = \Sigma^{-\frac{1}{2}} (h - \mu \cdot 1^T), \quad (1)

where \Sigma \in R^{d \times d} is the covariance matrix of the input h \in R^{d \times N}, and \mu \in R^d is the mean of h. Thus, the transformation is a mapping from R^{d \times N} to R^{d \times N}. This transformation produces a perfectly isotropic embedding space, where the dimensions are uncorrelated and have the same variance. It can be applied in either feature pre-processing (Rosipal et al. 2001) or neural network training (Huang et al. 2018). A similar method is to remove a few top principal components from the embedding space (Arora, Liang, and Ma 2017; Mu and Viswanath 2018). However, these methods are hard to apply in fine-tuning PTLMs, as they require calculating the inverse of the covariance matrix. As shown in our analysis, the embeddings of PTLMs contain groups of highly-correlated dimensions. The covariance matrices are therefore ill-conditioned, and calculating the inverse is numerically unstable; it is also computationally expensive and incompatible with half-precision training.

Batch normalization (BN) aims to avoid the inverse computation by assuming that the covariance matrix is diagonal, so that the whitening function becomes:

\hat{h} = \Lambda^{-1} (h - \mu \cdot 1^T), \quad (2)

where \Lambda = \mathrm{diag}(\sigma_1, ..., \sigma_d) is a diagonal matrix consisting of the standard deviation of each input dimension. Batch normalization greatly improves the stability and model performance in training deep neural networks.

However, it completely ignores the influence of correlation in the embeddings, and is thus not suitable for the [CLS] embeddings of interest, where high correlation is a critical issue that needs to be addressed. We seek to design a novel normalization method specifically for fine-tuning PTLMs, which can be computed efficiently yet still moves representations towards the isotropy property.
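As a reference point for the two baselines above, here is a minimal NumPy sketch (our illustration, with illustrative variable names and synthetic data) of full whitening (Eq. 1) versus the diagonal approximation made by batch normalization (Eq. 2), together with the condition number that makes the full inverse fragile for highly-correlated features.

```python
import numpy as np

# h: (d, N) feature matrix with two nearly-duplicated dimensions,
# mimicking the correlated groups observed in [CLS] embeddings.
rng = np.random.default_rng(0)
base = rng.standard_normal((4, 256))
h = np.vstack([base, base[:2] + 0.01 * rng.standard_normal((2, 256))])

mu = h.mean(axis=1, keepdims=True)
cov = np.cov(h)                                  # (d, d), poorly conditioned

# Eq. 1: full whitening, Sigma^{-1/2} (h - mu). Needs an inverse square
# root of the covariance (a small eigenvalue floor guards against blow-up).
eigvals, eigvecs = np.linalg.eigh(cov)
inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(np.maximum(eigvals, 1e-8))) @ eigvecs.T
h_whitened = inv_sqrt @ (h - mu)

# Eq. 2: batch-normalization-style scaling, Lambda^{-1} (h - mu),
# which only uses per-dimension standard deviations.
sigma = h.std(axis=1, keepdims=True)
h_bn = (h - mu) / sigma

def max_offdiag(m):
    """Largest absolute off-diagonal entry of a square matrix."""
    return np.abs(m - np.diag(np.diag(m))).max()

print("condition number of covariance:", np.linalg.cond(cov))
print("max off-diagonal |corr| after whitening:", max_offdiag(np.corrcoef(h_whitened)))
print("max off-diagonal |corr| after BN:", max_offdiag(np.corrcoef(h_bn)))
```

The sketch illustrates the trade-off discussed in the text: whitening removes the correlations but depends on inverting an ill-conditioned covariance, while BN is cheap but leaves the correlations untouched.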
Figure 6: Illustration of the assumptions made by batch normalization and by our IsoBN, with the real correlation matrix as a reference (panels, left to right: Real Correlation, BN Assumption, IsoBN Assumption). IsoBN assumes that the absolute correlation matrix is block-diagonal, while batch normalization completely ignores the correlation.

Isotropic Batch Normalization

Recall Figure 5: from the correlation matrices of pre-trained embeddings, we observe that on most datasets the correlation matrix is nearly block-diagonal². That is, the embedding dimensions form several clusters of highly-correlated dimensions. Dimensions within the same cluster have an absolute correlation coefficient of nearly 1, while dimensions from different clusters are almost uncorrelated. Inspired by this, we propose an enhanced simplification of the covariance matrix. We assume that the absolute correlation coefficient matrix is a block-diagonal binary matrix: the embedding dimensions can be clustered into m groups G_1, ..., G_m, where dimensions of the same group have an absolute correlation coefficient of 1 (duplicates of each other), and dimensions in different groups have a correlation coefficient of 0 (uncorrelated). This assumption is illustrated in Figure 6 as a conceptual comparison.

²A block-diagonal matrix is a square block matrix whose main-diagonal blocks are square matrices and whose off-diagonal blocks are zero matrices.

Compared with conventional batch normalization, our assumption takes account of the correlations and is thus a more accurate approximation of the realistic correlation coefficient matrices. Thereby, instead of whitening the correlation matrix, we want to make the influence of each group of dimensions similar in the fine-tuning process.

We first normalize each dimension to unit variance, similar to batch normalization, for convenience of further derivation. Under our assumption, this makes the dimensions in the same group exactly the same as each other. Then, dimension i of group G_{g(i)} is repeated in the embeddings |G_{g(i)}| times. Therefore, the normalization transformation becomes:

\hat{h}^{(i)} = \frac{1}{\sigma_i \cdot |G_{g(i)}|} (h^{(i)} - \mu_i \cdot 1^T). \quad (3)

The dimensions of the embeddings, however, are not naturally separable into hard group divisions. Thus, we create a soft version of the feature-group size |G_{g(i)}| via the correlation coefficient matrix \rho:

|G_{g(i)}| \longrightarrow \gamma_i = \sum_{j=1}^{d} \rho_{ij}^2. \quad (4)

This equation produces the same result as |G_{g(i)}| when our assumption holds for the real correlation matrix. Finally, our transformation can be written as:

\hat{h}^{(i)} = \frac{1}{\sigma_i \cdot \gamma_i} (h^{(i)} - \mu_i \cdot 1^T). \quad (5)

The major difference between our method and conventional batch normalization is the introduction of the \gamma term, as a way to explicitly consider the correlation between feature dimensions. As shown in our experiments, \gamma can greatly improve the isotropy of the embeddings. We name our proposed normalization method isotropic batch normalization (IsoBN), as it moves representations towards isotropy during fine-tuning.

IsoBN is applied right before the final classifier. In experiments, we find that a modified version of IsoBN achieves better performance. The mean \mu is highly unstable during training and does not affect the principal components of the representations, so we remove it in our implementation. The scaling term (\sigma \cdot \gamma)^{-1} has a small magnitude and damages the optimization of the training loss, so we re-normalize it to keep the total variance of the transformed embeddings the same as that of the original ones. We further introduce a hyper-parameter \beta, which controls the normalization strength, since the severity of the dominating-eigenvalue problem varies across datasets. The modified IsoBN is written as:

\theta_i = (\sigma_i \cdot \gamma_i + \epsilon)^{-\beta}, \quad (6)

\bar{\theta} = \sqrt{\frac{\sum_{i=1}^{d} \sigma_i^2}{\sum_{i=1}^{d} \sigma_i^2 \theta_i^2}} \cdot \theta, \quad (7)

\hat{h} = \bar{\theta} \odot h. \quad (8)

In IsoBN, the calculation of the scaling factor relies on the covariance and standard-deviation statistics of the embedding. We keep two moving-average caches and update them during training, because the statistics estimated from a single batch are not accurate. The whole algorithm of IsoBN is shown in Algorithm 1.

Algorithm 1: IsoBN Transformation
  Input: embeddings h over a mini-batch B = {h_1, ..., h_m}; moving covariance \Sigma; moving standard deviation \sigma; momentum \alpha.
  Output: transformed embeddings \hat{h}; updated \Sigma, \sigma.
  if training then
    \mu_B = \frac{1}{m} \sum_{i=1}^{m} h_i
    \sigma_B = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (h_i - \mu_B)^2}
    \Sigma_B = \frac{1}{m} (h - \mu_B)^T (h - \mu_B)
    \sigma = \sigma + \alpha (\sigma_B - \sigma)
    \Sigma = \Sigma + \alpha (\Sigma_B - \Sigma)
  \rho = \Sigma / (\sigma \sigma^T)
  Compute \gamma by Eq. 4
  Compute the scaling factor \bar{\theta} by Eq. 6 and Eq. 7
  \hat{h} = \bar{\theta} \odot h
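For concreteness, below is a minimal PyTorch-style sketch of the IsoBN transformation in Algorithm 1 (Eqs. 4, 6–8). It is our illustrative reading of the algorithm rather than the authors' released implementation; the module, buffer, and variable names are ours, and the mean term is dropped as described above.

```python
import torch
import torch.nn as nn

class IsoBN(nn.Module):
    """Sketch of isotropic batch normalization applied before the classifier."""

    def __init__(self, dim, momentum=0.95, eps=0.1, beta=0.5):
        super().__init__()
        self.momentum, self.eps, self.beta = momentum, eps, beta
        # Moving statistics (Algorithm 1): covariance and standard deviation.
        self.register_buffer("cov", torch.eye(dim))
        self.register_buffer("std", torch.ones(dim))

    def forward(self, h):                      # h: (batch, dim) [CLS] embeddings
        if self.training:
            with torch.no_grad():
                mu = h.mean(dim=0, keepdim=True)
                centered = h - mu
                # Exponential moving averages with momentum alpha.
                self.cov += self.momentum * (centered.t() @ centered / h.size(0) - self.cov)
                self.std += self.momentum * (centered.pow(2).mean(dim=0).sqrt() - self.std)

        # Correlation matrix rho = Sigma / (sigma sigma^T), then Eq. 4:
        # gamma_i = sum_j rho_ij^2 (soft size of the correlated group).
        rho = self.cov / (self.std.unsqueeze(0) * self.std.unsqueeze(1) + 1e-12)
        gamma = rho.pow(2).sum(dim=1)

        # Eq. 6: per-dimension scaling with normalization strength beta.
        theta = (self.std * gamma + self.eps).pow(-self.beta)
        # Eq. 7: rescale so the total variance matches the original embeddings.
        var = self.std.pow(2)
        theta = theta * torch.sqrt(var.sum() / (var * theta.pow(2)).sum())
        # Eq. 8: element-wise scaling (the mean term is omitted, as in the paper).
        return h * theta

# Usage sketch: normalize [CLS] features before a task-specific classifier.
isobn = IsoBN(dim=768)
classifier = nn.Linear(768, 2)
cls_batch = torch.randn(16, 768)               # placeholder [CLS] batch
logits = classifier(isobn(cls_batch))
```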
Evaluation

In this section, we first present the setup of our experiments (i.e., the datasets, frameworks, and hyper-parameters), then discuss the empirical results, and finally evaluate the isotropy gain through the lens of explained variance.

Experiment Setup

Our implementation of PTLMs is based on HuggingFace Transformers (Wolf et al. 2019). The models are fine-tuned with the AdamW (Loshchilov and Hutter 2019) optimizer using a learning rate in {1×10^-5, 2×10^-5, 5×10^-5} and a batch size in {16, 32}. The learning rate is scheduled by a linear warm-up (Goyal et al. 2017) for the first 6% of steps, followed by a linear decay to 0. The maximum number of training epochs is set to 10. For IsoBN, the momentum \alpha is set to 0.95, \epsilon is set to 0.1, and the normalization strength \beta is chosen from {0.25, 0.5, 1}.

We apply early stopping according to task-specific metrics on the dev set, and select the best combination of hyper-parameters on the dev set. We fine-tune the PTLMs with 5 different random seeds and report the median and standard deviation of the metrics on the dev set.

Experimental Results

We evaluate IsoBN on two PTLMs (BERT-base-cased and RoBERTa-large) and seven NLU tasks from the GLUE benchmark (Wang et al. 2019b). The experimental results are shown in Table 1. Using IsoBN improves the evaluation metrics on all datasets. The average score increases by 1% for BERT-base and 0.8% for RoBERTa-large. For small datasets (MRPC, RTE, CoLA, and STS-B (Cer et al. 2017)), IsoBN obtains an average performance improvement of 1.6% on BERT and 1.3% on RoBERTa. For large datasets (MNLI (Williams, Nangia, and Bowman 2018), QNLI (Rajpurkar et al. 2016), and SST-2 (Socher et al. 2013)), IsoBN obtains an average performance improvement of 0.15% on BERT and 0.25% on RoBERTa. This experiment shows that by improving the isotropy of the embeddings, IsoBN leads to better fine-tuning performance.

Method            | Avg   | MNLI        | QNLI       | RTE         | SST-2      | MRPC       | CoLA        | STS-B
BERT-base (ReImp) | 81.37 | 83.83 (.07) | 90.82 (.1) | 67.87 (1.1) | 92.43 (.7) | 85.29 (.9) | 60.72 (1.4) | 88.64 (.7)
BERT-base-IsoBN   | 82.36 | 83.91 (.1)  | 91.04 (.1) | 70.75 (1.6) | 92.54 (.1) | 87.50 (.6) | 61.59 (1.6) | 89.19 (.7)
RoBERTa-L (ReImp) | 88.16 | 90.48 (.07) | 94.70 (.1) | 84.47 (1.0) | 96.33 (.3) | 90.68 (.9) | 68.25 (1.1) | 92.24 (.2)
RoBERTa-L-IsoBN   | 88.98 | 90.69 (.05) | 94.91 (.1) | 87.00 (1.3) | 96.67 (.3) | 91.42 (.8) | 69.70 (.8)  | 92.51 (.2)

Table 1: Empirical results on the dev sets of seven GLUE tasks. We run each setting 5 times with different random seeds and report the median and standard deviation. IsoBN outperforms the conventional fine-tuning method by around 1.0 point of absolute increment.

EV1 / EV2 / EV3 | MRPC               | RTE                | CoLA               | STS-B
BERT-base       | 0.76 / 0.87 / 0.89 | 0.88 / 0.93 / 0.95 | 0.49 / 0.58 / 0.64 | 0.89 / 0.94 / 0.96
BERT-base+BN    | 0.74 / 0.84 / 0.86 | 0.70 / 0.89 / 0.93 | 0.37 / 0.59 / 0.63 | 0.69 / 0.88 / 0.92
BERT-base+IsoBN | 0.37 / 0.68 / 0.77 | 0.49 / 0.72 / 0.85 | 0.25 / 0.37 / 0.48 | 0.41 / 0.69 / 0.85
RoBERTa-L       | 0.86 / 0.90 / 0.91 | 0.53 / 0.66 / 0.70 | 0.83 / 0.88 / 0.90 | 0.87 / 0.90 / 0.92
RoBERTa-L+BN    | 0.64 / 0.73 / 0.76 | 0.36 / 0.50 / 0.57 | 0.61 / 0.70 / 0.75 | 0.65 / 0.72 / 0.77
RoBERTa-L+IsoBN | 0.18 / 0.36 / 0.43 | 0.15 / 0.29 / 0.37 | 0.21 / 0.38 / 0.49 | 0.17 / 0.32 / 0.45

Table 2: The explained variance on BERT-base and RoBERTa-large. Compared to batch normalization, our method can greatly reduce the explained variance and thus improve the isotropy of embeddings.
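The EV1/EV2/EV3 values reported in Table 2 follow the explained-variance metric defined formally in Eq. 9 of the next subsection. For reference, a minimal sketch (our illustration; array names are placeholders, not the authors' code) of computing EVk from a matrix of [CLS] embeddings:

```python
import numpy as np

def explained_variance(cls_embeddings: np.ndarray, k: int) -> float:
    """EV_k = (sum of the top-k squared singular values) divided by the sum of
    all squared singular values of the (N, d) [CLS] embedding matrix (Eq. 9)."""
    singular_values = np.linalg.svd(cls_embeddings, compute_uv=False)
    squared = singular_values ** 2
    return float(squared[:k].sum() / squared.sum())

# Placeholder embeddings; in the paper these come from a PTLM's last layer.
cls_embeddings = np.random.randn(500, 768)
print([round(explained_variance(cls_embeddings, k), 3) for k in (1, 2, 3)])
```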

Experiments with the EVk Metric

To quantitatively measure the isotropy of embeddings, we propose to use the explained variance (EV) as the metric for isotropy, which is defined as:

\mathrm{EV}_k(h) = \frac{\sum_{i=1}^{k} \lambda_i^2}{\sum_{j=1}^{d} \lambda_j^2}, \quad (9)

where h \in R^{N \times d} is the matrix of [CLS] embeddings and \lambda_i is the i-th largest singular value of the matrix h. Note that N is the number of sentences in a certain corpus, and d is the dimension of the hidden states in the last layer of a pre-trained language model.

This metric measures the difference of variance in different directions of the embedding space. Intuitively, if the EVk value is small, the variations of the embeddings tend to distribute equally in all directions and lead to a more angularly symmetric representation. If the EVk value is large, most of the variation concentrates on the first few directions, and the embedding space degrades to a narrow cone. Thus, EVk is a good metric of the isotropy of an embedding space.

We use EVk as the isotropy metric because it enjoys two beneficial properties:
• It is invariant to the magnitude of the embeddings, so comparisons between different models and datasets are more fair.
• It is also invariant to the mean value of the embeddings, aligning with the sentence classification/regression tasks of our interest.

We compute the EVk metric on two PTLMs (BERT-base and RoBERTa-large) and 4 tasks (MRPC, RTE, CoLA, STS-B). For IsoBN, the normalization strength \beta is 1. We show the first three EVk values (EV1, EV2, and EV3) in Table 2.

We observe that before normalization, the pre-trained [CLS] embeddings have very high EVk values. The average EV3 value is around 0.86 for both BERT-base and RoBERTa-large. For some datasets (e.g., STS-B), the top three principal components already explain over 90% of the variance. Batch normalization can only reduce the EVk value by a small margin (0.025 for BERT and 0.148 for RoBERTa on EV3), because it ignores the correlations among embedding dimensions. Our proposed IsoBN greatly reduces the EVk value (by 0.123 for BERT and 0.425 for RoBERTa on EV3).

We also visualize the distribution of EVk values in Figure 7. We choose the first 50 EVk values for BERT-base and the first 200 EVk values for RoBERTa-large. We observe that with IsoBN, we can decrease the EVk values of the pre-trained embeddings. This experiment shows that compared to batch normalization, IsoBN can further improve the isotropy of the [CLS] embeddings of PTLMs.

Related Work

Input Normalization

Normalizing inputs (Montavon and Müller 2012; He et al. 2016; Szegedy et al. 2017) and activations (Schraudolph 1998; Bjorck et al. 2018) has been known to be beneficial for training deep neural networks. Batch normalization (Ioffe and Szegedy 2015) normalizes neural activations to zero mean and unit variance using batch statistics (mean and standard deviation). It has been both empirically and theoretically shown that batch normalization can greatly smooth the loss landscape (Santurkar et al. 2018; Bjorck et al. 2018; Ghorbani, Krishnan, and Xiao 2019), which leads to faster convergence and better generalization. One drawback of batch normalization is that it ignores the correlations between input dimensions. Some methods (Huang et al. 2018, 2019) seek to calculate the full whitening transformation, while IsoBN greatly simplifies the process by using the block-diagonal assumption, and succeeds in reducing the effect of dominating principal components by simple scaling of the input dimensions.

Figure 7: The EVk values on BERT-base and RoBERTa-large with no normalization, batch normalization, and IsoBN, on MRPC, RTE, CoLA, and STS-B. We choose the maximum K to be 50 for BERT and 200 for RoBERTa. Compared to batch normalization, IsoBN greatly reduces the EVk values and thus improves the isotropy of the [CLS] embeddings.

Fine-tuning Language Models

Pre-trained language models (Devlin et al. 2019; Liu et al. 2019b) achieve state-of-the-art performance on various natural language understanding tasks. To adapt a pre-trained language model to target tasks, the common practice, in which a task-specific classifier is added to the network and jointly trained with the PTLM, already produces good results (Peters, Ruder, and Smith 2019). Some works show that with well-designed fine-tuning strategies, model performance can be further improved, by adversarial training (Zhu et al. 2020; Jiang et al. 2020), gradual unfreezing (Howard and Ruder 2018; Peters, Ruder, and Smith 2019), or multi-tasking (Clark et al. 2019; Liu et al. 2019a). To our best knowledge, this is the first work to study the isotropy of text representations for fine-tuning.

Spectral Control

The spectrum of representations has been studied in many subareas. On pre-trained word embeddings like GloVe (Pennington, Socher, and Manning 2014), it has been shown that removing the top principal component leads to better performance on text similarity tasks (Mu and Viswanath 2018). On contextual embeddings like ELMo (Peters et al. 2018) and GPT (Radford 2018), the spectrum is used as a measure of the ability to capture contextual knowledge (Ethayarajh 2019). On text generation, it is shown that large eigenvalues hurt the expressiveness of the representation and cause the degradation problem (Gao et al. 2019; Wang et al. 2020). In contrast, we study the spectrum of the [CLS] embedding in pre-trained language models, and show that controlling the top principal components can improve fine-tuning performance.

Conclusion

Our major contributions in this paper are two-fold:
• We study the isotropy of the pre-trained [CLS] embeddings. Our analysis is based on straightforward visualization of the standard deviation and correlation coefficient.
• The proposed regularization method, IsoBN, stably improves the fine-tuning of BERT and RoBERTa towards more isotropic representations, yielding an absolute increment of around 1.0 point on 7 popular NLU tasks.

We hope our work points to interesting future research directions in improving the pre-training of language models as well as better fine-tuning towards more isotropy of PTLMs.

References

Arora, S.; Liang, Y.; and Ma, T. 2017. A Simple but Tough-to-Beat Baseline for Sentence Embeddings. In ICLR 2017.
Bjorck, J.; Gomes, C. P.; Selman, B.; and Weinberger, K. Q. 2018. Understanding Batch Normalization. In NeurIPS 2018, 7705–7716.
Cer, D.; Diab, M.; Agirre, E.; Lopez-Gazpio, I.; and Specia, L. 2017. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In SemEval-2017, 1–14.
Clark, K.; Luong, M.-T.; Khandelwal, U.; Manning, C. D.; and Le, Q. V. 2019. BAM! Born-Again Multi-Task Networks for Natural Language Understanding. In ACL 2019, 5931–5937.
Cogswell, M.; Ahmed, F.; Girshick, R. B.; Zitnick, L.; and Batra, D. 2016. Reducing Overfitting in Deep Networks by Decorrelating Representations. In ICLR 2016.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT 2019, 4171–4186.
Dodge, J.; Ilharco, G.; Schwartz, R.; Farhadi, A.; Hajishirzi, H.; and Smith, N. A. 2020. Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping. ArXiv abs/2002.06305.
Ethayarajh, K. 2019. How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings. In EMNLP-IJCNLP 2019, 55–65.
Gao, J.; He, D.; Tan, X.; Qin, T.; Wang, L.; and Liu, T. 2019. Representation Degeneration Problem in Training Natural Language Generation Models. In ICLR 2019.
Ghorbani, B.; Krishnan, S.; and Xiao, Y. 2019. An Investigation into Neural Net Optimization via Hessian Eigenvalue Density. In ICML 2019, 2232–2241.
Goyal, P.; Dollár, P.; Girshick, R. B.; Noordhuis, P.; Wesolowski, L.; Kyrola, A.; Tulloch, A.; Jia, Y.; and He, K. 2017. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. ArXiv abs/1706.02677.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep Residual Learning for Image Recognition. In CVPR 2016, 770–778.
Howard, J.; and Ruder, S. 2018. Universal Language Model Fine-tuning for Text Classification. In ACL 2018, 328–339.
Huang, L.; Yang, D.; Lang, B.; and Deng, J. 2018. Decorrelated Batch Normalization. In CVPR 2018, 791–800.
Huang, L.; Zhou, Y.; Zhu, F.; Liu, L.; and Shao, L. 2019. Iterative Normalization: Beyond Standardization Towards Efficient Whitening. In CVPR 2019, 4874–4883.
Ioffe, S.; and Szegedy, C. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML 2015, 448–456.
Jiang, H.; He, P.; Chen, W.; Liu, X.; Gao, J.; and Zhao, T. 2020. SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization. In ACL 2020, 2177–2190.
Liu, X.; He, P.; Chen, W.; and Gao, J. 2019a. Multi-Task Deep Neural Networks for Natural Language Understanding. In ACL 2019, 4487–4496.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M. S.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L. S.; and Stoyanov, V. 2019b. RoBERTa: A Robustly Optimized BERT Pretraining Approach. ArXiv abs/1907.11692.
Loshchilov, I.; and Hutter, F. 2019. Decoupled Weight Decay Regularization. In ICLR 2019.
Montavon, G.; and Müller, K.-R. 2012. Deep Boltzmann Machines and the Centering Trick. In Neural Networks: Tricks of the Trade.
Mu, J.; and Viswanath, P. 2018. All-but-the-Top: Simple and Effective Postprocessing for Word Representations. In ICLR 2018.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12: 2825–2830.
Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global Vectors for Word Representation. In EMNLP 2014, 1532–1543.
Peters, M.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep Contextualized Word Representations. In NAACL-HLT 2018, 2227–2237.
Peters, M. E.; Ruder, S.; and Smith, N. A. 2019. To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks. In RepL4NLP 2019, 7–14.
Radford, A. 2018. Improving Language Understanding by Generative Pre-Training.
Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In EMNLP 2016, 2383–2392.
Rosipal, R.; Girolami, M.; Trejo, L. J.; and Cichocki, A. 2001. Kernel PCA for Feature Extraction and De-noising in Nonlinear Regression. Neural Computing & Applications 10(3): 231–243.
Santurkar, S.; Tsipras, D.; Ilyas, A.; and Madry, A. 2018. How Does Batch Normalization Help Optimization? In NeurIPS 2018, 2488–2498.
Schraudolph, N. N. 1998. Accelerated Gradient Descent by Factor-Centering Decomposition.
Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C. D.; Ng, A.; and Potts, C. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In EMNLP 2013, 1631–1642.
Szegedy, C.; Ioffe, S.; Vanhoucke, V.; and Alemi, A. A. 2017. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In AAAI 2017, 4278–4284.
Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. R. 2019a. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In ICLR 2019.
Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. R. 2019b. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In ICLR 2019.
Wang, L.; Huang, J.; Huang, K.; Hu, Z.; Wang, G.; and Gu, Q. 2020. Improving Neural Language Generation with Spectrum Control. In ICLR 2020.
Williams, A.; Nangia, N.; and Bowman, S. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In NAACL-HLT 2018, 1112–1122.
Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; and Brew, J. 2019. HuggingFace's Transformers: State-of-the-art Natural Language Processing. ArXiv abs/1910.03771.
Zhu, C.; Cheng, Y.; Gan, Z.; Sun, S.; Goldstein, T.; and Liu, J. 2020. FreeLB: Enhanced Adversarial Training for Natural Language Understanding. In ICLR 2020.
