arXiv:2005.02178v2 [cs.CL] 4 Feb 2021
IsoBN: Fine-Tuning BERT with Isotropic Batch Normalization

Wenxuan Zhou, Bill Yuchen Lin, Xiang Ren
Department of Computer Science, University of Southern California, Los Angeles, CA
{zhouwenx, yuchen.lin, [email protected]

Abstract

Fine-tuning pre-trained language models (PTLMs), such as BERT and its better variant RoBERTa, has been a common practice for advancing performance in natural language understanding (NLU) tasks. Recent advances in representation learning show that isotropic (i.e., unit-variance and uncorrelated) embeddings can significantly improve performance on downstream tasks with faster convergence and better generalization. The isotropy of the pre-trained embeddings in PTLMs, however, is relatively under-explored. In this paper, we analyze the isotropy of the pre-trained [CLS] embeddings of PTLMs with straightforward visualization, and point out two major issues: high variance in their standard deviation, and high correlation between different dimensions. We also propose a new network regularization method, isotropic batch normalization (IsoBN), to address these issues. IsoBN learns more isotropic representations in fine-tuning by dynamically penalizing dominating principal components. This simple yet effective fine-tuning method yields about 1.0 absolute increment on the average of seven NLU tasks.

Figure 1: Illustration of the isotropic batch normalization (IsoBN). The [CLS] embedding is normalized by standard deviation and pairwise correlation coefficient to get a more isotropic representation.

1 Introduction

Pre-trained language models (PTLMs), such as BERT (Devlin et al. 2019) and RoBERTa (Liu et al. 2019b), have revolutionized the area of natural language understanding (NLU). Fine-tuning PTLMs has advanced performance on many benchmark NLU datasets such as GLUE (Wang et al. 2018a). The most common fine-tuning method is to continue training pre-trained model parameters together with a few additional task-specific layers. The PTLMs and task-specific layers are usually connected by the embeddings of [CLS] tokens, which are regarded as sentence representations.

Recent works on text representation (Arora, Liang, and Ma 2016; Mu, Bhat, and Viswanath 2018; Gao et al. 2019; Wang et al. 2020) have shown that regularizing word embeddings to be more isotropic (i.e., rotationally invariant) can significantly improve their performance on downstream tasks. An ideally isotropic embedding space has two major merits: a) all dimensions have the same variance and b) all dimensions are uncorrelated with each other. These findings align with conventional feature normalization techniques (Cogswell et al. 2015; Ioffe and Szegedy 2015; Huang et al. 2018), which aim to transform input features into normalized¹, uncorrelated representations for faster convergence and better generalization ability.

¹ We use 'normalized' to refer to unit variance in this paper.

It remains an open question, however, how isotropic the representations of PTLMs are. In particular, we want to understand the isotropy of pre-trained [CLS] embeddings in PTLMs, and how we can improve it towards better fine-tuning for downstream tasks. In this paper, we first argue why we want more isotropic embeddings for the [CLS] tokens (Section 2). Our analysis reveals that, due to the lack of isotropy, the dominating principal components largely hinder the fine-tuning process from using knowledge in other components. Then, we analyze the isotropy of the pre-trained [CLS] embeddings. There are two essential aspects of an isotropic embedding space: unit variance and uncorrelatedness. Thus, we start our analysis by visualizing the standard deviation and Pearson correlation coefficient of pre-trained [CLS] embeddings in BERT and RoBERTa on several NLU datasets.

Our visualization and quantitative analysis in Section 3 find that: 1) the dimensions of the [CLS] embeddings have very different variances (Sec. 3.1); 2) the [CLS] embeddings form a few large clusters of dimensions that are highly correlated with each other (Sec. 3.2). Both findings indicate that pre-trained contextualized word embeddings are far from being isotropic, i.e., normalized and uncorrelated. Therefore, this undesired prior bias from PTLMs may result in sub-optimal performance in fine-tuning for target tasks.

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 2: Average cosine similarity and L2 distance of the fine-tuned weight $W$ to the initialized weight $W_{\text{init}}$ during the entire training process. Both measures suggest that the change of weight is very subtle. [Panels: $\cos(W_{\text{init}}, W)$ and $\lVert W_{\text{init}} - W \rVert_2 / \lVert W_{\text{init}} \rVert_2$ versus training step, on CoLA, MRPC, and RTE.]

Given that pre-trained [CLS] embeddings are very anisotropic, a natural research question is then: how can we regularize the fine-tuning process towards more isotropic embeddings? There are two common methods for improving the isotropy of feature representations: whitening transformation and batch normalization (Ioffe and Szegedy 2015). However, neither is practically suitable in the scenario of fine-tuning PTLMs. Whitening transformation requires calculating the inverse of the covariance matrix, which is ill-conditioned for PTLMs' embeddings. Calculating the inverse is thus numerically unstable, computationally expensive, and incompatible with half-precision training. Batch normalization is proposed to alleviate the inverse-computation issue by assuming that the covariance matrix is diagonal, which in turn completely ignores the influence of correlation between dimensions.
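The trade-off between whitening and batch normalization can be illustrated on toy data. The following is a minimal sketch, not the paper's implementation: the synthetic features, the helper names `batch_standardize` and `abs_correlation`, and the `eps` value are all assumptions made for illustration. It shows that near-duplicate dimensions make the covariance matrix ill-conditioned (so inverting it is fragile), while per-dimension standardization yields unit variance but leaves the correlation untouched.

```python
import numpy as np

def batch_standardize(h, eps=1e-5):
    """Per-dimension standardization, i.e. the diagonal-covariance
    assumption behind batch normalization (no learned scale/shift)."""
    mean = h.mean(axis=0, keepdims=True)
    std = h.std(axis=0, keepdims=True)
    return (h - mean) / (std + eps)

def abs_correlation(h):
    """Absolute Pearson correlation matrix between feature dimensions."""
    return np.abs(np.corrcoef(h, rowvar=False))

rng = np.random.default_rng(0)
base = rng.normal(size=(512, 2))

# Hypothetical toy features (an assumption, not real [CLS] embeddings):
# dims 0 and 1 are near-duplicates; dim 2 is independent with a larger scale.
h = np.stack(
    [
        base[:, 0],
        base[:, 0] + 0.01 * rng.normal(size=512),
        10.0 * base[:, 1],
    ],
    axis=1,
)

# Whitening would need the inverse of this matrix; a large condition
# number signals that the inverse is numerically fragile.
cov_condition = np.linalg.cond(np.cov(h, rowvar=False))

# Standardization equalizes the variances but keeps the correlation.
h_bn = batch_standardize(h)
corr_after_bn = abs_correlation(h_bn)

print("condition number of Cov(h):", cov_condition)
print("std per dimension after standardization:", h_bn.std(axis=0))
print("|corr| between dims 0 and 1 after standardization:", corr_after_bn[0, 1])
```

In this toy setting the standardized features have unit variance in every dimension, yet dimensions 0 and 1 remain almost perfectly correlated, which is the gap that a correlation-aware normalization is meant to close.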
Motivated by the research question and limitations of existing works, we propose a new network regularization method, isotropic batch normalization (IsoBN), in Section 4. As shown in Figure 1, the proposed method is based on our observation that the embedding dimensions can be seen as several groups of highly correlated dimensions. Our intuition is thus to assume that the absolute correlation coefficient matrix is a block-diagonal binary matrix, instead of only a diagonal matrix. Dimensions in the same group have an absolute correlation coefficient of 1 (duplicates of each other), and dimensions in different groups have a coefficient of 0 (uncorrelated).

This method greatly reduces the computational effort of calculating the inverse, and better models the characteristics of the pre-trained [CLS] embeddings. Our experiments (Sec. 5) show that IsoBN indeed improves both BERT and RoBERTa in fine-tuning, yielding about 1.0 absolute increment on average over 7 GLUE benchmark tasks. We also empirically analyze the isotropy increment brought by IsoBN via explained variance, which clearly shows that IsoBN produces much more isotropic embeddings than the conventional batch normalization method.

To the best of our knowledge, this work is the first to study the isotropy of the pre-trained [CLS] embeddings. We believe our findings and the proposed IsoBN method will inspire interesting future research directions in improving pre-trained language models as well as in fine-tuning PTLMs towards more isotropy.

2 Why isotropic [CLS] embeddings?

We formally show our analysis on the principal components of the [CLS] embeddings. Our findings reveal that with random weight initialization, the first few principal components usually account for the majority of contributions to the prediction results.

Background knowledge. For text classification, the input text [...]

[...] where $W \in \mathbb{R}^{d \times c}$ is a randomly initialized learnable parameter. It learns to map the underlying features extracted by PTLMs into target classes for input examples.

Previous work (Dodge et al. 2020) has shown that the initialized weight $W_{\text{init}}$ of the classifier has a large impact on model performance. We further find that the final converged weights remain nearly identical to the initialization after fine-tuning. We visualize this surprising phenomenon in Figure 2. We first project both $W_{\text{init}}$ and $W$ onto the subspace spanned by the top 10 eigenvectors of $\mathrm{Cov}(h)$ to remove the unimportant components, then use two similarity metrics (cosine similarity and L2 distance) to measure the difference between the initialized weight $W_{\text{init}}$ and the fine-tuned weight $W$. We observe that the cosine similarity between $W_{\text{init}}$ and $W$ is extremely close to 1, and their L2 distance is close to 0. This suggests that the weight $W$ of the classifier in the fine-tuned model is almost determined at initialization. For example, there is a 0.9997 cosine similarity between initialized and fine-tuned weights on CoLA with RoBERTa.

Dominating Principal Components. Knowing that the weight of the classifier is nearly fixed, we infer that the classifier may not capture the discriminative information for classification during fine-tuning. We measure the informativeness of each principal component by comparing the variance of logits it produces under the fine-tuned classifier and under the optimal classifier. Specifically, we fit a logistic regression model on the entire training data using the scikit-learn (Pedregosa et al. 2011) framework to get the optimal classifier. The variance $\mathrm{Var}_i$ of logits by the $i$-th principal component $(w_i, v_i)$ is calculated by $\mathrm{Var}_i = w_i \cdot (W^{\top} v_i)^2$, where $w_i$ and $v_i$ are the $i$-th eigenvalue and eigenvector. Intuitively, a decent classifier should maximize the variance along informative principal components and minimize it along irrelevant ones. We show the average proportion of variance in Figure 3.
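The two measurements used in this section, the similarity between initialized and fine-tuned classifier weights in the top eigenvector subspace and the per-component logit variance $\mathrm{Var}_i = w_i \cdot (W^{\top} v_i)^2$, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names, the random toy data standing in for [CLS] features and classifier weights, the choice to compute cosine similarity over the flattened projected matrices, and the choice to sum the squared projection over classes are all assumptions.

```python
import numpy as np

def top_eigenpairs(h, k):
    """Top-k eigenvalues and eigenvectors of Cov(h); h is (n_samples, d)."""
    cov = np.cov(h, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1][:k]
    return eigvals[order], eigvecs[:, order]

def projected_similarity(w_init, w, basis):
    """Cosine similarity and relative L2 distance between two (d, c) weight
    matrices after projecting both onto the columns of `basis` (d, k)."""
    p_init = basis.T @ w_init
    p = basis.T @ w
    cos = np.sum(p_init * p) / (np.linalg.norm(p_init) * np.linalg.norm(p))
    rel_l2 = np.linalg.norm(p_init - p) / np.linalg.norm(p_init)
    return cos, rel_l2

def logit_variance(eigvals, eigvecs, w):
    """Var_i = w_i * (W^T v_i)^2, with the square summed over classes
    (the reduction over classes is an assumption of this sketch)."""
    proj = w.T @ eigvecs  # shape (c, k)
    return eigvals * np.sum(proj ** 2, axis=0)

rng = np.random.default_rng(0)
d, c, n, k = 16, 2, 256, 10

# Hypothetical stand-ins: anisotropic features and a barely changed classifier.
scales = np.linspace(5.0, 0.1, d)
h = rng.normal(size=(n, d)) * scales
w_init = rng.normal(size=(d, c))
w_finetuned = w_init + 1e-3 * rng.normal(size=(d, c))

eigvals, eigvecs = top_eigenpairs(h, k)
cos, rel_l2 = projected_similarity(w_init, w_finetuned, eigvecs)
var = logit_variance(eigvals, eigvecs, w_finetuned)
var_proportion = var / var.sum()

print("cosine similarity:", cos)
print("relative L2 distance:", rel_l2)
print("proportion of logit variance per component:", var_proportion)
```

With real data, `h` would hold the [CLS] features of the training set and the two weight matrices would come from the classifier before and after fine-tuning; comparing `var_proportion` against the same quantity for a separately fitted logistic regression would mirror the comparison described above.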