Learning Deep Architectures via Generalized Whitened Neural Networks

Ping Luo 1 2

Abstract

Whitened Neural Network (WNN) is a recent advanced deep architecture, which improves convergence and generalization of canonical neural networks by whitening their internal hidden representation. However, the whitening transformation increases computation time. Unlike WNN, which reduced runtime by performing whitening only once every thousand iterations and thereby degraded convergence due to ill conditioning, we present generalized WNN (GWNN), which has three appealing properties. First, GWNN is able to learn a compact representation to reduce computations. Second, it enables the whitening transformation to be performed in a short period, preserving good conditioning. Third, we propose a data-independent estimation of the covariance matrix to further improve computational efficiency. Extensive experiments on various datasets demonstrate the benefits of GWNN.

1. Introduction

Deep neural networks (DNNs) have improved the performance of many applications, as the non-linearity of DNNs provides expressive modeling capacity, but it also makes DNNs difficult to train and easy to overfit the training data. The whitened neural network (WNN) (Desjardins et al., 2015), a recent advanced deep architecture, is designed to alleviate these difficulties. WNN extends batch normalization (BN) (Ioffe & Szegedy, 2015) by normalizing the internal hidden representation with a whitening transformation instead of standardization. Whitening helps regularize each diagonal block of the Fisher Information Matrix (FIM) to be an approximation of the identity matrix. This is an appealing property, as training WNN with stochastic gradient descent (SGD) mimics the fast convergence of natural gradient descent (NGD) (Amari & Nagaoka, 2000). The whitening transformation also improves generalization. As demonstrated in (Desjardins et al., 2015), WNN exhibited superiority when applied to various network architectures, such as autoencoders and convolutional neural networks, outperforming many previous methods including SGD, RMSprop (Tieleman & Hinton, 2012), and BN.

Although WNN is able to reduce the number of training iterations and improve generalization, it comes at the price of increased training time, because eigen-decomposition requires heavy computation. The runtime grows as the number of hidden layers that require the whitening transformation increases. We revisit WNN by breaking down its runtime and show that it mainly stems from two sources: 1) computing the full covariance matrix for whitening and 2) solving the singular value decomposition (SVD). Previous work (Desjardins et al., 2015) suggests overcoming these problems by a) using a subset of the training data to estimate the full covariance matrix and b) solving the SVD only once every hundreds or thousands of training iterations. Both remedies rely on the assumption that the decomposition remains valid over this period, which is generally not true. When this period becomes large, WNN degenerates to canonical SGD due to ill conditioning of the FIM.

We propose generalized WNN (GWNN), which possesses the beneficial properties of WNN but significantly reduces its runtime and improves its generalization. We introduce two variants of GWNN, pre-whitening and post-whitening GWNN. The former whitens a hidden layer's input values, whilst the latter whitens the pre-activation values (hidden features). GWNN has three appealing characteristics. First, compared to WNN, GWNN is capable of learning a more compact hidden representation, such that the SVD can be approximated by a few top eigenvectors to reduce computation. This compact representation also improves generalization. Second, it enables the whitening transformation to be performed in a short period, maintaining good conditioning of the FIM. Third, by exploiting knowledge of the distribution of the hidden features, we calculate the covariance matrix in an analytical form to further improve computational efficiency.

1 Guangdong Provincial Key Laboratory of Computer Vision and Virtual Reality Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China. 2 Multimedia Laboratory, The Chinese University of Hong Kong, Hong Kong. Correspondence to: Ping Luo.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

[Figure 1 diagram: (a) an ordinary fully-connected (fc) layer with weight matrix W; (b) a whitened fc layer of WNN, which inserts a whitening matrix P before the whitened weight matrix; (c) a pre-whitening GWNN layer; (d) a post-whitening GWNN layer.]

Figure 1. Comparisons of different architectures. An ordinary fully-connected (fc) layer (a) can be adapted into (b) a whitened fc layer, (c) a pre-GWNN layer, and (d) a post-GWNN layer. (c) and (d) learn more compact representations than (b) does.

2. Notation and Background

We begin by defining the basic notation for a feed-forward neural network. A neural network transforms an input vector $o^0$ into an output vector $o^\ell$ through a series of hidden layers $\{o^i\}_{i=1}^{\ell}$. We assume each layer has identical dimension for simplicity of notation, i.e. $\forall o^i \in \mathbb{R}^{d\times 1}$. In this case, all vectors and matrixes in the following have $d$ rows unless otherwise stated. As shown in Fig.1 (a), each fully-connected (fc) layer consists of a weight matrix, $W^i$, and a set of hidden neurons, $h^i$, each of which receives as input a weighted sum of outputs from the previous layer. We have $h^i = W^i o^{i-1}$. In this work, we take the fully-connected network as an example. Note that the above computation can also be applied to a convolutional network, where an image patch is vectorized as a column vector represented by $o^{i-1}$, and each row of $W^i$ represents a filter.

As recent deep architectures typically stack a batch normalization (BN) layer before the pre-activation values, we do not explicitly include a bias term when computing $h^i$, because it is normalized in BN, such that $\phi^i = \frac{h^i - E[h^i]}{\sqrt{\mathrm{Var}[h^i]}}$, where the expectation and variance are computed over a minibatch of samples. LeCun et al. (2002) showed that such normalization speeds up convergence even when the hidden features are not decorrelated. Furthermore, the output of each layer is calculated by a nonlinear activation function. A popular choice is the rectified linear unit, $\mathrm{relu}(x) = \max(0, x)$. The precise computation for an output is $o^i = \max(0, \mathrm{diag}(\alpha^i)\phi^i + \beta^i)$, where $\mathrm{diag}(x)$ represents a matrix whose diagonal entries are $x$. $\alpha^i$ and $\beta^i$ are two vectors that scale and shift the normalized features, in order to maintain the network's representation capacity.

2.1. Whitened Neural Networks

This section revisits whitened neural networks (WNN). Any neural architecture can be adapted to a WNN by stacking a whitening transformation layer after the layer's input. For example, Fig.1 (b) adapts the fc layer shown in Fig.1 (a) into a whitened fc layer. Its information flow becomes
$$\tilde{o}^{i-1} = P^{i-1}(o^{i-1} - \mu^{i-1}), \quad \hat{h}^i = \hat{W}^i \tilde{o}^{i-1}, \quad \phi^i = \frac{\hat{h}^i}{\sqrt{\mathrm{Var}[\hat{h}^i]}}, \quad o^i = \max(0, \mathrm{diag}(\alpha^i)\phi^i + \beta^i), \qquad (1)$$
where $\mu^{i-1}$ represents a centering variable, $\mu^{i-1} = E[o^{i-1}]$. $P^{i-1}$ is a whitening matrix whose rows are obtained from the eigen-decomposition of $\Sigma^{i-1}$, the covariance matrix of the input, $\Sigma^{i-1} = E[(o^{i-1} - \mu^{i-1})(o^{i-1} - \mu^{i-1})^{\mathrm{T}}]$. The input is decorrelated by $P^{i-1}$ in the sense that its covariance matrix becomes an identity matrix, i.e. $E[\tilde{o}^{i-1}\tilde{o}^{i-1\mathrm{T}}] = I$. To avoid ambiguity, we use the hat notation to distinguish the variables of WNN from those of the canonical fc layer whenever necessary; for instance, $\hat{W}^i$ represents a whitened weight matrix. In Eqn.(1), the computation of the BN layer has been simplified because we have $E[\hat{h}^i] = \hat{W}^i P^{i-1}(E[o^{i-1}] - \mu^{i-1}) = 0$.

We define $\theta$ to be a vector consisting of all the whitened weight matrixes concatenated together, $\theta = \{\mathrm{vec}(\hat{W}^1)^{\mathrm{T}}, \mathrm{vec}(\hat{W}^2)^{\mathrm{T}}, ..., \mathrm{vec}(\hat{W}^\ell)^{\mathrm{T}}\}$, where $\mathrm{vec}(\cdot)$ is an operator that vectorizes a matrix by stacking its columns. Let $L(o^\ell, y; \theta)$ denote a loss function of WNN, which measures the disagreement between a prediction $o^\ell$ made by the network and a target $y$. WNN is trained by minimizing the loss function with respect to the parameter vector $\theta$ subject to two constraints,
$$\min_{\theta} L(o^\ell, y; \theta) \qquad (2)$$
$$\text{s.t.}\quad E[\tilde{o}^{i-1}\tilde{o}^{i-1\mathrm{T}}] = I, \quad h^i - E[h^i] = \hat{h}^i, \quad i = 1...\ell.$$

To satisfy the first constraint, $P^{i-1}$ is obtained by decomposing the covariance matrix, $\Sigma^{i-1} = U^{i-1}S^{i-1}U^{i-1\mathrm{T}}$. We choose $P^{i-1} = (S^{i-1})^{-\frac{1}{2}}U^{i-1\mathrm{T}}$, where $S^{i-1}$ is a diagonal matrix whose diagonal elements are eigenvalues and $U^{i-1}$ is an orthogonal matrix of eigenvectors. The first constraint then holds by construction of the eigen-decomposition.

The second constraint, $h^i - E[h^i] = \hat{h}^i$, enforces that the centered hidden features are the same before and after adapting a fc layer to WNN, as shown in Fig.1 (a) and (b). In other words, it ensures that their representation powers are identical. By combining the computations in Fig.1 (a) and Eqn.(1), the second constraint implies that $\|(h^i - E[h^i]) - \hat{h}^i\|_2^2 = \|(W^i o^{i-1} - W^i\mu^{i-1}) - \hat{W}^i\tilde{o}^{i-1}\|_2^2 = 0$, which has a closed-form solution, $\hat{W}^i = W^i(P^{i-1})^{-1}$. To see this, we have $\hat{h}^i = W^i(P^{i-1})^{-1}P^{i-1}(o^{i-1} - \mu^{i-1}) = W^i(o^{i-1} - \mu^{i-1}) = h^i - E[h^i]$. The representation capacity can therefore be preserved by mapping the whitened weight matrix from the ordinary weight matrix.

Conditioning of the FIM. Here we show that WNN improves training efficiency by conditioning the Fisher information matrix (FIM) (Amari & Nagaoka, 2000). A FIM, denoted as $F$, consists of $\ell\times\ell$ blocks. Each block is indexed by $F_{ij}$, representing the covariance (co-adaptation) between the whitened weight matrixes of the $i$-th and $j$-th layers. We have $F_{ij} = E[\mathrm{vec}(\delta\hat{W}^i)\mathrm{vec}(\delta\hat{W}^j)^{\mathrm{T}}]$, where $\delta\hat{W}^i$ indicates the gradient of the $i$-th whitened weight matrix. For example, the gradient of $\hat{W}^i$ is given by $\tilde{o}^{i-1}(\delta\hat{h}^i)^{\mathrm{T}}$, as illustrated in Eqn.(1). We have $\mathrm{vec}(\delta\hat{W}^i) = \mathrm{vec}(\tilde{o}^{i-1}(\delta\hat{h}^i)^{\mathrm{T}}) = \delta\hat{h}^i \otimes \tilde{o}^{i-1}$, where $\otimes$ denotes the Kronecker product. In this case, $F_{ij}$ can be rewritten as $E[(\delta\hat{h}^i \otimes \tilde{o}^{i-1})(\delta\hat{h}^j \otimes \tilde{o}^{j-1})^{\mathrm{T}}] = E[\delta\hat{h}^i(\delta\hat{h}^j)^{\mathrm{T}} \otimes \tilde{o}^{i-1}(\tilde{o}^{j-1})^{\mathrm{T}}]$. By assuming $\delta\hat{h}$ and $\tilde{o}$ are independent, as demonstrated in (Raiko et al., 2012), $F_{ij}$ can be approximated by $E[\delta\hat{h}^i(\delta\hat{h}^j)^{\mathrm{T}}] \otimes E[\tilde{o}^{i-1}(\tilde{o}^{j-1})^{\mathrm{T}}]$. As a result, when $i = j$, each diagonal block of $F$, $F_{ii}$, has a block diagonal structure, because $E[\tilde{o}^{i-1}(\tilde{o}^{i-1})^{\mathrm{T}}] = I$ as shown in Eqn.(2); this improves conditioning of the FIM and thus speeds up training. In general, WNN regularizes the diagonal blocks of the FIM and achieves stronger conditioning than methods (LeCun et al., 2002; Tieleman & Hinton, 2012) that regularize only the diagonal entries.

Algorithm 1 Training WNN
1: Init: initial network parameters θ, α^i, β^i; whitening matrix P^{i-1} = I; iteration t = 0; Ŵ^i_t = W^i; ∀i ∈ {1...ℓ}.
2: repeat
3:   for i = 1 to ℓ do
4:     update whitened weight matrix Ŵ^i_t and parameters α^i_t, β^i_t using SGD.
5:     if mod(t, τ) = 0 then
6:       store old whitening matrix P^{i-1}_o = P^{i-1}.
7:       construct new matrix P^{i-1} by eigen-decomposition of Σ^{i-1}, which is estimated using N samples.
8:       transform weight matrix Ŵ^i_t = Ŵ^i_t P^{i-1}_o (P^{i-1})^{-1}.
9:     end if
10:  end for
11:  t = t + 1.
12: until convergence

Training WNN. Alg.1 summarizes the training of WNN. At the 1st line, the whitened weight matrix $\hat{W}^i_0$ is initialized by $W^i$ of the ordinary fc layer, which can be pretrained or sampled from a Gaussian distribution. The 4th line shows that $\hat{W}^i_t$ is updated in each iteration $t$ using SGD. The first and second constraints are enforced in the 7th and 8th lines respectively; for example, the 8th line ensures that the hidden features are the same before and after updating the whitening matrix. As the distribution of the hidden representation changes after every update of the whitened weight matrix, the whitening matrix $P^{i-1}$ needs to be reconstructed frequently, by performing eigen-decomposition on $\Sigma^{i-1}$ estimated from $N$ samples, in order to maintain good conditioning of the FIM. $N$ is typically $10^4$ in experiments. However, this raw strategy increases computation time. Desjardins et al. (2015) performed whitening only once every $\tau$ iterations, as shown in the 5th line of Alg.1, to reduce computation, e.g. $\tau = 10^3$.
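To make the mapping of Eqn.(1) and the re-whitening step of Alg.1 concrete, the following NumPy sketch is our illustration rather than the authors' code; the function names, the small numerical epsilon, and the use of a plain eigen-decomposition (instead of a pretrained network's running statistics) are assumptions.

import numpy as np

def build_whitening(X):
    """X: (N, d) minibatch of layer inputs. Returns mu and P = S^(-1/2) U^T."""
    mu = X.mean(axis=0)                          # centering variable mu^{i-1}
    Sigma = np.cov(X - mu, rowvar=False)         # covariance Sigma^{i-1}
    S, U = np.linalg.eigh(Sigma)                 # eigenvalues S, eigenvectors U
    P = (U / np.sqrt(S + 1e-5)).T                # P = S^{-1/2} U^T (eps for stability)
    return mu, P

def whitened_layer(X, W, P, mu, alpha, beta):
    """Forward pass of Eqn.(1): whiten input, apply W_hat, BN scale/shift, relu."""
    W_hat = W @ np.linalg.inv(P)                 # closed-form W_hat = W (P)^{-1}
    o_tilde = (X - mu) @ P.T                     # decorrelated input o_tilde
    h_hat = o_tilde @ W_hat.T                    # hidden features, approximately zero mean
    phi = h_hat / np.sqrt(h_hat.var(axis=0) + 1e-5)
    return np.maximum(0.0, alpha * phi + beta)

def rewhiten(W_hat, P_old, P_new):
    """Line 8 of Alg.1: keep hidden features unchanged when P is refreshed."""
    return W_hat @ P_old @ np.linalg.inv(P_new)

The key design point is that rewhiten leaves the product Ŵ P (and hence the hidden features) unchanged, which is exactly what the second constraint of Eqn.(2) requires.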
[Figure 2 visualization: four covariance matrixes (a)-(d) shown as heatmaps with a shared colorbar from 0 to 1; their Pearson's correlations with the identity matrix are 0.34, 0.35, 0.65, and 0.95 respectively.]

Figure 2. Visualizations of different covariance matrixes (a)-(d), which have different Pearson's correlations (top) with respect to an identity matrix. A larger Pearson's correlation indicates higher orthogonality. (a,b) are sampled from a uniform distribution between 0 and 1. (c,d) are generated by truncating a random orthogonal matrix with different numbers of columns. The colorbar (right) indicates the value of each entry in these matrixes.

How good is the conditioning of the FIM obtained by Alg.1? We measure the similarity of the covariance matrix, $E[\tilde{o}^{i-1}(\tilde{o}^{i-1})^{\mathrm{T}}]$, to the identity matrix $I$; we call this measure the orthogonality. We employ Pearson's correlation$^1$ as the similarity between two matrixes. Intuitively, this measure takes values between $-1$ and $+1$, representing negative and positive correlations, and larger values indicate higher orthogonality. Fig.2 visualizes four randomly generated covariance matrixes, where (a,b) are sampled from a uniform distribution between 0 and 1, and (c,d) are generated by truncating different numbers of columns of a randomly generated orthogonal matrix. For instance, (a,b) have small similarity to the identity matrix. In contrast, when the correlation equals 0.65 as shown in (c), all diagonal entries are larger than 0.9 and more than 80% of the off-diagonal entries have values smaller than 0.1. Furthermore, Pearson's correlation is insensitive to the size of the matrix, so the orthogonality of different layers can be compared together. For example, although the matrixes in (a) and (b) have different sizes, they have similar orthogonality values because they are sampled from the same distribution.

$^1$ Given an identity matrix, $I$, and a covariance matrix, $\Sigma$, the Pearson's correlation between them is defined as $\mathrm{corr}(\Sigma, I) = \frac{\mathrm{vec}(\Sigma)^{\mathrm{T}}\mathrm{vec}(I)}{\sqrt{\mathrm{vec}(\Sigma)^{\mathrm{T}}\mathrm{vec}(\Sigma)\cdot\mathrm{vec}(I)^{\mathrm{T}}\mathrm{vec}(I)}}$, where $\mathrm{vec}(\Sigma)$ is normalized by subtracting the mean of all its entries.
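The measure in footnote 1 is easy to compute; the sketch below is ours, and centering both flattened matrixes (the footnote states the centering explicitly only for vec(Σ)) is our reading of the standard Pearson correlation.

import numpy as np

def orthogonality(Sigma):
    """Pearson correlation between vec(Sigma) and vec(I), per footnote 1."""
    d = Sigma.shape[0]
    a = Sigma.flatten() - Sigma.mean()
    b = np.eye(d).flatten() - 1.0 / d            # the identity has mean d/d^2 = 1/d
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

# Example: the covariance of decorrelated data scores near 1,
# while a generic random covariance matrix scores much lower.
rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 64))
print(orthogonality(np.cov(X, rowvar=False)))    # close to 1
A = rng.random((64, 64))
print(orthogonality(A @ A.T / 64))               # noticeably smaller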

As shown in Fig.3 (a), we adopt a network-in-network (NIN) (Lin et al., 2014) trained on CIFAR-10 (Krizhevsky, 2009), and plot the orthogonalities of three different convolutional layers, which are whitened every $\tau = 10^3$ iterations using Alg.1. We see that the orthogonality values fluctuate heavily during training, except those of the first convolutional layer, abbreviated as 'conv1'. This is because the distributions of the deeper layers' inputs change after the whitened weight matrixes have been updated, leading to ill conditioning of the whitening matrixes, which are estimated only at large intervals. In fact, a large $\tau$ degenerates WNN to canonical SGD. In contrast, 'conv1' takes image data as input, whose distribution is typically stable during training; its whitening matrix can be estimated once at the beginning and kept fixed for the entire training stage.

[Figure 3 plots: orthogonality versus training iterations (×10^3) of layers conv1, conv4, and conv7 for (a) WNN and (b) pre-GWNN.]

Figure 3. Comparisons of conditioning when training a network-in-network (Lin et al., 2014) on CIFAR-10 (Krizhevsky, 2009) by using (a) WNN and (b) pre-GWNN. Compared to (b), the orthogonalities of three different layers in (a) have large fluctuations due to the ill conditioning of whitening, which is performed with a large period $\tau$. As a result, when $\tau$ increases, WNN degenerates to canonical SGD.

In the section below, we present generalized whitened neural networks to improve conditioning of the FIM while reducing computation time.

3. Generalized Whitened Neural Networks

We present two types of generalized WNN (GWNN), pre-whitening and post-whitening GWNN. Both models share the beneficial properties of WNN, but have lower computation time.

3.1. Pre-whitening GWNN

This section introduces pre-whitening GWNN, abbreviated as pre-GWNN, which performs the whitening transformation before applying the weight matrix (i.e. it whitens the input), as illustrated in Fig.1 (c). When adapting a fc layer to pre-GWNN, the whitening matrix is truncated by removing the eigenvectors that have small eigenvalues, in order to learn a compact representation. This allows the input vector to vary its length, so as to gradually adapt the learned representation to informative patterns with high variations rather than noise. Learning pre-GWNN is formulated analogously to learning WNN in Eqn.(2), but with one additional constraint truncating the rank of the whitening matrix,
$$\min_{\theta} L(o^\ell, y; \theta) \qquad (3)$$
$$\text{s.t.}\quad \mathrm{rank}(P^{i-1}) \leq d', \quad E[\tilde{o}^{i-1}_{d'}\tilde{o}^{i-1\mathrm{T}}_{d'}] = I, \quad h^i - E[h^i] = \hat{h}^i, \quad i = 1...\ell.$$

Let $d$ be the dimension of the original fc layer. By combining Eqn.(2), we have $P^{i-1} = (S^{i-1})^{-\frac{1}{2}}U^{i-1\mathrm{T}} \in \mathbb{R}^{d\times d}$, which is truncated to $P^{i-1}_{d'} = (S^{i-1}_{d'})^{-\frac{1}{2}}U^{i-1\mathrm{T}}_{d'} \in \mathbb{R}^{d'\times d}$, where $S^{i-1}_{d'}$ is obtained by keeping the rows and columns associated with the $d'$ largest eigenvalues, whilst $U^{i-1}_{d'}$ contains the corresponding $d'$ eigenvectors. The value of $d'$ can be tuned using a validation set. For simplicity, we choose $d' = \frac{d}{2}$, which works well throughout our experiments. This is inspired by the finding in (Zhang et al., 2015), who disclosed that the first 50% of eigenvectors contribute over 95% of the energy in a deep model.

More specifically, pre-GWNN first projects an input vector into a $d'$-dimensional space, $\tilde{o}^{i-1}_{d'} = P^{i-1}_{d'}(o^{i-1} - \mu^{i-1}) \in \mathbb{R}^{d'\times 1}$. The whitened weight matrix then produces a hidden feature vector of $d$ dimensions, which has the same length as in the ordinary fc layer, i.e. $\hat{h}^i = \hat{W}^i\tilde{o}^{i-1}_{d'} \in \mathbb{R}^{d\times 1}$, where $\hat{W}^i = W^i(P^{i-1}_{d'})^{-1} \in \mathbb{R}^{d\times d'}$. The computations of BN and the nonlinear activation are identical to Eqn.(1).

Training pre-GWNN is similar to Alg.1. The main modification is made at the 7th line in order to reduce runtime. Although Alg.1 decreases the number of iterations until training converges, each iteration incurs additional computation time for eigen-decomposition. For example, in WNN, the computation required for the full singular value decomposition (SVD) is typically $O(Nd^2)$, where $N$ represents the number of samples employed to estimate the covariance matrix. In particular, when we have $\ell$ whitened layers and $T$ training iterations, all whitening transformations occupy $O(\frac{Nd^2T\ell}{\tau})$ runtime over the entire training stage. In contrast, pre-GWNN performs an online estimation of the top $d'$ eigenvectors in $P^{i-1}_{d'}$, such as online SVD (Shamir, 2015; Povey et al., 2015), instead of the full SVD used in WNN. This reduces the runtime to $O(\frac{(N+M)d'T\ell}{\tau'})$, where $\tau'$ represents the whitening interval in GWNN and $M$ is the number of samples used to estimate the top eigenvectors. We have $M = N$ as employed in previous works.
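A minimal sketch of the truncated whitening used by pre-GWNN, assuming a plain offline eigen-decomposition in place of the online SVD mentioned above and using the pseudo-inverse for the non-square $(P^{i-1}_{d'})^{-1}$; both choices, and all names below, are our illustration rather than the paper's implementation.

import numpy as np

def truncated_whitening(X, d_prime):
    """Keep the top d' eigen-directions: P_d' = S_d'^(-1/2) U_d'^T, shape (d', d)."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X - mu, rowvar=False)
    S, U = np.linalg.eigh(Sigma)                 # ascending eigenvalues
    idx = np.argsort(S)[::-1][:d_prime]          # indices of the d' largest eigenvalues
    P_d = (U[:, idx] / np.sqrt(S[idx] + 1e-5)).T
    return mu, P_d

def pre_gwnn_layer(X, W, mu, P_d, alpha, beta):
    """Pre-GWNN forward: project to d' dimensions, then recover d-dimensional features."""
    W_hat = W @ np.linalg.pinv(P_d)              # W_hat = W (P_d')^{-1}, shape (d, d')
    o_tilde = (X - mu) @ P_d.T                   # compact whitened input, (N, d')
    h_hat = o_tilde @ W_hat.T                    # hidden features, (N, d)
    phi = h_hat / np.sqrt(h_hat.var(axis=0) + 1e-5)
    return np.maximum(0.0, alpha * phi + beta)

# e.g. d = 192 and d_prime = 96 halves the whitening dimension, as in the paper.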

For pre-GWNN, reducing runtime and improving conditioning is a tradeoff, since the former requires increasing $\tau'$ whilst the latter requires decreasing it. When $M = N$ and $d' = \frac{d}{2}$, comparing the runtime complexity of pre-GWNN to that of WNN gives a ratio of $\frac{d\tau'}{\tau}$, which tells us that whitening can be performed over a short interval without increasing runtime. For instance, as shown in Fig.3 (b), when $\tau' = 20$ the orthogonalities are well preserved and more stable than those in (a). In this case, pre-GWNN reduces the computation of WNN by at least 20$\times$ when $d > \tau$, which is a typical regime in recent deep architectures (Krizhevsky et al., 2012; Lin et al., 2014), where $d \in \{1024, 2048, 4096\}$.

3.2. Post-whitening GWNN

Another variant we propose is post-whitening GWNN, abbreviated as post-GWNN. Unlike WNN and pre-GWNN, post-GWNN performs the whitening transformation after applying the weight matrix (i.e. it whitens the features), as illustrated in Fig.1 (d). In general, post-GWNN reduces the runtime to $O(\frac{(N'+M)d'T\ell}{\tau'})$, where $N' \ll N$.

Fig.1 (d) shows how to adapt a fc layer to post-GWNN. Suppose $o^{i-1}_{d'}$ has been whitened by $P^{i-1}_{d'}$ in the previous layer; at the $i$-th layer we have
$$\hat{h}^i = \hat{W}^i(o^{i-1}_{d'} - \mu^{i-1}_{d'}), \quad h^i_{d'} = P^i_{d'}\hat{h}^i, \quad \phi^i_{d'} = \frac{h^i_{d'}}{\sqrt{\mathrm{Var}[h^i_{d'}]}}, \quad o^i_{d'} = \max(0, \mathrm{diag}(\alpha^i_{d'})\phi^i_{d'} + \beta^i_{d'}), \qquad (4)$$
where $\mu^{i-1}_{d'} = E[o^{i-1}_{d'}]$. In Eqn.(4), a feature vector $\hat{h}^i \in \mathbb{R}^{d\times 1}$ is first produced by applying a whitened weight matrix to the input, in order to recover the original feature length of the fc layer. A whitening matrix then projects $\hat{h}^i$ to a decorrelated feature vector $h^i_{d'} \in \mathbb{R}^{d'\times 1}$. We have $\hat{W}^i = W^i(P^{i-1}_{d'})^{-1} \in \mathbb{R}^{d\times d'}$, where $P^{i-1}_{d'} = (S^{i-1}_{d'})^{-\frac{1}{2}}U^{i-1\mathrm{T}}_{d'} \in \mathbb{R}^{d'\times d}$, and $U$ and $S$ contain the eigenvectors and eigenvalues of the hidden features at the $(i-1)$-th layer.

Algorithm 2 Training post-GWNN
1: Init: initial θ, α^i, β^i; t = 0, t_w = k, λ = t_w/k; P^{i-1} = I, Ŵ^i_t = W^i, ∀i ∈ {1...ℓ}.
2: repeat
3:   for i = 1 to ℓ do
4:     update Ŵ^i_t, α^i_t, and β^i_t by SGD.
5:     if mod(t, τ) = 0 then
6:       store old P^{i-1}_o = P^{i-1}_{d'}.
7:       estimate the mean and variance of ĥ^i with a minibatch of N' samples, or following Remark 2 when N' = 1.
8:       update P^{i-1}_{d'} by online SVD.
9:       transform Ŵ^i_t = Ŵ^i_t P^{i-1}_o (P^{i-1}_{d'})^{-1}.
10:      t_w = 1 and λ = t_w/k.
11:    end if
12:  end for
13:  t = t + 1.
14:  if t_w < k then t_w = t_w + 1 end if
15: until convergence
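A sketch of the post-GWNN forward pass in Eqn.(4), again with an offline decomposition standing in for the online SVD of Alg.2; the helper names, argument layout, and the pseudo-inverse are our assumptions.

import numpy as np

def post_gwnn_layer(O_prev, W, mu_prev, P_prev, P_curr, alpha, beta):
    """O_prev: (N, d') whitened outputs of layer i-1.
    P_prev: (d', d) whitening matrix of layer i-1 (defines W_hat).
    P_curr: (d', d) whitening matrix for the hidden features of layer i."""
    W_hat = W @ np.linalg.pinv(P_prev)           # W_hat = W (P_{d'}^{i-1})^{-1}, (d, d')
    h_hat = (O_prev - mu_prev) @ W_hat.T         # recover d-dimensional features, (N, d)
    h_d = h_hat @ P_curr.T                       # whiten the features, (N, d')
    phi = h_d / np.sqrt(h_d.var(axis=0) + 1e-5)
    return np.maximum(0.0, alpha * phi + beta)   # o^i_{d'}, (N, d')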
Conditioning. Here we show that whitening the hidden features also enforces good conditioning of the FIM. At this point, we have decorrelated the hidden features by satisfying $E[h^i_{d'}h^{i\mathrm{T}}_{d'}] = I$. Then $h^i_{d'}$ follows a standard multivariate Gaussian distribution, $h^i_{d'} \sim \mathcal{N}(0, I)$. As a result, the layer's output follows a rectified Gaussian distribution, which is uncorrelated as stated in Remark 1. In post-GWNN, whitening the hidden features of the $(i-1)$-th layer improves conditioning for the $i$-th layer. To see this, following the description in Sec.2.1, the diagonal block of the FIM associated with the $i$-th layer can be written as $F_{ii} \approx E[\delta\hat{h}^i(\delta\hat{h}^i)^{\mathrm{T}}] \otimes E[(o^{i-1}_{d'} - \mu^{i-1}_{d'})(o^{i-1}_{d'} - \mu^{i-1}_{d'})^{\mathrm{T}}]$, where the parameters have low correlations since $F_{ii}$ has a block diagonal structure.

Remark 1. Let $h \sim \mathcal{N}(0, I)$ and $o = \max(0, Ah + b)$. Then $E[(o_j - E[o_j])(o_k - E[o_k])] \approx 0$ if $A$ is a diagonal matrix, where $j, k$ index any two entries of $o$ and $j \neq k$.

For Remark 1, we have $A = \mathrm{diag}(\alpha^i_{d'})$ and $b = \beta^i_{d'}$. It tells us three things. First, by using whitening and BN, the covariance of any two different entries of $o^i_{d'}$ approaches zero. Second, at the iteration when we construct $P^i_{d'}$, we can estimate the full covariance matrix of $\hat{h}^i$ using the mean and variance of $o^{i-1}_{d'}$, since $E[\hat{h}^i\hat{h}^{i\mathrm{T}}] = \hat{W}^i E[(o^{i-1}_{d'} - \mu^{i-1}_{d'})(o^{i-1}_{d'} - \mu^{i-1}_{d'})^{\mathrm{T}}]\hat{W}^{i\mathrm{T}}$. The mean and variance can be estimated with a minibatch of samples, i.e. $N' \ll N$. Third, in the extreme case $N' = 1$, these statistics can still be computed in analytical form by leveraging Remark 2.

Remark 2. Let $x \sim \mathcal{N}(0, 1)$ and $y = \max(0, ax + b)$. Then $E[y] = \frac{a}{\sqrt{2\pi}}e^{-\frac{b^2}{2a^2}} + \frac{b}{2}\Psi(-\frac{b}{\sqrt{2}a})$ and $E[y^2] = \frac{ab}{\sqrt{2\pi}}e^{-\frac{b^2}{2a^2}} + \frac{a^2+b^2}{2}\Psi(-\frac{b}{\sqrt{2}a})$, where $\Psi(x) = 1 - \mathrm{erf}(x)$ and $\mathrm{erf}(x)$ is the error function.

The above remark gives the mean and variance of a rectified output unit that has shift and scale parameters. It generalizes (Arpit et al., 2016), which presented the special case $a = 1$ and $b = 0$. In that case, we have $E[y] = \frac{1}{\sqrt{2\pi}}$ and $\mathrm{Var}[y] = E[y^2] - E[y]^2 = \frac{1}{2} - \frac{1}{2\pi}$, which is consistent with previous work.

Extensions. Remarks 1 and 2 can be extended to other nonlinear activation functions, such as the leaky rectified unit defined as $\mathrm{leakyrelu}(x) = \max(0, x) + a\min(0, x)$, where the slope of the negative part is controlled by the coefficient $a$, which is fixed in (Maas et al., 2013) and learned in (He et al., 2015).
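The closed-form moments of Remark 2 are easy to check numerically; the sketch below is ours, uses the standard error function, assumes a > 0, and compares against a Monte-Carlo estimate, including the a = 1, b = 0 special case.

import math
import numpy as np

def rectified_gaussian_moments(a, b):
    """E[y] and E[y^2] for y = max(0, a*x + b), x ~ N(0, 1), per Remark 2 (a > 0)."""
    psi = 1.0 - math.erf(-b / (math.sqrt(2.0) * a))          # Psi(-b / (sqrt(2) a))
    gauss = math.exp(-b * b / (2.0 * a * a)) / math.sqrt(2.0 * math.pi)
    Ey = a * gauss + 0.5 * b * psi
    Ey2 = a * b * gauss + 0.5 * (a * a + b * b) * psi
    return Ey, Ey2

# Monte-Carlo sanity check.
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
a, b = 1.3, -0.4
y = np.maximum(0.0, a * x + b)
print(rectified_gaussian_moments(a, b))      # analytical (E[y], E[y^2])
print(y.mean(), (y ** 2).mean())             # empirical, should agree closely

# Special case a = 1, b = 0: E[y] = 1/sqrt(2*pi), Var[y] = 1/2 - 1/(2*pi).
Ey, Ey2 = rectified_gaussian_moments(1.0, 0.0)
print(Ey, Ey2 - Ey ** 2)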

3.3. Training post-GWNN

Similar to pre-GWNN, the learning problem can be formulated as
$$\min_{\theta}\; \lambda L(o^\ell, y; \theta) + (1 - \lambda)\sum_{i=1}^{\ell}L^{\mathrm{feat}}(h^i, \hat{h}^i; \theta) \qquad (5)$$
$$\text{s.t.}\quad \mathrm{rank}(P^i) \leq d', \quad E[h^i_{d'}h^{i\mathrm{T}}_{d'}] = I, \quad i = 1...\ell.$$

Eqn.(5) has two loss functions. Different from WNN and pre-GWNN, where the feature-equality constraint can be satisfied in closed form, this constraint is treated as an auxiliary loss function in post-GWNN, defined as $L^{\mathrm{feat}}(h^i, \hat{h}^i) = \frac{1}{2}\|(h^i - E[h^i]) - \hat{h}^i\|_2^2$ and minimized during training. It does not have an analytical solution because there is a nonlinear activation function between the weight matrix and the whitening matrix (i.e. in the previous layer). In Eqn.(5), $\lambda$ is a coefficient that balances the contributions of the two loss functions, and $1 - \lambda$ is linearly decayed as $1 - \lambda = \frac{k - t_w}{k}$, where $t_w = 1, 2, ..., k$. Each time after we update the whitening matrix, we restart the decay by setting $t_w = 1$, and $k$ indicates the number of iterations after which annealing stops.

Alg.2 summarizes the training procedure. It performs online updates of the top $d'$ eigenvectors of the whitening matrix, similar to pre-GWNN. In comparison, it decreases the runtime of the whitening transformation to $O(\frac{(N'+M)d'T\ell}{\tau'})$, which is an $\frac{N+M}{N'+M}$-fold reduction with respect to pre-GWNN. For example, when $N = M$ and $N' = 1$, post-GWNN is capable of reducing the computation of pre-GWNN and WNN by 2$\times$ and $(2\tau')\times$ respectively, while maintaining better conditioning than these alternatives by choosing a small $\tau'$.
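The loss weighting in Eqn.(5) and the restart schedule of Alg.2 can be sketched as follows; this is our illustration, task_loss and k are placeholders, and the per-layer sum of Eqn.(5) is written for a single layer for brevity.

import numpy as np

def feature_loss(h, h_hat):
    """L_feat = 0.5 * || (h - E[h]) - h_hat ||^2, averaged over the minibatch."""
    centered = h - h.mean(axis=0)
    return 0.5 * np.mean(np.sum((centered - h_hat) ** 2, axis=1))

def blended_loss(task_loss, h, h_hat, t_w, k):
    """Eqn.(5) for one layer: the auxiliary term is annealed after each whitening update."""
    lam = t_w / k                                # lambda = t_w/k, so 1 - lambda = (k - t_w)/k
    return lam * task_loss + (1.0 - lam) * feature_loss(h, h_hat)

# After every whitening update (line 10 of Alg.2): t_w = 1, then t_w grows by 1 per
# iteration until it reaches k, at which point the auxiliary term vanishes.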
4. Empirical Studies

We compare WNN, pre-GWNN, and post-GWNN in the following respects: a) the number of iterations until training converges, b) the computation time for training, and c) the generalization capacity on various datasets. We also conduct ablation studies on 1) the effect of the number of samples N used to estimate the covariance matrix for pre-GWNN and 2) the effect of N' for post-GWNN. Finally, we study tuning the value of d'.

Datasets. We employ the following datasets.
a) MNIST (Lecun et al., 1998) has 60,000 28x28 images of 10 handwritten digits (0-9) for training and another 10,000 test images. 5,000 images from the training set are randomly selected as a validation set.
b) CIFAR-10 (Krizhevsky, 2009) consists of 50,000 32x32 color images for training and 10,000 images for testing. Each image is categorized into one of 10 object labels. For CIFAR-10, 5,000 images are chosen for validation.
c) CIFAR-100 (Krizhevsky, 2009) has the same number of images as CIFAR-10, but each image is classified into one of 100 categories. For CIFAR-100, we select 5,000 images from the training set for validation.
d) SVHN (Netzer et al., 2011) consists of color images of house numbers collected by Google Street View. The task is to predict the center digit (0-9) of each image, which is of size 32x32. There are 73,257 images in the training set, 26,032 images for test, and 531,131 additional examples. We follow (Sermanet et al., 2012) to build a validation set by selecting 400 samples per class from the training set and 200 samples per class from the additional set. We do not train on the validation set, which is used for tuning hyperparameters.

Experimental Settings. We have two settings, an unsupervised and a supervised learning setting. First, following (Desjardins et al., 2015), we compare the above three approaches on the task of minimizing the reconstruction error of an autoencoder on MNIST. The encoder consists of 4 fc sigmoidal layers, which have 1000, 500, 256, and 30 hidden neurons respectively. The decoder is symmetric and untied with respect to the encoder. Second, for the task of image classification on CIFAR-10, CIFAR-100, and SVHN, we employ the same network-in-network (NIN) (Lin et al., 2014) architecture, which has 9 convolutional layers and 3 pooling layers defined as^2: conv(192,5)-conv(160,1)-maxpool(3,2)-conv(96,1)-conv(192,5)-conv(192,1)-avgpool(3,2)-conv(192,1)-conv(192,5)-conv(192,1)-conv(l,1)-avgpool(8,8), where l = 10 for CIFAR-10 and SVHN and l = 100 for CIFAR-100. For all models, we use SGD with a momentum of 0.9.

^2 'conv', 'maxpool', and 'avgpool' represent convolution, max pooling, and average pooling respectively. Each convolutional layer is written as conv(number of filters, filter size), and each pooling layer as pool(kernel size, stride). All convolutions have stride 1.

4.1. Comparisons of Convergence and Computations

We record the number of epochs and the computation time when training WNN, pre-GWNN, and post-GWNN on MNIST and CIFAR-100, using the first setting above for MNIST and the second setting for CIFAR-100. For both settings, hyperparameters are chosen by grid search on the validation sets. The search ranges of the minibatch size, learning rate, and whitening interval τ are {64, 128, 256}, {0.1, 0.01, 0.001}, and {20, 50, 100, 10^3}, respectively. In particular, for WNN and pre-GWNN, the number of samples used to estimate the covariance matrix, N, is picked from {10^3, 10^4/2, 10^4}. For post-GWNN, N' is chosen to be the same as the minibatch size and the decay period is k = 0.1τ. For a fair comparison, we report the best performance on the validation set for each approach, and we do not employ any data augmentation such as random image cropping and flipping.

[Figure 4 plots: validation/test error of SGD, WNN, pre-GWNN, and post-GWNN versus training epochs and wall-clock time on CIFAR-100 (a,b) and MNIST (c,d); see caption below.]

Figure 4. Training of WNN, pre-, post-GWNN on CIFAR-100 (a,b) and MNIST (c,d). (a) and (b) plot the convergence and computation time on the validation and test set of CIFAR-100 respectively. (c) and (d) report corresponding results on MNIST.

The convergence and computation time are reported in Fig.4 (a-d). We make several important observations. First, all three approaches converge much faster than the canonical network trained by SGD. Second, pre- and post-GWNN achieve better convergence than WNN on both datasets, as shown in (a) and (c); moreover, post-GWNN outperforms pre-GWNN. Third, post-GWNN significantly reduces computation time compared to all the other methods, as illustrated in (b) and (d). We see that although WNN reduces the number of epochs, it takes a long time to train because its whitening transformation occupies a large amount of computation.

[Figure 5 plots: validation error on CIFAR-100 versus epochs for (a) pre-GWNN with N in {100, 1000, 3000, 5000, 10000} and (b) post-GWNN with N' in {128, 64, 1}.]

Figure 5. Training of pre- and post-GWNN on CIFAR-100. (a) visualizes the impact of different values of N for pre-GWNN, showing that performance degrades when N is small. (b) plots the impact of N' for post-GWNN, which is insensitive to small values of N'.

4.2. Performances on various Datasets

We evaluate WNN, pre-GWNN, and post-GWNN on the CIFAR-10, CIFAR-100, and SVHN datasets, and compare their classification accuracies to existing state-of-the-art methods. For all datasets and approaches, we utilize the same network structure as in the second setting above. For the two CIFAR datasets, we adopt a minibatch size of 64 and an initial learning rate of 0.1, which is halved after every 25 epochs; we train for 250 epochs. As SVHN is a large dataset, we train for 100 epochs with a minibatch size of 128 and an initial learning rate of 0.05, which is halved after every 10 epochs. We train on CIFAR-10 and CIFAR-100 both without and with data augmentation, which includes random cropping and horizontal flipping. For SVHN, we do not augment the data, following (Sermanet et al., 2012). For all methods, we shuffle samples at the beginning of every epoch. We use N = 10^4 for WNN and pre-GWNN and N' = 64 for post-GWNN. For both pre- and post-GWNN, we have M = N and d' = d/2. The other experimental settings are similar to Sec.4.1. Table 1 shows the results. We see that pre- and post-GWNN consistently achieve better results than WNN, and also outperform previous state-of-the-art approaches.

Table 1. Comparisons of test errors (%) on various datasets. The top two performances are highlighted for each dataset.

CIFAR-10, without augmentation: NIN (Lin et al., 2014) 10.47; NIN+ALP (Agostinelli et al., 2015) 9.59; Normalization Propagation (Arpit et al., 2016) 9.11; BatchNorm (Ioffe & Szegedy, 2015) 9.41; DSN (Lee et al., 2015) 9.69; Maxout (Goodfellow et al., 2013) 11.68; WNN (Desjardins et al., 2015) 9.87; pre-GWNN 9.34; post-GWNN 8.97.
CIFAR-10, with augmentation: NIN 8.81; NIN+ALP 7.51; Normalization Propagation 7.47; BatchNorm 7.25; DSN 7.97; Maxout 9.38; WNN 8.02; pre-GWNN 7.38; post-GWNN 7.10.
CIFAR-100, without augmentation: NIN 35.68; NIN+ALP 34.40; Normalization Propagation 32.19; BatchNorm 35.32; DSN 34.57; Maxout 38.57; WNN 34.78; pre-GWNN 32.08; post-GWNN 31.10.
CIFAR-100, with augmentation: NIN 33.37; NIN+ALP 30.83; Normalization Propagation 29.24; BatchNorm 30.26; DSN 32.16; WNN 30.02; pre-GWNN 28.78; post-GWNN 28.10.
SVHN: NIN 2.35; NIN+ALP 2.04; Normalization Propagation 1.88; BatchNorm 2.25; DSN 1.92; Maxout 2.47; WNN 1.93; pre-GWNN 1.82; post-GWNN 1.74.

4.3. Ablation Studies

The following experiments are conducted on CIFAR-100 using pre- or post-GWNN. The first two experiments follow the setting described in Sec.4.1. First, we evaluate the effect of the number of samples, N, used to estimate the covariance matrix in pre-GWNN. We compare the performance for different values of N picked from {10^2, 10^3, 3x10^3, 5x10^3, 10^4}. Fig.5 (a) plots the results. We see that performance can drop because of ill conditioning when N is small, e.g. N = 100. When it is too large, e.g. N = 10^4, we observe slight overfitting. Second, Fig.5 (b) highlights the effect of N' in post-GWNN. We see that post-GWNN works reasonably well even when N' is small.

Finally, instead of treating d' = d/2 as a constant during training, we study the effect of tuning its value on the validation set using a simple heuristic strategy. If the validation error decreases by more than 2% over 4 consecutive evaluations, we set d' = d' - rate x d'. If the error shows no reduction over this period, d' is increased by the same rate. We use post-GWNN and follow the experimental setting in Sec.4.2, taking two different rates {0.1, 0.2} as examples. Fig.6 plots the variations of the dimension when d = 192 and reports the corresponding test errors (28.57 for rate 0.1 and 29.79 for rate 0.2). We find that keeping d' constant generally produces better results than this adaptive strategy, but the strategy yields less runtime because more dimensions are pruned.

[Figure 6. Effect of tuning d': the retained dimension d' versus training epochs for rates 0.1 and 0.2, with d = 192.]
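A small sketch of the d'-adjustment heuristic described above; it is our illustration, the evaluation cadence and error-history handling are assumptions, and whether the 2% criterion is absolute or relative is left as stated in the text (the sketch uses an absolute threshold in percentage points).

def adjust_d_prime(d_prime, val_errors, rate=0.1, window=4, threshold=2.0):
    """val_errors: validation errors (in percent) at consecutive evaluations, most recent last."""
    if len(val_errors) < window + 1:
        return d_prime
    improvement = val_errors[-window - 1] - val_errors[-1]
    if improvement > threshold:
        d_prime = int(d_prime - rate * d_prime)      # error still dropping: prune dimensions
    elif improvement <= 0:
        d_prime = int(d_prime + rate * d_prime)      # no reduction: grow d' back
    return max(1, d_prime)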

5. Conclusion

We presented generalized WNN (GWNN) to reduce the runtime and improve the generalization of WNN. Unlike WNN, which reduces computation time by whitening with a large period and thereby suffers ill conditioning of the FIM, GWNN learns a compact internal representation, such that the SVD can be approximated by the top eigenvectors in an online manner; this allows GWNN not only to reduce computation but also to improve generalization. By exploiting knowledge of the hidden representation's distribution, we showed that post-GWNN is able to compute the covariance matrix in closed form, a result that can also be extended to other activation functions. Extensive experiments demonstrated the effectiveness of GWNN.

Acknowledgements

This work is partially supported by the National Natural Science Foundation of China (61503366, 61472410, U1613211), the National Key Research and Development Program of China (No.2016YFC1400700), the External Cooperation Program of BIC, Chinese Academy of Sciences (No.172644KYSB20160033), and the Science and Technology Planning Project of Guangdong Province (2015B010129013, 2014B050505017).

References

Agostinelli, Forest, Hoffman, Matthew, Sadowski, Peter, and Baldi, Pierre. Learning activation functions to improve deep neural networks. In ICLR, 2015.

Amari, Shun-ichi and Nagaoka, Hiroshi. Methods of information geometry. In Translations of Mathematical Monographs, 2000.

Arpit, Devansh, Zhou, Yingbo, Kota, Bhargava U., and Govindaraju, Venu. Normalization propagation: A parametric technique for removing internal covariate shift in deep networks. In ICML, 2016.

Desjardins, Guillaume, Simonyan, Karen, Pascanu, Razvan, and Kavukcuoglu, Koray. Natural neural networks. In NIPS, 2015.

Goodfellow, Ian J., Warde-Farley, David, Mirza, Mehdi, Courville, Aaron, and Bengio, Yoshua. Maxout networks. In ICML, 2013.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.

Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

Krizhevsky, Alex. Learning multiple layers of features from tiny images. In Technical Report, 2009.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.

Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, 1998.

LeCun, Yann, Bottou, Leon, Orr, Genevieve B., and Müller, Klaus Robert. Efficient backprop. In Neural Networks: Tricks of the Trade, 2002.

Lee, Chen-Yu, Xie, Saining, Gallagher, Patrick, Zhang, Zhengyou, and Tu, Zhuowen. Deeply-supervised nets. In AISTATS, 2015.

Lin, Min, Chen, Qiang, and Yan, Shuicheng. Network in network. In ICLR, 2014.

Maas, Andrew L., Hannun, Awni Y., and Ng, Andrew Y. Rectifier nonlinearities improve neural network acoustic models. In ICML, 2013.

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

Povey, Daniel, Zhang, Xiaohui, and Khudanpur, Sanjeev. Parallel training of dnns with natural gradient and parameter averaging. In ICLR Workshop, 2015.

Raiko, Tapani, Valpola, Harri, and LeCun, Yann. Deep learning made easier by linear transformations in perceptrons. In AISTATS, 2012.

Sermanet, Pierre, Chintala, Soumith, and LeCun, Yann. Convolutional neural networks applied to house numbers digit classification. In arXiv:1204.3968, 2012.

Shamir, Ohad. A stochastic pca and svd algorithm with an exponential convergence rate. In ICML, 2015.

Tieleman, Tijmen and Hinton, Geoffrey. RMSprop: Divide the gradient by a running average of its recent magnitude. In Neural Networks for Machine Learning (Lecture 6.5), 2012.

Zhang, Xiangyu, Zou, Jianhua, He, Kaiming, and Sun, Jian. Accelerating very deep convolutional networks for classification and detection. In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.