Learning Deep Architectures via Generalized Whitened Neural Networks

Ping Luo 1 2

Abstract

Whitened Neural Network (WNN) is a recent advanced deep architecture, which improves convergence and generalization of canonical neural networks by whitening their internal hidden representation. However, the whitening transformation increases computation time. Unlike WNN, which reduced runtime by performing whitening only once every thousand iterations and thereby degraded convergence due to ill conditioning, we present generalized WNN (GWNN), which has three appealing properties. First, GWNN is able to learn a compact representation to reduce computations. Second, it enables the whitening transformation to be performed in a short period, preserving good conditioning. Third, we propose a data-independent estimation of the covariance matrix to further improve computational efficiency. Extensive experiments on various datasets demonstrate the benefits of GWNN.

1. Introduction

Deep neural networks (DNNs) have improved the performance of many applications, as the non-linearity of DNNs provides expressive modeling capacity, but it also makes DNNs difficult to train and easy to overfit the training data. The whitened neural network (WNN) (Desjardins et al., 2015), a recent advanced deep architecture, is designed to alleviate these difficulties. WNN extends batch normalization (BN) (Ioffe & Szegedy, 2015) by normalizing the internal hidden representation with a whitening transformation instead of standardization. Whitening helps regularize each diagonal block of the Fisher Information Matrix (FIM) to be an approximation of the identity matrix. This is an appealing property, as training WNN with stochastic gradient descent (SGD) mimics the fast convergence of natural gradient descent (NGD) (Amari & Nagaoka, 2000). The whitening transformation also improves generalization. As demonstrated in (Desjardins et al., 2015), WNN exhibited superiority when applied to various network architectures, such as autoencoders and convolutional neural networks, outperforming many previous methods including SGD, RMSprop (Tieleman & Hinton, 2012), and BN.

Although WNN is able to reduce the number of training iterations and improve generalization, it comes at the price of increased training time, because eigen-decomposition requires heavy computation. The runtime grows as the number of hidden layers that require the whitening transformation increases. We revisit WNN by breaking down its runtime and show that it mainly stems from two sources: 1) computing the full covariance matrix for whitening and 2) solving the singular value decomposition (SVD). Previous work (Desjardins et al., 2015) suggests overcoming these problems by a) using a subset of the training data to estimate the full covariance matrix and b) solving the SVD only once every hundreds or thousands of training iterations. Both remedies rely on the assumption that the decomposition remains valid over this period, which is generally not true. When this period becomes large, WNN degenerates to canonical SGD due to ill conditioning of the FIM.

We propose generalized WNN (GWNN), which possesses the beneficial properties of WNN but significantly reduces its runtime and improves its generalization. We introduce two variants of GWNN, pre-whitening and post-whitening GWNN. The former whitens a hidden layer's input values, whilst the latter whitens the pre-activation values (hidden features). GWNN has three appealing characteristics. First, compared to WNN, GWNN is capable of learning a more compact hidden representation, such that the SVD can be approximated by a few top eigenvectors to reduce computation. This compact representation also improves generalization. Second, it enables the whitening transformation to be performed in a short period, maintaining good conditioning of the FIM. Third, by exploiting knowledge of the distribution of the hidden features, we calculate the covariance matrix in an analytical form to further improve computational efficiency.

1 Guangdong Provincial Key Laboratory of Computer Vision and Virtual Reality Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China. 2 Multimedia Laboratory, The Chinese University of Hong Kong, Hong Kong. Correspondence to: Ping Luo.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

[Figure 1 diagram: (a) an ordinary fully-connected (fc) layer with weight matrix W; (b) a whitened fc layer of WNN, which inserts a whitening matrix P before the whitened weight matrix; (c) a pre-whitening GWNN layer; (d) a post-whitening GWNN layer.]

Figure 1. Comparisons of different architectures. An ordinary fully-connected (fc) layer (a) can be adapted into (b) a whitened fc layer, (c) a pre-GWNN layer, and (d) a post-GWNN layer. (c) and (d) learn more compact representations than (b) does.

2. Notation and Background

We begin by defining the basic notation for a feed-forward neural network. A neural network transforms an input vector $o^0$ into an output vector $o^\ell$ through a series of hidden layers $\{o^i\}_{i=1}^{\ell}$. We assume each layer has identical dimension for simplicity of notation, i.e. $\forall o^i \in \mathbb{R}^{d\times 1}$. In this case, all vectors and matrixes in the following have $d$ rows unless otherwise stated. As shown in Fig.1 (a), each fully-connected (fc) layer consists of a weight matrix, $W^i$, and a set of hidden neurons, $h^i$, each of which receives as input a weighted sum of outputs from the previous layer. We have $h^i = W^i o^{i-1}$. In this work, we take the fully-connected network as an example. Note that the above computation can also be applied to a convolutional network, where an image patch is vectorized as a column vector represented by $o^{i-1}$, and each row of $W^i$ represents a filter.

As recent deep architectures typically stack a batch normalization (BN) layer before the pre-activation values, we do not explicitly include a bias term when computing $h^i$, because it is normalized in BN, such that $\phi^i = \frac{h^i - E[h^i]}{\sqrt{\mathrm{Var}[h^i]}}$, where the expectation and variance are computed over a minibatch of samples. LeCun et al. (2002) showed that such normalization speeds up convergence even when the hidden features are not decorrelated. Furthermore, the output of each layer is calculated by a nonlinear activation function. A popular choice is the rectified linear unit, $\mathrm{relu}(x) = \max(0, x)$. The precise computation for an output is $o^i = \max(0, \mathrm{diag}(\alpha^i)\phi^i + \beta^i)$, where $\mathrm{diag}(x)$ represents a matrix whose diagonal entries are $x$. $\alpha^i$ and $\beta^i$ are two vectors that scale and shift the normalized features, in order to maintain the network's representation capacity.

2.1. Whitened Neural Networks

This section revisits whitened neural networks (WNN). Any neural architecture can be adapted to a WNN by stacking a whitening transformation layer after the layer's input. For example, Fig.1 (b) adapts the fc layer shown in Fig.1 (a) into a whitened fc layer. Its information flow becomes
$$\tilde{o}^{i-1} = P^{i-1}(o^{i-1} - \mu^{i-1}), \quad \hat{h}^i = \hat{W}^i \tilde{o}^{i-1}, \quad \phi^i = \frac{\hat{h}^i}{\sqrt{\mathrm{Var}[\hat{h}^i]}}, \quad o^i = \max(0, \mathrm{diag}(\alpha^i)\phi^i + \beta^i), \qquad (1)$$
where $\mu^{i-1}$ represents a centering variable, $\mu^{i-1} = E[o^{i-1}]$. $P^{i-1}$ is a whitening matrix whose rows are obtained from the eigen-decomposition of $\Sigma^{i-1}$, the covariance matrix of the input, $\Sigma^{i-1} = E[(o^{i-1} - \mu^{i-1})(o^{i-1} - \mu^{i-1})^{\mathrm{T}}]$. The input is decorrelated by $P^{i-1}$ in the sense that its covariance matrix becomes an identity matrix, i.e. $E[\tilde{o}^{i-1}\tilde{o}^{i-1\mathrm{T}}] = I$. To avoid ambiguity, we use the hat notation to distinguish the variables of WNN from those of the canonical fc layer whenever necessary; for instance, $\hat{W}^i$ represents a whitened weight matrix. In Eqn.(1), the computation of the BN layer has been simplified because we have $E[\hat{h}^i] = \hat{W}^i P^{i-1}(E[o^{i-1}] - \mu^{i-1}) = 0$.

We define $\theta$ to be a vector consisting of all the whitened weight matrixes concatenated together, $\theta = \{\mathrm{vec}(\hat{W}^1)^{\mathrm{T}}, \mathrm{vec}(\hat{W}^2)^{\mathrm{T}}, ..., \mathrm{vec}(\hat{W}^\ell)^{\mathrm{T}}\}$, where $\mathrm{vec}(\cdot)$ is an operator that vectorizes a matrix by stacking its columns. Let $L(o^\ell, y; \theta)$ denote a loss function of WNN, which measures the disagreement between a prediction $o^\ell$ made by the network and a target $y$. WNN is trained by minimizing the loss function with respect to the parameter vector $\theta$ subject to two constraints,
$$\min_{\theta} L(o^\ell, y; \theta) \qquad (2)$$
$$\text{s.t.}\quad E[\tilde{o}^{i-1}\tilde{o}^{i-1\mathrm{T}}] = I, \quad h^i - E[h^i] = \hat{h}^i, \quad i = 1...\ell.$$

To satisfy the first constraint, $P^{i-1}$ is obtained by decomposing the covariance matrix, $\Sigma^{i-1} = U^{i-1}S^{i-1}U^{i-1\mathrm{T}}$. We choose $P^{i-1} = (S^{i-1})^{-\frac{1}{2}}U^{i-1\mathrm{T}}$, where $S^{i-1}$ is a diagonal matrix whose diagonal elements are eigenvalues and $U^{i-1}$ is an orthogonal matrix of eigenvectors. The first constraint then holds by construction of the eigen-decomposition.

The second constraint, $h^i - E[h^i] = \hat{h}^i$, enforces that the centered hidden features are the same before and after adapting a fc layer to WNN, as shown in Fig.1 (a) and (b). In other words, it ensures that their representation powers are identical. By combining the computations in Fig.1 (a) and Eqn.(1), the second constraint implies that $\|(h^i - E[h^i]) - \hat{h}^i\|_2^2 = \|(W^i o^{i-1} - W^i\mu^{i-1}) - \hat{W}^i\tilde{o}^{i-1}\|_2^2 = 0$, which has a closed-form solution, $\hat{W}^i = W^i(P^{i-1})^{-1}$. To see this, we have $\hat{h}^i = W^i(P^{i-1})^{-1}P^{i-1}(o^{i-1} - \mu^{i-1}) = W^i(o^{i-1} - \mu^{i-1}) = h^i - E[h^i]$. The representation capacity can therefore be preserved by mapping the whitened weight matrix from the ordinary weight matrix.

Conditioning of the FIM. Here we show that WNN improves training efficiency by conditioning the Fisher information matrix (FIM) (Amari & Nagaoka, 2000). A FIM, denoted as $F$, consists of $\ell\times\ell$ blocks. Each block is indexed by $F_{ij}$, representing the covariance (co-adaptation) between the whitened weight matrixes of the $i$-th and $j$-th layers. We have $F_{ij} = E[\mathrm{vec}(\delta\hat{W}^i)\mathrm{vec}(\delta\hat{W}^j)^{\mathrm{T}}]$, where $\delta\hat{W}^i$ indicates the gradient of the $i$-th whitened weight matrix. For example, the gradient of $\hat{W}^i$ is given by $\tilde{o}^{i-1}(\delta\hat{h}^i)^{\mathrm{T}}$, as illustrated in Eqn.(1). We have $\mathrm{vec}(\delta\hat{W}^i) = \mathrm{vec}(\tilde{o}^{i-1}(\delta\hat{h}^i)^{\mathrm{T}}) = \delta\hat{h}^i \otimes \tilde{o}^{i-1}$, where $\otimes$ denotes the Kronecker product. In this case, $F_{ij}$ can be rewritten as $E[(\delta\hat{h}^i \otimes \tilde{o}^{i-1})(\delta\hat{h}^j \otimes \tilde{o}^{j-1})^{\mathrm{T}}] = E[\delta\hat{h}^i(\delta\hat{h}^j)^{\mathrm{T}} \otimes \tilde{o}^{i-1}(\tilde{o}^{j-1})^{\mathrm{T}}]$. By assuming $\delta\hat{h}$ and $\tilde{o}$ are independent, as demonstrated in (Raiko et al., 2012), $F_{ij}$ can be approximated by $E[\delta\hat{h}^i(\delta\hat{h}^j)^{\mathrm{T}}] \otimes E[\tilde{o}^{i-1}(\tilde{o}^{j-1})^{\mathrm{T}}]$. As a result, when $i = j$, each diagonal block of $F$, $F_{ii}$, has a block diagonal structure, because $E[\tilde{o}^{i-1}(\tilde{o}^{i-1})^{\mathrm{T}}] = I$ as shown in Eqn.(2); this improves conditioning of the FIM and thus speeds up training. In general, WNN regularizes the diagonal blocks of the FIM and achieves stronger conditioning than methods (LeCun et al., 2002; Tieleman & Hinton, 2012) that regularize only the diagonal entries.

Algorithm 1 Training WNN
1: Init: initial network parameters θ, α^i, β^i; whitening matrix P^{i-1} = I; iteration t = 0; Ŵ^i_t = W^i; ∀i ∈ {1...ℓ}.
2: repeat
3:   for i = 1 to ℓ do
4:     update whitened weight matrix Ŵ^i_t and parameters α^i_t, β^i_t using SGD.
5:     if mod(t, τ) = 0 then
6:       store old whitening matrix P^{i-1}_o = P^{i-1}.
7:       construct new matrix P^{i-1} by eigen-decomposition of Σ^{i-1}, which is estimated using N samples.
8:       transform weight matrix Ŵ^i_t = Ŵ^i_t P^{i-1}_o (P^{i-1})^{-1}.
9:     end if
10:  end for
11:  t = t + 1.
12: until convergence

Training WNN. Alg.1 summarizes the training of WNN. At the 1st line, the whitened weight matrix $\hat{W}^i_0$ is initialized by $W^i$ of the ordinary fc layer, which can be pretrained or sampled from a Gaussian distribution. The 4th line shows that $\hat{W}^i_t$ is updated in each iteration $t$ using SGD. The first and second constraints are enforced in the 7th and 8th lines respectively; for example, the 8th line ensures that the hidden features are the same before and after updating the whitening matrix. As the distribution of the hidden representation changes after every update of the whitened weight matrix, the whitening matrix $P^{i-1}$ needs to be reconstructed frequently, by performing eigen-decomposition on $\Sigma^{i-1}$ estimated from $N$ samples, in order to maintain good conditioning of the FIM. $N$ is typically $10^4$ in experiments. However, this raw strategy increases computation time. Desjardins et al. (2015) performed whitening only once every $\tau$ iterations, as shown in the 5th line of Alg.1, to reduce computation, e.g. $\tau = 10^3$.
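To make the mapping of Eqn.(1) and the re-whitening step of Alg.1 concrete, the following NumPy sketch is our illustration rather than the authors' code; the function names, the small numerical epsilon, and the use of a plain eigen-decomposition (instead of a pretrained network's running statistics) are assumptions.

import numpy as np

def build_whitening(X):
    """X: (N, d) minibatch of layer inputs. Returns mu and P = S^(-1/2) U^T."""
    mu = X.mean(axis=0)                          # centering variable mu^{i-1}
    Sigma = np.cov(X - mu, rowvar=False)         # covariance Sigma^{i-1}
    S, U = np.linalg.eigh(Sigma)                 # eigenvalues S, eigenvectors U
    P = (U / np.sqrt(S + 1e-5)).T                # P = S^{-1/2} U^T (eps for stability)
    return mu, P

def whitened_layer(X, W, P, mu, alpha, beta):
    """Forward pass of Eqn.(1): whiten input, apply W_hat, BN scale/shift, relu."""
    W_hat = W @ np.linalg.inv(P)                 # closed-form W_hat = W (P)^{-1}
    o_tilde = (X - mu) @ P.T                     # decorrelated input o_tilde
    h_hat = o_tilde @ W_hat.T                    # hidden features, approximately zero mean
    phi = h_hat / np.sqrt(h_hat.var(axis=0) + 1e-5)
    return np.maximum(0.0, alpha * phi + beta)

def rewhiten(W_hat, P_old, P_new):
    """Line 8 of Alg.1: keep hidden features unchanged when P is refreshed."""
    return W_hat @ P_old @ np.linalg.inv(P_new)

The key design point is that rewhiten leaves the product Ŵ P (and hence the hidden features) unchanged, which is exactly what the second constraint of Eqn.(2) requires.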
[Figure 2 visualization: four covariance matrixes (a)-(d) shown as heatmaps with a shared colorbar from 0 to 1; their Pearson's correlations with the identity matrix are 0.34, 0.35, 0.65, and 0.95 respectively.]

Figure 2. Visualizations of different covariance matrixes (a)-(d), which have different Pearson's correlations (top) with respect to an identity matrix. A larger Pearson's correlation indicates higher orthogonality. (a,b) are sampled from a uniform distribution between 0 and 1. (c,d) are generated by truncating a random orthogonal matrix with different numbers of columns. The colorbar (right) indicates the value of each entry in these matrixes.

How good is the conditioning of the FIM obtained by Alg.1? We measure the similarity of the covariance matrix, $E[\tilde{o}^{i-1}(\tilde{o}^{i-1})^{\mathrm{T}}]$, to the identity matrix $I$; we call this measure the orthogonality. We employ Pearson's correlation$^1$ as the similarity between two matrixes. Intuitively, this measure takes values between $-1$ and $+1$, representing negative and positive correlations, and larger values indicate higher orthogonality. Fig.2 visualizes four randomly generated covariance matrixes, where (a,b) are sampled from a uniform distribution between 0 and 1, and (c,d) are generated by truncating different numbers of columns of a randomly generated orthogonal matrix. For instance, (a,b) have small similarity to the identity matrix. In contrast, when the correlation equals 0.65 as shown in (c), all diagonal entries are larger than 0.9 and more than 80% of the off-diagonal entries have values smaller than 0.1. Furthermore, Pearson's correlation is insensitive to the size of the matrix, so the orthogonality of different layers can be compared together. For example, although the matrixes in (a) and (b) have different sizes, they have similar orthogonality values because they are sampled from the same distribution.

$^1$ Given an identity matrix, $I$, and a covariance matrix, $\Sigma$, the Pearson's correlation between them is defined as $\mathrm{corr}(\Sigma, I) = \frac{\mathrm{vec}(\Sigma)^{\mathrm{T}}\mathrm{vec}(I)}{\sqrt{\mathrm{vec}(\Sigma)^{\mathrm{T}}\mathrm{vec}(\Sigma)\cdot\mathrm{vec}(I)^{\mathrm{T}}\mathrm{vec}(I)}}$, where $\mathrm{vec}(\Sigma)$ is normalized by subtracting the mean of all its entries.
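The measure in footnote 1 is easy to compute; the sketch below is ours, and centering both flattened matrixes (the footnote states the centering explicitly only for vec(Σ)) is our reading of the standard Pearson correlation.

import numpy as np

def orthogonality(Sigma):
    """Pearson correlation between vec(Sigma) and vec(I), per footnote 1."""
    d = Sigma.shape[0]
    a = Sigma.flatten() - Sigma.mean()
    b = np.eye(d).flatten() - 1.0 / d            # the identity has mean d/d^2 = 1/d
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

# Example: the covariance of decorrelated data scores near 1,
# while a generic random covariance matrix scores much lower.
rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 64))
print(orthogonality(np.cov(X, rowvar=False)))    # close to 1
A = rng.random((64, 64))
print(orthogonality(A @ A.T / 64))               # noticeably smaller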

As shown in Fig.3 (a), we adopt a network-in-network (NIN) (Lin et al., 2014) trained on CIFAR-10 (Krizhevsky, 2009), and plot the orthogonalities of three different convolutional layers, which are whitened every $\tau = 10^3$ iterations using Alg.1. We see that the orthogonality values fluctuate heavily during training, except those of the first convolutional layer, abbreviated as 'conv1'. This is because the distributions of the deeper layers' inputs change after the whitened weight matrixes have been updated, leading to ill conditioning of the whitening matrixes, which are estimated only at large intervals. In fact, a large $\tau$ degenerates WNN to canonical SGD. In contrast, 'conv1' takes image data as input, whose distribution is typically stable during training; its whitening matrix can be estimated once at the beginning and kept fixed for the entire training stage.

[Figure 3 plots: orthogonality versus training iterations (×10^3) of layers conv1, conv4, and conv7 for (a) WNN and (b) pre-GWNN.]

Figure 3. Comparisons of conditioning when training a network-in-network (Lin et al., 2014) on CIFAR-10 (Krizhevsky, 2009) by using (a) WNN and (b) pre-GWNN. Compared to (b), the orthogonalities of three different layers in (a) have large fluctuations due to the ill conditioning of whitening, which is performed with a large period $\tau$. As a result, when $\tau$ increases, WNN degenerates to canonical SGD.

In the section below, we present generalized whitened neural networks to improve conditioning of the FIM while reducing computation time.

3. Generalized Whitened Neural Networks

We present two types of generalized WNN (GWNN), pre-whitening and post-whitening GWNN. Both models share the beneficial properties of WNN, but have lower computation time.

3.1. Pre-whitening GWNN

This section introduces pre-whitening GWNN, abbreviated as pre-GWNN, which performs the whitening transformation before applying the weight matrix (i.e. it whitens the input), as illustrated in Fig.1 (c). When adapting a fc layer to pre-GWNN, the whitening matrix is truncated by removing the eigenvectors that have small eigenvalues, in order to learn a compact representation. This allows the input vector to vary its length, so as to gradually adapt the learned representation to informative patterns with high variations rather than noise. Learning pre-GWNN is formulated analogously to learning WNN in Eqn.(2), but with one additional constraint truncating the rank of the whitening matrix,
$$\min_{\theta} L(o^\ell, y; \theta) \qquad (3)$$
$$\text{s.t.}\quad \mathrm{rank}(P^{i-1}) \leq d', \quad E[\tilde{o}^{i-1}_{d'}\tilde{o}^{i-1\mathrm{T}}_{d'}] = I, \quad h^i - E[h^i] = \hat{h}^i, \quad i = 1...\ell.$$

Let $d$ be the dimension of the original fc layer. By combining Eqn.(2), we have $P^{i-1} = (S^{i-1})^{-\frac{1}{2}}U^{i-1\mathrm{T}} \in \mathbb{R}^{d\times d}$, which is truncated to $P^{i-1}_{d'} = (S^{i-1}_{d'})^{-\frac{1}{2}}U^{i-1\mathrm{T}}_{d'} \in \mathbb{R}^{d'\times d}$, where $S^{i-1}_{d'}$ is obtained by keeping the rows and columns associated with the $d'$ largest eigenvalues, whilst $U^{i-1}_{d'}$ contains the corresponding $d'$ eigenvectors. The value of $d'$ can be tuned using a validation set. For simplicity, we choose $d' = \frac{d}{2}$, which works well throughout our experiments. This is inspired by the finding in (Zhang et al., 2015), who disclosed that the first 50% of eigenvectors contribute over 95% of the energy in a deep model.

More specifically, pre-GWNN first projects an input vector into a $d'$-dimensional space, $\tilde{o}^{i-1}_{d'} = P^{i-1}_{d'}(o^{i-1} - \mu^{i-1}) \in \mathbb{R}^{d'\times 1}$. The whitened weight matrix then produces a hidden feature vector of $d$ dimensions, which has the same length as in the ordinary fc layer, i.e. $\hat{h}^i = \hat{W}^i\tilde{o}^{i-1}_{d'} \in \mathbb{R}^{d\times 1}$, where $\hat{W}^i = W^i(P^{i-1}_{d'})^{-1} \in \mathbb{R}^{d\times d'}$. The computations of BN and the nonlinear activation are identical to Eqn.(1).

Training pre-GWNN is similar to Alg.1. The main modification is made at the 7th line in order to reduce runtime. Although Alg.1 decreases the number of iterations until training converges, each iteration incurs additional computation time for eigen-decomposition. For example, in WNN, the computation required for the full singular value decomposition (SVD) is typically $O(Nd^2)$, where $N$ represents the number of samples employed to estimate the covariance matrix. In particular, when we have $\ell$ whitened layers and $T$ training iterations, all whitening transformations occupy $O(\frac{Nd^2T\ell}{\tau})$ runtime over the entire training stage. In contrast, pre-GWNN performs an online estimation of the top $d'$ eigenvectors in $P^{i-1}_{d'}$, such as online SVD (Shamir, 2015; Povey et al., 2015), instead of the full SVD used in WNN. This reduces the runtime to $O(\frac{(N+M)d'T\ell}{\tau'})$, where $\tau'$ represents the whitening interval in GWNN and $M$ is the number of samples used to estimate the top eigenvectors. We have $M = N$ as employed in previous works.
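A minimal sketch of the truncated whitening used by pre-GWNN, assuming a plain offline eigen-decomposition in place of the online SVD mentioned above and using the pseudo-inverse for the non-square $(P^{i-1}_{d'})^{-1}$; both choices, and all names below, are our illustration rather than the paper's implementation.

import numpy as np

def truncated_whitening(X, d_prime):
    """Keep the top d' eigen-directions: P_d' = S_d'^(-1/2) U_d'^T, shape (d', d)."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X - mu, rowvar=False)
    S, U = np.linalg.eigh(Sigma)                 # ascending eigenvalues
    idx = np.argsort(S)[::-1][:d_prime]          # indices of the d' largest eigenvalues
    P_d = (U[:, idx] / np.sqrt(S[idx] + 1e-5)).T
    return mu, P_d

def pre_gwnn_layer(X, W, mu, P_d, alpha, beta):
    """Pre-GWNN forward: project to d' dimensions, then recover d-dimensional features."""
    W_hat = W @ np.linalg.pinv(P_d)              # W_hat = W (P_d')^{-1}, shape (d, d')
    o_tilde = (X - mu) @ P_d.T                   # compact whitened input, (N, d')
    h_hat = o_tilde @ W_hat.T                    # hidden features, (N, d)
    phi = h_hat / np.sqrt(h_hat.var(axis=0) + 1e-5)
    return np.maximum(0.0, alpha * phi + beta)

# e.g. d = 192 and d_prime = 96 halves the whitening dimension, as in the paper.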

For pre-GWNN, reducing runtime and improving conditioning is a tradeoff, since the former requires increasing $\tau'$ whilst the latter requires decreasing it. When $M = N$ and $d' = \frac{d}{2}$, comparing the runtime complexity of pre-GWNN to that of WNN gives a ratio of $\frac{d\tau'}{\tau}$, which tells us that whitening can be performed over a short interval without increasing runtime. For instance, as shown in Fig.3 (b), when $\tau' = 20$ the orthogonalities are well preserved and more stable than those in (a). In this case, pre-GWNN reduces the computation of WNN by at least 20$\times$ when $d > \tau$, which is a typical regime in recent deep architectures (Krizhevsky et al., 2012; Lin et al., 2014), where $d \in \{1024, 2048, 4096\}$.

3.2. Post-whitening GWNN

Another variant we propose is post-whitening GWNN, abbreviated as post-GWNN. Unlike WNN and pre-GWNN, post-GWNN performs the whitening transformation after applying the weight matrix (i.e. it whitens the features), as illustrated in Fig.1 (d). In general, post-GWNN reduces the runtime to $O(\frac{(N'+M)d'T\ell}{\tau'})$, where $N' \ll N$.

Fig.1 (d) shows how to adapt a fc layer to post-GWNN. Suppose $o^{i-1}_{d'}$ has been whitened by $P^{i-1}_{d'}$ in the previous layer; at the $i$-th layer we have
$$\hat{h}^i = \hat{W}^i(o^{i-1}_{d'} - \mu^{i-1}_{d'}), \quad h^i_{d'} = P^i_{d'}\hat{h}^i, \quad \phi^i_{d'} = \frac{h^i_{d'}}{\sqrt{\mathrm{Var}[h^i_{d'}]}}, \quad o^i_{d'} = \max(0, \mathrm{diag}(\alpha^i_{d'})\phi^i_{d'} + \beta^i_{d'}), \qquad (4)$$
where $\mu^{i-1}_{d'} = E[o^{i-1}_{d'}]$. In Eqn.(4), a feature vector $\hat{h}^i \in \mathbb{R}^{d\times 1}$ is first produced by applying a whitened weight matrix to the input, in order to recover the original feature length of the fc layer. A whitening matrix then projects $\hat{h}^i$ to a decorrelated feature vector $h^i_{d'} \in \mathbb{R}^{d'\times 1}$. We have $\hat{W}^i = W^i(P^{i-1}_{d'})^{-1} \in \mathbb{R}^{d\times d'}$, where $P^{i-1}_{d'} = (S^{i-1}_{d'})^{-\frac{1}{2}}U^{i-1\mathrm{T}}_{d'} \in \mathbb{R}^{d'\times d}$, and $U$ and $S$ contain the eigenvectors and eigenvalues of the hidden features at the $(i-1)$-th layer.

Algorithm 2 Training post-GWNN
1: Init: initial θ, α^i, β^i; t = 0, t_w = k, λ = t_w/k; P^{i-1} = I, Ŵ^i_t = W^i, ∀i ∈ {1...ℓ}.
2: repeat
3:   for i = 1 to ℓ do
4:     update Ŵ^i_t, α^i_t, and β^i_t by SGD.
5:     if mod(t, τ) = 0 then
6:       store old P^{i-1}_o = P^{i-1}_{d'}.
7:       estimate the mean and variance of ĥ^i with a minibatch of N' samples, or following Remark 2 when N' = 1.
8:       update P^{i-1}_{d'} by online SVD.
9:       transform Ŵ^i_t = Ŵ^i_t P^{i-1}_o (P^{i-1}_{d'})^{-1}.
10:      t_w = 1 and λ = t_w/k.
11:    end if
12:  end for
13:  t = t + 1.
14:  if t_w < k then t_w = t_w + 1 end if
15: until convergence
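A sketch of the post-GWNN forward pass in Eqn.(4), again with an offline decomposition standing in for the online SVD of Alg.2; the helper names, argument layout, and the pseudo-inverse are our assumptions.

import numpy as np

def post_gwnn_layer(O_prev, W, mu_prev, P_prev, P_curr, alpha, beta):
    """O_prev: (N, d') whitened outputs of layer i-1.
    P_prev: (d', d) whitening matrix of layer i-1 (defines W_hat).
    P_curr: (d', d) whitening matrix for the hidden features of layer i."""
    W_hat = W @ np.linalg.pinv(P_prev)           # W_hat = W (P_{d'}^{i-1})^{-1}, (d, d')
    h_hat = (O_prev - mu_prev) @ W_hat.T         # recover d-dimensional features, (N, d)
    h_d = h_hat @ P_curr.T                       # whiten the features, (N, d')
    phi = h_d / np.sqrt(h_d.var(axis=0) + 1e-5)
    return np.maximum(0.0, alpha * phi + beta)   # o^i_{d'}, (N, d')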
Conditioning. Here we show that whitening the hidden features also enforces good conditioning of the FIM. At this point, we have decorrelated the hidden features by satisfying $E[h^i_{d'}h^{i\mathrm{T}}_{d'}] = I$. Then $h^i_{d'}$ follows a standard multivariate Gaussian distribution, $h^i_{d'} \sim \mathcal{N}(0, I)$. As a result, the layer's output follows a rectified Gaussian distribution, which is uncorrelated as stated in Remark 1. In post-GWNN, whitening the hidden features of the $(i-1)$-th layer improves conditioning for the $i$-th layer. To see this, following the description in Sec.2.1, the diagonal block of the FIM associated with the $i$-th layer can be written as $F_{ii} \approx E[\delta\hat{h}^i(\delta\hat{h}^i)^{\mathrm{T}}] \otimes E[(o^{i-1}_{d'} - \mu^{i-1}_{d'})(o^{i-1}_{d'} - \mu^{i-1}_{d'})^{\mathrm{T}}]$, where the parameters have low correlations since $F_{ii}$ has a block diagonal structure.

Remark 1. Let $h \sim \mathcal{N}(0, I)$ and $o = \max(0, Ah + b)$. Then $E[(o_j - E[o_j])(o_k - E[o_k])] \approx 0$ if $A$ is a diagonal matrix, where $j, k$ index any two entries of $o$ and $j \neq k$.

For Remark 1, we have $A = \mathrm{diag}(\alpha^i_{d'})$ and $b = \beta^i_{d'}$. It tells us three things. First, by using whitening and BN, the covariance of any two different entries of $o^i_{d'}$ approaches zero. Second, at the iteration when we construct $P^i_{d'}$, we can estimate the full covariance matrix of $\hat{h}^i$ using the mean and variance of $o^{i-1}_{d'}$, since $E[\hat{h}^i\hat{h}^{i\mathrm{T}}] = \hat{W}^i E[(o^{i-1}_{d'} - \mu^{i-1}_{d'})(o^{i-1}_{d'} - \mu^{i-1}_{d'})^{\mathrm{T}}]\hat{W}^{i\mathrm{T}}$. The mean and variance can be estimated with a minibatch of samples, i.e. $N' \ll N$. Third, in the extreme case $N' = 1$, these statistics can still be computed in analytical form by leveraging Remark 2.

Remark 2. Let $x \sim \mathcal{N}(0, 1)$ and $y = \max(0, ax + b)$. Then $E[y] = \frac{a}{\sqrt{2\pi}}e^{-\frac{b^2}{2a^2}} + \frac{b}{2}\Psi(-\frac{b}{\sqrt{2}a})$ and $E[y^2] = \frac{ab}{\sqrt{2\pi}}e^{-\frac{b^2}{2a^2}} + \frac{a^2+b^2}{2}\Psi(-\frac{b}{\sqrt{2}a})$, where $\Psi(x) = 1 - \mathrm{erf}(x)$ and $\mathrm{erf}(x)$ is the error function.

The above remark gives the mean and variance of a rectified output unit that has shift and scale parameters. It generalizes (Arpit et al., 2016), which presented the special case $a = 1$ and $b = 0$. In that case, we have $E[y] = \frac{1}{\sqrt{2\pi}}$ and $\mathrm{Var}[y] = E[y^2] - E[y]^2 = \frac{1}{2} - \frac{1}{2\pi}$, which is consistent with previous work.

Extensions. Remarks 1 and 2 can be extended to other nonlinear activation functions, such as the leaky rectified unit defined as $\mathrm{leakyrelu}(x) = \max(0, x) + a\min(0, x)$, where the slope of the negative part is controlled by the coefficient $a$, which is fixed in (Maas et al., 2013) and learned in (He et al., 2015).
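The closed-form moments of Remark 2 are easy to check numerically; the sketch below is ours, uses the standard error function, assumes a > 0, and compares against a Monte-Carlo estimate, including the a = 1, b = 0 special case.

import math
import numpy as np

def rectified_gaussian_moments(a, b):
    """E[y] and E[y^2] for y = max(0, a*x + b), x ~ N(0, 1), per Remark 2 (a > 0)."""
    psi = 1.0 - math.erf(-b / (math.sqrt(2.0) * a))          # Psi(-b / (sqrt(2) a))
    gauss = math.exp(-b * b / (2.0 * a * a)) / math.sqrt(2.0 * math.pi)
    Ey = a * gauss + 0.5 * b * psi
    Ey2 = a * b * gauss + 0.5 * (a * a + b * b) * psi
    return Ey, Ey2

# Monte-Carlo sanity check.
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
a, b = 1.3, -0.4
y = np.maximum(0.0, a * x + b)
print(rectified_gaussian_moments(a, b))      # analytical (E[y], E[y^2])
print(y.mean(), (y ** 2).mean())             # empirical, should agree closely

# Special case a = 1, b = 0: E[y] = 1/sqrt(2*pi), Var[y] = 1/2 - 1/(2*pi).
Ey, Ey2 = rectified_gaussian_moments(1.0, 0.0)
print(Ey, Ey2 - Ey ** 2)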

3.3. Training post-GWNN

Similar to pre-GWNN, the learning problem can be formulated as
$$\min_{\theta}\; \lambda L(o^\ell, y; \theta) + (1 - \lambda)\sum_{i=1}^{\ell}L^{\mathrm{feat}}(h^i, \hat{h}^i; \theta) \qquad (5)$$
$$\text{s.t.}\quad \mathrm{rank}(P^i) \leq d', \quad E[h^i_{d'}h^{i\mathrm{T}}_{d'}] = I, \quad i = 1...\ell.$$

Eqn.(5) has two loss functions. Different from WNN and pre-GWNN, where the feature-equality constraint can be satisfied in closed form, this constraint is treated as an auxiliary loss function in post-GWNN, defined as $L^{\mathrm{feat}}(h^i, \hat{h}^i) = \frac{1}{2}\|(h^i - E[h^i]) - \hat{h}^i\|_2^2$ and minimized during training. It does not have an analytical solution because there is a nonlinear activation function between the weight matrix and the whitening matrix (i.e. in the previous layer). In Eqn.(5), $\lambda$ is a coefficient that balances the contributions of the two loss functions, and $1 - \lambda$ is linearly decayed as $1 - \lambda = \frac{k - t_w}{k}$, where $t_w = 1, 2, ..., k$. Each time after we update the whitening matrix, we restart the decay by setting $t_w = 1$, and $k$ indicates the number of iterations after which annealing stops.

Alg.2 summarizes the training procedure. It performs online updates of the top $d'$ eigenvectors of the whitening matrix, similar to pre-GWNN. In comparison, it decreases the runtime of the whitening transformation to $O(\frac{(N'+M)d'T\ell}{\tau'})$, which is an $\frac{N+M}{N'+M}$-fold reduction with respect to pre-GWNN. For example, when $N = M$ and $N' = 1$, post-GWNN is capable of reducing the computation of pre-GWNN and WNN by 2$\times$ and $(2\tau')\times$ respectively, while maintaining better conditioning than these alternatives by choosing a small $\tau'$.
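The loss weighting in Eqn.(5) and the restart schedule of Alg.2 can be sketched as follows; this is our illustration, task_loss and k are placeholders, and the per-layer sum of Eqn.(5) is written for a single layer for brevity.

import numpy as np

def feature_loss(h, h_hat):
    """L_feat = 0.5 * || (h - E[h]) - h_hat ||^2, averaged over the minibatch."""
    centered = h - h.mean(axis=0)
    return 0.5 * np.mean(np.sum((centered - h_hat) ** 2, axis=1))

def blended_loss(task_loss, h, h_hat, t_w, k):
    """Eqn.(5) for one layer: the auxiliary term is annealed after each whitening update."""
    lam = t_w / k                                # lambda = t_w/k, so 1 - lambda = (k - t_w)/k
    return lam * task_loss + (1.0 - lam) * feature_loss(h, h_hat)

# After every whitening update (line 10 of Alg.2): t_w = 1, then t_w grows by 1 per
# iteration until it reaches k, at which point the auxiliary term vanishes.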
4. Empirical Studies

We compare WNN, pre-GWNN, and post-GWNN in the following respects: a) the number of iterations until training converges, b) the computation time for training, and c) the generalization capacity on various datasets. We also conduct ablation studies on 1) the effect of the number of samples N used to estimate the covariance matrix for pre-GWNN and 2) the effect of N' for post-GWNN. Finally, we study tuning the value of d'.

Datasets. We employ the following datasets.
a) MNIST (Lecun et al., 1998) has 60,000 28x28 images of 10 handwritten digits (0-9) for training and another 10,000 test images. 5,000 images from the training set are randomly selected as a validation set.
b) CIFAR-10 (Krizhevsky, 2009) consists of 50,000 32x32 color images for training and 10,000 images for testing. Each image is categorized into one of 10 object labels. For CIFAR-10, 5,000 images are chosen for validation.
c) CIFAR-100 (Krizhevsky, 2009) has the same number of images as CIFAR-10, but each image is classified into one of 100 categories. For CIFAR-100, we select 5,000 images from the training set for validation.
d) SVHN (Netzer et al., 2011) consists of color images of house numbers collected by Google Street View. The task is to predict the center digit (0-9) of each image, which is of size 32x32. There are 73,257 images in the training set, 26,032 images for test, and 531,131 additional examples. We follow (Sermanet et al., 2012) to build a validation set by selecting 400 samples per class from the training set and 200 samples per class from the additional set. We do not train on the validation set, which is used for tuning hyperparameters.

Experimental Settings. We have two settings, an unsupervised and a supervised learning setting. First, following (Desjardins et al., 2015), we compare the above three approaches on the task of minimizing the reconstruction error of an autoencoder on MNIST. The encoder consists of 4 fc sigmoidal layers, which have 1000, 500, 256, and 30 hidden neurons respectively. The decoder is symmetric and untied with respect to the encoder. Second, for the task of image classification on CIFAR-10, CIFAR-100, and SVHN, we employ the same network-in-network (NIN) (Lin et al., 2014) architecture, which has 9 convolutional layers and 3 pooling layers defined as^2: conv(192,5)-conv(160,1)-maxpool(3,2)-conv(96,1)-conv(192,5)-conv(192,1)-avgpool(3,2)-conv(192,1)-conv(192,5)-conv(192,1)-conv(l,1)-avgpool(8,8), where l = 10 for CIFAR-10 and SVHN and l = 100 for CIFAR-100. For all models, we use SGD with a momentum of 0.9.

^2 'conv', 'maxpool', and 'avgpool' represent convolution, max pooling, and average pooling respectively. Each convolutional layer is written as conv(number of filters, filter size), and each pooling layer as pool(kernel size, stride). All convolutions have stride 1.

4.1. Comparisons of Convergence and Computations

We record the number of epochs and the computation time when training WNN, pre-GWNN, and post-GWNN on MNIST and CIFAR-100, using the first setting above for MNIST and the second setting for CIFAR-100. For both settings, hyperparameters are chosen by grid search on the validation sets. The search ranges of the minibatch size, learning rate, and whitening interval τ are {64, 128, 256}, {0.1, 0.01, 0.001}, and {20, 50, 100, 10^3}, respectively. In particular, for WNN and pre-GWNN, the number of samples used to estimate the covariance matrix, N, is picked from {10^3, 10^4/2, 10^4}. For post-GWNN, N' is chosen to be the same as the minibatch size and the decay period is k = 0.1τ. For a fair comparison, we report the best performance on the validation set for each approach, and we do not employ any data augmentation such as random image cropping and flipping.

[Figure 4 plots: validation/test error of SGD, WNN, pre-GWNN, and post-GWNN versus training epochs and wall-clock time on CIFAR-100 (a,b) and MNIST (c,d); see caption below.]

Figure 4. Training of WNN, pre-, post-GWNN on CIFAR-100 (a,b) and MNIST (c,d). (a) and (b) plot the convergence and computation time on the validation and test set of CIFAR-100 respectively. (c) and (d) report corresponding results on MNIST.

The convergence and computation time are reported in Fig.4 (a-d). We make several important observations. First, all three approaches converge much faster than the canonical network trained by SGD. Second, pre- and post-GWNN achieve better convergence than WNN on both datasets, as shown in (a) and (c); moreover, post-GWNN outperforms pre-GWNN. Third, post-GWNN significantly reduces computation time compared to all the other methods, as illustrated in (b) and (d). We see that although WNN reduces the number of epochs, it takes a long time to train because its whitening transformation occupies a large amount of computation.

[Figure 5 plots: validation error on CIFAR-100 versus epochs for (a) pre-GWNN with N in {100, 1000, 3000, 5000, 10000} and (b) post-GWNN with N' in {128, 64, 1}.]

Figure 5. Training of pre- and post-GWNN on CIFAR-100. (a) visualizes the impact of different values of N for pre-GWNN, showing that performance degrades when N is small. (b) plots the impact of N' for post-GWNN, which is insensitive to small values of N'.

4.2. Performances on various Datasets

We evaluate WNN, pre-GWNN, and post-GWNN on the CIFAR-10, CIFAR-100, and SVHN datasets, and compare their classification accuracies to existing state-of-the-art methods. For all datasets and approaches, we utilize the same network structure as in the second setting above. For the two CIFAR datasets, we adopt a minibatch size of 64 and an initial learning rate of 0.1, which is halved after every 25 epochs; we train for 250 epochs. As SVHN is a large dataset, we train for 100 epochs with a minibatch size of 128 and an initial learning rate of 0.05, which is halved after every 10 epochs. We train on CIFAR-10 and CIFAR-100 both without and with data augmentation, which includes random cropping and horizontal flipping. For SVHN, we do not augment the data, following (Sermanet et al., 2012). For all methods, we shuffle samples at the beginning of every epoch. We use N = 10^4 for WNN and pre-GWNN and N' = 64 for post-GWNN. For both pre- and post-GWNN, we have M = N and d' = d/2. The other experimental settings are similar to Sec.4.1. Table 1 shows the results. We see that pre- and post-GWNN consistently achieve better results than WNN, and also outperform previous state-of-the-art approaches.

Table 1. Comparisons of test errors (%) on various datasets. The top two performances are highlighted for each dataset.

CIFAR-10, without augmentation: NIN (Lin et al., 2014) 10.47; NIN+ALP (Agostinelli et al., 2015) 9.59; Normalization Propagation (Arpit et al., 2016) 9.11; BatchNorm (Ioffe & Szegedy, 2015) 9.41; DSN (Lee et al., 2015) 9.69; Maxout (Goodfellow et al., 2013) 11.68; WNN (Desjardins et al., 2015) 9.87; pre-GWNN 9.34; post-GWNN 8.97.
CIFAR-10, with augmentation: NIN 8.81; NIN+ALP 7.51; Normalization Propagation 7.47; BatchNorm 7.25; DSN 7.97; Maxout 9.38; WNN 8.02; pre-GWNN 7.38; post-GWNN 7.10.
CIFAR-100, without augmentation: NIN 35.68; NIN+ALP 34.40; Normalization Propagation 32.19; BatchNorm 35.32; DSN 34.57; Maxout 38.57; WNN 34.78; pre-GWNN 32.08; post-GWNN 31.10.
CIFAR-100, with augmentation: NIN 33.37; NIN+ALP 30.83; Normalization Propagation 29.24; BatchNorm 30.26; DSN 32.16; WNN 30.02; pre-GWNN 28.78; post-GWNN 28.10.
SVHN: NIN 2.35; NIN+ALP 2.04; Normalization Propagation 1.88; BatchNorm 2.25; DSN 1.92; Maxout 2.47; WNN 1.93; pre-GWNN 1.82; post-GWNN 1.74.

4.3. Ablation Studies

The following experiments are conducted on CIFAR-100 using pre- or post-GWNN. The first two experiments follow the setting described in Sec.4.1. First, we evaluate the effect of the number of samples, N, used to estimate the covariance matrix in pre-GWNN. We compare the performance for different values of N picked from {10^2, 10^3, 3x10^3, 5x10^3, 10^4}. Fig.5 (a) plots the results. We see that performance can drop because of ill conditioning when N is small, e.g. N = 100. When it is too large, e.g. N = 10^4, we observe slight overfitting. Second, Fig.5 (b) highlights the effect of N' in post-GWNN. We see that post-GWNN works reasonably well even when N' is small.

Finally, instead of treating d' = d/2 as a constant during training, we study the effect of tuning its value on the validation set using a simple heuristic strategy. If the validation error decreases by more than 2% over 4 consecutive evaluations, we set d' = d' - rate x d'. If the error shows no reduction over this period, d' is increased by the same rate. We use post-GWNN and follow the experimental setting in Sec.4.2, taking two different rates {0.1, 0.2} as examples. Fig.6 plots the variations of the dimension when d = 192 and reports the corresponding test errors (28.57 for rate 0.1 and 29.79 for rate 0.2). We find that keeping d' constant generally produces better results than this adaptive strategy, but the strategy yields less runtime because more dimensions are pruned.

[Figure 6. Effect of tuning d': the retained dimension d' versus training epochs for rates 0.1 and 0.2, with d = 192.]
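A small sketch of the d'-adjustment heuristic described above; it is our illustration, the evaluation cadence and error-history handling are assumptions, and whether the 2% criterion is absolute or relative is left as stated in the text (the sketch uses an absolute threshold in percentage points).

def adjust_d_prime(d_prime, val_errors, rate=0.1, window=4, threshold=2.0):
    """val_errors: validation errors (in percent) at consecutive evaluations, most recent last."""
    if len(val_errors) < window + 1:
        return d_prime
    improvement = val_errors[-window - 1] - val_errors[-1]
    if improvement > threshold:
        d_prime = int(d_prime - rate * d_prime)      # error still dropping: prune dimensions
    elif improvement <= 0:
        d_prime = int(d_prime + rate * d_prime)      # no reduction: grow d' back
    return max(1, d_prime)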

5. Conclusion

We presented generalized WNN (GWNN) to reduce the runtime and improve the generalization of WNN. Unlike WNN, which reduces computation time by whitening with a large period and thereby suffers ill conditioning of the FIM, GWNN learns a compact internal representation, such that the SVD can be approximated by the top eigenvectors in an online manner; this allows GWNN not only to reduce computation but also to improve generalization. By exploiting knowledge of the hidden representation's distribution, we showed that post-GWNN is able to compute the covariance matrix in closed form, a result that can also be extended to other activation functions. Extensive experiments demonstrated the effectiveness of GWNN.

Acknowledgements

This work is partially supported by the National Natural Science Foundation of China (61503366, 61472410, U1613211), the National Key Research and Development Program of China (No.2016YFC1400700), the External Cooperation Program of BIC, Chinese Academy of Sciences (No.172644KYSB20160033), and the Science and Technology Planning Project of Guangdong Province (2015B010129013, 2014B050505017).

References

Agostinelli, Forest, Hoffman, Matthew, Sadowski, Peter, and Baldi, Pierre. Learning activation functions to improve deep neural networks. In ICLR, 2015.

Amari, Shun-ichi and Nagaoka, Hiroshi. Methods of information geometry. In Translations of Mathematical Monographs, 2000.

Arpit, Devansh, Zhou, Yingbo, Kota, Bhargava U., and Govindaraju, Venu. Normalization propagation: A parametric technique for removing internal covariate shift in deep networks. In ICML, 2016.

Desjardins, Guillaume, Simonyan, Karen, Pascanu, Razvan, and Kavukcuoglu, Koray. Natural neural networks. In NIPS, 2015.

Goodfellow, Ian J., Warde-Farley, David, Mirza, Mehdi, Courville, Aaron, and Bengio, Yoshua. Maxout networks. In ICML, 2013.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.

Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

Krizhevsky, Alex. Learning multiple layers of features from tiny images. In Technical Report, 2009.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.

Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, 1998.

LeCun, Yann, Bottou, Leon, Orr, Genevieve B., and Müller, Klaus Robert. Efficient backprop. In Neural Networks: Tricks of the Trade, 2002.

Lee, Chen-Yu, Xie, Saining, Gallagher, Patrick, Zhang, Zhengyou, and Tu, Zhuowen. Deeply-supervised nets. In AISTATS, 2015.

Lin, Min, Chen, Qiang, and Yan, Shuicheng. Network in network. In ICLR, 2014.

Maas, Andrew L., Hannun, Awni Y., and Ng, Andrew Y. Rectifier nonlinearities improve neural network acoustic models. In ICML, 2013.

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

Povey, Daniel, Zhang, Xiaohui, and Khudanpur, Sanjeev. Parallel training of dnns with natural gradient and parameter averaging. In ICLR Workshop, 2015.

Raiko, Tapani, Valpola, Harri, and LeCun, Yann. Deep learning made easier by linear transformations in perceptrons. In AISTATS, 2012.

Sermanet, Pierre, Chintala, Soumith, and LeCun, Yann. Convolutional neural networks applied to house numbers digit classification. In arXiv:1204.3968, 2012.

Shamir, Ohad. A stochastic pca and svd algorithm with an exponential convergence rate. In ICML, 2015.

Tieleman, Tijmen and Hinton, Geoffrey. RMSprop: Divide the gradient by a running average of its recent magnitude. In Neural Networks for Machine Learning (Lecture 6.5), 2012.

Zhang, Xiangyu, Zou, Jianhua, He, Kaiming, and Sun, Jian. Accelerating very deep convolutional networks for classification and detection. In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.