Optimal Whitening and Decorrelation


Agnan Kessy (a), Alex Lewin (b), and Korbinian Strimmer (c)

The American Statistician, 72:4, 309-314 (2018). ISSN: 0003-1305 (Print), 1537-2731 (Online). Journal homepage: https://www.tandfonline.com/loi/utas20
DOI: 10.1080/00031305.2016.1277159 (https://doi.org/10.1080/00031305.2016.1277159)
Accepted author version posted online: 19 Jan 2017. Published online: 26 Jan 2018.

To cite this article: Agnan Kessy, Alex Lewin & Korbinian Strimmer (2018) Optimal Whitening and Decorrelation, The American Statistician, 72:4, 309-314, DOI: 10.1080/00031305.2016.1277159

(a) Statistics Section, Department of Mathematics, Imperial College London, South Kensington Campus, London, United Kingdom; (b) Department of Mathematics, Brunel University London, Kingston Lane, Uxbridge, United Kingdom; (c) Epidemiology and Biostatistics, School of Public Health, Imperial College London, Norfolk Place, London, United Kingdom

ABSTRACT
Whitening, or sphering, is a common preprocessing step in statistical analysis to transform random variables to orthogonality. However, due to rotational freedom there are infinitely many possible whitening procedures. Consequently, there is a diverse range of sphering methods in use, for example, based on principal component analysis (PCA), Cholesky matrix decomposition, and zero-phase component analysis (ZCA), among others. Here, we provide an overview of the underlying theory and discuss five natural whitening procedures. Subsequently, we demonstrate that investigating the cross-covariance and the cross-correlation matrix between sphered and original variables allows us to break the rotational invariance and to identify optimal whitening transformations. As a result we recommend two particular approaches: ZCA-cor whitening to produce sphered variables that are maximally similar to the original variables, and PCA-cor whitening to obtain sphered variables that maximally compress the original variables.

ARTICLE HISTORY
Received December; Revised December

KEYWORDS
CAR score; CAT score; Cholesky decomposition; Decorrelation; Principal components analysis; Whitening; ZCA-Mahalanobis transformation

1. Introduction

Whitening, or sphering, is a linear transformation that converts a d-dimensional random vector x = (x_1, ..., x_d)^T with mean E(x) = μ = (μ_1, ..., μ_d)^T and positive definite d × d covariance matrix var(x) = Σ into a new random vector

    z = (z_1, ..., z_d)^T = W x    (1)

of the same dimension d and with unit diagonal "white" covariance var(z) = I. The square d × d matrix W is called the whitening matrix.

As orthogonality among random variables greatly simplifies multivariate data analysis, both from a computational and a statistical standpoint, whitening is a critically important tool, most often employed in preprocessing but also as part of modeling (e.g., Zuber and Strimmer 2009; Hao, Dong, and Fan 2015).

Whitening can be viewed as a generalization of standardizing a random variable, which is carried out by

    z = V^{-1/2} x,    (2)

where the matrix V = diag(σ_1^2, ..., σ_d^2) contains the variances var(x_i) = σ_i^2. This results in var(z_i) = 1 but does not remove correlations. Often, standardization and whitening transformations are also accompanied by mean-centering of x or z to ensure E(z) = 0, but this is not actually necessary for producing unit variances or a white covariance.

The whitening transformation defined in Equation (1) requires the choice of a suitable whitening matrix W. Since var(z) = I it follows that W Σ W^T = I and thus W(Σ W^T W) = W, which is fulfilled if W satisfies the condition

    W^T W = Σ^{-1}.    (3)

Unfortunately, however, this constraint does not uniquely determine the whitening matrix W. Quite the contrary: given Σ there are in fact infinitely many possible matrices W that all satisfy Equation (3), and each W leads to a whitening transformation that produces orthogonal but different sphered random variables.

This raises two important issues: first, how to best understand the differences among the various sphering transformations, and second, how to select an optimal whitening procedure for a particular situation. Here, we propose to address these questions by investigating the cross-covariance and cross-correlation matrix between z and x. As a result, we identify five natural whitening procedures, of which we recommend two particular approaches for general use.
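As a brief numerical illustration of Equations (1)-(3) (not part of the original article), the following NumPy sketch uses a hypothetical covariance matrix and one of the infinitely many valid whitening matrices, here the inverse of the lower-triangular Cholesky factor of Σ, to check that the constraint W^T W = Σ^{-1} indeed yields a white sample covariance. This is an editorial sketch, not a variant prescribed by the authors.

```python
import numpy as np

rng = np.random.default_rng(0)

# A positive definite covariance matrix Sigma (hypothetical example values).
Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])

# One of infinitely many valid whitening matrices: the inverse of the
# lower-triangular Cholesky factor L, where Sigma = L L^T.
L = np.linalg.cholesky(Sigma)
W = np.linalg.inv(L)

# The whitening constraint of Equation (3): W^T W = Sigma^{-1}.
assert np.allclose(W.T @ W, np.linalg.inv(Sigma))

# Apply z = W x to a large sample drawn with covariance Sigma; the sample
# covariance of z should be close to the identity ("white") matrix.
x = rng.multivariate_normal(mean=np.zeros(3), cov=Sigma, size=200_000)
z = x @ W.T
print(np.round(np.cov(z, rowvar=False), 2))   # approximately the 3x3 identity
```

Any other matrix satisfying Equation (3), for instance the symmetric inverse square root of Σ introduced in the next section, would whiten the data equally well, which is exactly the rotational freedom discussed below.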
2. Notation and Useful Identities

In the following, we will make use of a number of covariance matrix identities: the decomposition Σ = V^{1/2} P V^{1/2} of the covariance matrix into the correlation matrix P and the diagonal variance matrix V, the eigendecomposition of the covariance matrix Σ = U Λ U^T, and the eigendecomposition of the correlation matrix P = G Θ G^T, where U, G contain the eigenvectors and Λ, Θ the eigenvalues of Σ, P, respectively. We will frequently use Σ^{-1/2} = U Λ^{-1/2} U^T, the unique inverse matrix square root of Σ, as well as P^{-1/2} = G Θ^{-1/2} G^T, the unique inverse matrix square root of the correlation matrix.

Following the standard convention, we assume that the eigenvalues are sorted in order from largest to smallest value. In addition, we recall that by construction all eigenvectors are defined only up to a sign, that is, the columns of U and G can be multiplied by a factor of -1 and the resulting matrix is still valid. Indeed, using different numerical algorithms and software will often result in eigendecompositions with U and G showing diverse column signs.

3. Rotational Freedom in Whitening

The constraint of Equation (3) on the whitening matrix does not fully identify W but allows for rotational freedom. This becomes apparent by writing W in its polar decomposition

    W = Q_1 Σ^{-1/2},    (4)

where Q_1 is an orthogonal matrix with Q_1^T Q_1 = I_d. Clearly, W satisfies Equation (3) regardless of the choice of Q_1.

This implies a geometrical interpretation of whitening as a combination of multivariate rescaling by Σ^{-1/2} and rotation by Q_1. It also shows that all whitening matrices W have the same singular values Λ^{-1/2}, which follows from the singular value decomposition W = (Q_1 U) Λ^{-1/2} U^T with Q_1 U orthogonal. This highlights that the fundamental rescaling is via the square root of the eigenvalues, Λ^{-1/2}. Geometrically, the whitening transformation with W = Q_1 U Λ^{-1/2} U^T is a rotation U^T, followed by scaling, possibly followed by another rotation (depending on the choice of Q_1).

Since in many situations it is desirable to work with standardized variables V^{-1/2} x, another useful decomposition of W that also directly demonstrates the inherent rotational freedom is

    W = Q_2 P^{-1/2} V^{-1/2},    (5)

where Q_2 is a further orthogonal matrix with Q_2^T Q_2 = I_d. Evidently, this W also satisfies the constraint of Equation (3) regardless of the choice of Q_2.

In this view, with W = Q_2 G Θ^{-1/2} G^T V^{-1/2}, the variables are first scaled by the square root of the diagonal variance matrix, then rotated by G^T, then scaled again by the square root of the eigenvalues of the correlation matrix, and possibly rotated once more (depending on the choice of Q_2).

For the above two representations to result in the same whitening matrix W, two different rotations Q_1 and Q_2 are required. These are linked by Q_1 = Q_2 A, where the matrix A = P^{-1/2} V^{-1/2} Σ^{1/2} = P^{1/2} V^{1/2} Σ^{-1/2} is itself orthogonal. Since the eigendecompositions of the covariance and the correlation matrix are not readily related to each other, the matrix A can unfortunately not be further simplified.

4. Cross-Covariance and Cross-Correlation

The cross-covariance matrix Φ between z and x is given by

    Φ = (φ_ij) = cov(z, x) = cov(W x, x) = W Σ = Q_1 Σ^{1/2}.    (6)

Likewise, the cross-correlation matrix is

    Ψ = (ψ_ij) = cor(z, x) = Φ V^{-1/2} = Q_2 A Σ^{1/2} V^{-1/2} = Q_2 P^{1/2}.    (7)

Thus, we find that the rotational freedom inherent in W, which is represented by the matrices Q_1 and Q_2, is directly reflected in the corresponding cross-covariance Φ and cross-correlation Ψ between z and x. This provides the leverage that we will use to select and discriminate among whitening transformations by appropriately choosing or constraining Φ or Ψ.
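The rotational freedom of Equations (4)-(7) can be checked numerically. The short sketch below (again an editorial illustration assuming NumPy, with the same hypothetical Σ as above, not code from the article) draws a random orthogonal matrix Q_1, forms W = Q_1 Σ^{-1/2}, and verifies that the whitening constraint still holds, that the cross-covariance equals Q_1 Σ^{1/2} as in Equation (6), and that the rotation Q_2 recovered from Ψ = Q_2 P^{1/2} is orthogonal as in Equation (7).

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3

# Hypothetical covariance matrix and its symmetric square roots.
Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])
lam, U = np.linalg.eigh(Sigma)
Sigma_sqrt = U @ np.diag(lam ** 0.5) @ U.T        # Sigma^{1/2}
Sigma_inv_sqrt = U @ np.diag(lam ** -0.5) @ U.T   # Sigma^{-1/2}

# Variance matrix V and correlation matrix P from Sigma = V^{1/2} P V^{1/2}.
v = np.diag(Sigma)
P = Sigma / np.sqrt(np.outer(v, v))

# A random orthogonal matrix Q1 (QR decomposition of a Gaussian matrix).
Q1, _ = np.linalg.qr(rng.normal(size=(d, d)))

# Polar-decomposition form of Equation (4): W = Q1 Sigma^{-1/2}.
W = Q1 @ Sigma_inv_sqrt

# W still satisfies the whitening constraint W^T W = Sigma^{-1} ...
assert np.allclose(W.T @ W, np.linalg.inv(Sigma))

# ... and the cross-covariance of Equation (6) is Phi = W Sigma = Q1 Sigma^{1/2}.
Phi = W @ Sigma
assert np.allclose(Phi, Q1 @ Sigma_sqrt)

# Cross-correlation of Equation (7): Psi = Phi V^{-1/2} = Q2 P^{1/2}.
Psi = Phi @ np.diag(v ** -0.5)
theta, G = np.linalg.eigh(P)
P_sqrt = G @ np.diag(theta ** 0.5) @ G.T
Q2 = Psi @ np.linalg.inv(P_sqrt)                  # recover Q2 from Psi
assert np.allclose(Q2.T @ Q2, np.eye(d))          # Q2 is orthogonal
```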
As can be seen from Equations (6) and (7), both Φ and Ψ are in general not symmetric, unless Q_1 = I or Q_2 = I, respectively. Note that the diagonal elements of the cross-correlation matrix Ψ need not be equal to 1.

Furthermore, since x = W^{-1} z, each x_j is perfectly explained by a linear combination of the uncorrelated z_1, ..., z_d, and hence the squared multiple correlation between x_j and z equals 1. Thus, the column sum over the squared cross-correlations, ∑_{i=1}^{d} ψ_ij^2, is always 1. In matrix notation, diag(Ψ^T Ψ) = diag(P^{1/2} Q_2^T Q_2 P^{1/2}) = diag(P) = (1, ..., 1)^T. In contrast, the row sum over the squared cross-correlations, ∑_{j=1}^{d} ψ_ij^2, varies for different whitening procedures and is, as we will see below, highly informative for choosing relevant transformations.

5. Five Natural Whitening Procedures

In practical application of whitening, there are a handful of sphering procedures that are most commonly used (e.g., Li and Zhang 1998). Accordingly, in Table 1 we describe the properties of five whitening transformations, listing the respective sphering matrix W, the associated rotation matrices Q_1 and Q_2, and the resulting cross-covariances Φ and cross-correlations Ψ. All five methods are natural whitening procedures arising from specific constraints on Φ or Ψ, as we will show further below.

The ZCA whitening transformation employs the sphering matrix

    W^{ZCA} = Σ^{-1/2},    (8)

where ZCA stands for "zero-phase components analysis" (Bell and Sejnowski 1997). This procedure is also known as Mahalanobis whitening. With Q_1 = I it is the unique sphering method
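As a closing numerical illustration of the column-sum property and of the ZCA-Mahalanobis transformation in Equation (8) (an editorial sketch assuming NumPy, with the same hypothetical Σ as above, not code from the article), the following checks that the column sums of the squared cross-correlations equal 1 both for W^{ZCA} = Σ^{-1/2} and for a randomly rotated whitening matrix, while the row sums generally differ between the two.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3

# Hypothetical covariance matrix (same values as in the earlier sketches).
Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])

# ZCA / Mahalanobis whitening of Equation (8): W_ZCA = Sigma^{-1/2}.
lam, U = np.linalg.eigh(Sigma)
W_zca = U @ np.diag(lam ** -0.5) @ U.T

# Cross-correlation matrix Psi = W Sigma V^{-1/2}.
v = np.diag(Sigma)
Psi = W_zca @ Sigma @ np.diag(v ** -0.5)

# Column sums of the squared cross-correlations, i.e., diag(Psi^T Psi),
# are always (1, ..., 1)^T for any valid whitening matrix.
print(np.round((Psi ** 2).sum(axis=0), 6))        # -> [1. 1. 1.]

# A rotated whitening matrix W = Q1 Sigma^{-1/2} gives the same column sums
# but, in general, different row sums of the squared cross-correlations.
Q1, _ = np.linalg.qr(rng.normal(size=(d, d)))
Psi_rot = (Q1 @ W_zca) @ Sigma @ np.diag(v ** -0.5)
print(np.round((Psi_rot ** 2).sum(axis=0), 6))    # still [1. 1. 1.]
print(np.round((Psi ** 2).sum(axis=1), 3))        # row sums for ZCA
print(np.round((Psi_rot ** 2).sum(axis=1), 3))    # row sums after rotation
```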