The American Statistician

ISSN: 0003-1305 (Print) 1537-2731 (Online) Journal homepage: https://www.tandfonline.com/loi/utas20

Optimal Whitening and Decorrelation

Agnan Kessy, Alex Lewin & Korbinian Strimmer

To cite this article: Agnan Kessy, Alex Lewin & Korbinian Strimmer (2018) Optimal Whitening and Decorrelation, The American Statistician, 72:4, 309–314, DOI: 10.1080/00031305.2016.1277159

To link to this article: https://doi.org/10.1080/00031305.2016.1277159

Accepted author version posted online: 19 Jan 2017. Published online: 26 Jan 2018.


Optimal Whitening and Decorrelation

Agnan Kessy^a, Alex Lewin^b, and Korbinian Strimmer^c

^a Statistics Section, Department of Mathematics, Imperial College London, South Kensington Campus, London, United Kingdom; ^b Department of Mathematics, Brunel University London, Kingston Lane, Uxbridge, United Kingdom; ^c Epidemiology and Biostatistics, School of Public Health, Imperial College London, Norfolk Place, London, United Kingdom

ABSTRACT
Whitening, or sphering, is a common preprocessing step in statistical analysis to transform random variables to orthogonality. However, due to rotational freedom there are infinitely many possible whitening procedures. Consequently, there is a diverse range of sphering methods in use, for example, based on principal component analysis (PCA), Cholesky matrix decomposition, and zero-phase component analysis (ZCA), among others. Here, we provide an overview of the underlying theory and discuss five natural whitening procedures. Subsequently, we demonstrate that investigating the cross-covariance and the cross-correlation matrix between sphered and original variables allows us to break the rotational invariance and to identify optimal whitening transformations. As a result we recommend two particular approaches: ZCA-cor whitening to produce sphered variables that are maximally similar to the original variables, and PCA-cor whitening to obtain sphered variables that maximally compress the original variables.

ARTICLE HISTORY: Received December; Revised December

KEYWORDS: CAR score; CAT score; Decorrelation; Principal components analysis; Whitening; ZCA-Mahalanobis transformation

1. Introduction

Whitening, or sphering, is a linear transformation that converts a d-dimensional random vector x = (x_1, ..., x_d)^T with mean E(x) = μ = (μ_1, ..., μ_d)^T and positive definite d × d covariance matrix var(x) = Σ into a new random vector

    z = (z_1, ..., z_d)^T = W x                                    (1)

of the same dimension d and with unit diagonal "white" covariance var(z) = I. The square d × d matrix W is called the whitening matrix. As orthogonality among random variables greatly simplifies multivariate data analysis both from a computational and a statistical standpoint, whitening is a critically important tool, most often employed in preprocessing but also as part of modeling (e.g., Zuber and Strimmer 2009; Hao, Dong, and Fan 2015).

Whitening can be viewed as a generalization of standardizing a random variable, which is carried out by

    z = V^{-1/2} x,                                                (2)

where the diagonal matrix V = diag(σ_1^2, ..., σ_d^2) contains the variances var(x_i) = σ_i^2. This results in var(z_i) = 1 but it does not remove correlations. Often, standardization and whitening transformations are also accompanied by mean-centering of x or z to ensure E(z) = 0, but this is not actually necessary for producing unit variances or a white covariance.

The whitening transformation defined in Equation (1) requires the choice of a suitable whitening matrix W. Since var(z) = I it follows that W Σ W^T = I and thus W (Σ W^T W) = W, which is fulfilled if W satisfies the condition

    W^T W = Σ^{-1}.                                                (3)

However, unfortunately, this constraint does not uniquely determine the whitening matrix W. Quite the contrary: given Σ there are in fact infinitely many possible matrices W that all satisfy Equation (3), and each W leads to a whitening transformation that produces orthogonal but different sphered random variables.

This raises two important issues: first, how to best understand the differences among the various sphering transformations, and second, how to select an optimal whitening procedure for a particular situation. Here, we propose to address these questions by investigating the cross-covariance and cross-correlation matrix between z and x. As a result, we identify five natural whitening procedures, of which we recommend two particular approaches for general use.

2. Notation and Useful Identities

In the following, we will make use of a number of covariance matrix identities: the decomposition Σ = V^{1/2} P V^{1/2} of the covariance matrix into the correlation matrix P and the diagonal variance matrix V; the eigendecomposition of the covariance matrix Σ = U Λ U^T; and the eigendecomposition of the correlation matrix P = G Θ G^T, where U, G contain the eigenvectors and Λ, Θ the eigenvalues of Σ, P, respectively. We will frequently use Σ^{-1/2} = U Λ^{-1/2} U^T, the unique inverse matrix square root of Σ, as well as P^{-1/2} = G Θ^{-1/2} G^T, the unique inverse matrix square root of the correlation matrix.
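These identities translate directly into code. As a minimal numerical sketch (in Python with numpy; the example covariance matrix and the function name are ours, not from the paper), the unique inverse matrix square root Σ^{-1/2} = U Λ^{-1/2} U^T can be formed from the eigendecomposition of Σ, and one can verify that it satisfies the whitening constraint of Equation (3):

```python
import numpy as np

def inv_sqrt_psd(S):
    """Unique symmetric inverse square root of a positive definite matrix S."""
    lam, U = np.linalg.eigh(S)               # S = U diag(lam) U^T
    return U @ np.diag(lam ** -0.5) @ U.T

# illustrative 3x3 covariance matrix (not from the paper)
Sigma = np.array([[4.0, 2.0, 1.0],
                  [2.0, 3.0, 0.5],
                  [1.0, 0.5, 2.0]])

W = inv_sqrt_psd(Sigma)                      # ZCA/Mahalanobis whitening matrix

# Equation (3): W^T W = Sigma^{-1}
assert np.allclose(W.T @ W, np.linalg.inv(Sigma))
# whitened covariance: W Sigma W^T = I
assert np.allclose(W @ Sigma @ W.T, np.eye(3))
```

Note that numpy's eigh returns eigenvalues in ascending rather than the descending order assumed in the text; for forming the symmetric square root the order is immaterial, since the same U appears on both sides.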

CONTACT: Korbinian Strimmer, [email protected], Epidemiology and Biostatistics, School of Public Health, Imperial College London, Norfolk Place, London W PG, United Kingdom.

Following the standard convention, we assume that the eigenvalues are sorted in order from largest to smallest value. In addition, we recall that by construction all eigenvectors are defined only up to a sign, that is, the columns of U and G can be multiplied with a factor of −1 and the resulting matrix is still valid. Indeed, using different numerical algorithms and software will often result in eigendecompositions with U and G showing diverse column signs.

3. Rotational Freedom in Whitening

The constraint Equation (3) on the whitening matrix does not fully identify W but allows for rotational freedom. This becomes apparent by writing W in its polar decomposition

    W = Q_1 Σ^{-1/2},                                              (4)

where Q_1 is an orthogonal matrix with Q_1^T Q_1 = I_d. Clearly, W satisfies Equation (3) regardless of the choice of Q_1.

This implies a geometrical interpretation of whitening as a combination of multivariate rescaling by Σ^{-1/2} and rotation by Q_1. It also shows that all whitening matrices W have the same singular values Λ^{-1/2}, which follows from the singular value decomposition W = (Q_1 U) Λ^{-1/2} U^T with Q_1 U orthogonal. This highlights that the fundamental rescaling is via the square root of the eigenvalues, Λ^{-1/2}. Geometrically, the whitening transformation with W = Q_1 U Λ^{-1/2} U^T is a rotation U^T followed by scaling, possibly followed by another rotation (depending on the choice of Q_1).

Since in many situations it is desirable to work with standardized variables V^{-1/2} x, another useful decomposition of W that also directly demonstrates the inherent rotational freedom is

    W = Q_2 P^{-1/2} V^{-1/2},                                     (5)

where Q_2 is a further orthogonal matrix with Q_2^T Q_2 = I_d. Evidently, this W also satisfies the constraint of Equation (3) regardless of the choice of Q_2.

In this view, with W = Q_2 G Θ^{-1/2} G^T V^{-1/2}, the variables are first scaled by the square root of the diagonal variance matrix, then rotated by G^T, then scaled again by the square root of the eigenvalues of the correlation matrix, and possibly rotated once more (depending on the choice of Q_2).

For the above two representations to result in the same whitening matrix W, two different rotations Q_1 and Q_2 are required. These are linked by Q_1 = Q_2 A, where the matrix A = P^{-1/2} V^{-1/2} Σ^{1/2} = P^{1/2} V^{1/2} Σ^{-1/2} is itself orthogonal. Since the eigendecompositions of the covariance and the correlation matrix are not readily related to each other, the matrix A can unfortunately not be further simplified.

4. Cross-Covariance and Cross-Correlation

For studying the properties of the different whitening procedures, we will now focus on two particularly useful quantities, namely, the cross-covariance and cross-correlation matrix between the whitened vector z and the original random vector x. As it turns out, these are closely linked to the rotation matrices Q_1 and Q_2 encountered in the above two decompositions of W (Equation (4) and Equation (5)).

The cross-covariance matrix Φ between z and x is given by

    Φ = (φ_ij) = cov(z, x) = cov(W x, x) = W Σ = Q_1 Σ^{1/2}.      (6)

Likewise, the cross-correlation matrix is

    Ψ = (ψ_ij) = cor(z, x) = Φ V^{-1/2} = Q_2 A Σ^{1/2} V^{-1/2} = Q_2 P^{1/2}.   (7)

Thus, we find that the rotational freedom inherent in W, which is represented by the matrices Q_1 and Q_2, is directly reflected in the corresponding cross-covariance Φ and cross-correlation Ψ between z and x. This provides the leverage that we will use to select and discriminate among whitening transformations by appropriately choosing or constraining Φ or Ψ.

As can be seen from Equation (6) and Equation (7), both Φ and Ψ are in general not symmetric, unless Q_1 = I or Q_2 = I, respectively. Note that the diagonal elements of the cross-correlation matrix Ψ need not be equal to 1.

Furthermore, since x = W^{-1} z, each x_j is perfectly explained by a linear combination of the uncorrelated z_1, ..., z_d, and hence the squared multiple correlation between x_j and z equals 1. Thus, the column sum over the squared cross-correlations, ∑_{i=1}^d ψ_ij^2, is always 1. In matrix notation, diag(Ψ^T Ψ) = diag(P^{1/2} Q_2^T Q_2 P^{1/2}) = diag(P) = (1, ..., 1)^T. In contrast, the row sum over the squared cross-correlations, ∑_{j=1}^d ψ_ij^2, varies for different whitening procedures, and is, as we will see below, highly informative for choosing relevant transformations.

5. Five Natural Whitening Procedures

In practical application of whitening, there are a handful of sphering procedures that are most commonly used (e.g., Li and Zhang 1998). Accordingly, in Table 1 we describe the properties of five whitening transformations, listing the respective sphering matrix W, the associated rotation matrices Q_1 and Q_2, and the resulting cross-covariances Φ and cross-correlations Ψ. All five methods are natural whitening procedures arising from specific constraints on Φ or Ψ, as we will show further below.

Table 1. Five natural whitening transformations and their properties.

    Method     Sphering matrix W        Cross-covariance Φ    Cross-correlation Ψ     Rotation Q_1     Rotation Q_2
    ZCA        Σ^{-1/2}                 Σ^{1/2}               Σ^{1/2} V^{-1/2}        I                A^T
    PCA        Λ^{-1/2} U^T             Λ^{1/2} U^T           Λ^{1/2} U^T V^{-1/2}    U^T              U^T A^T
    Cholesky   L^T                      L^T Σ                 L^T Σ V^{-1/2}          L^T Σ^{1/2}      L^T V^{1/2} P^{1/2}
    ZCA-cor    P^{-1/2} V^{-1/2}        P^{1/2} V^{1/2}       P^{1/2}                 A                I
    PCA-cor    Θ^{-1/2} G^T V^{-1/2}    Θ^{1/2} G^T V^{1/2}   Θ^{1/2} G^T             G^T A            G^T

The ZCA whitening transformation employs the sphering matrix

    W^ZCA = Σ^{-1/2},                                              (8)

where ZCA stands for "zero-phase components analysis" (Bell and Sejnowski 1997). This procedure is also known as Mahalanobis whitening. With Q_1 = I it is the unique sphering method with a symmetric whitening matrix.

PCA whitening is based on scaled principal component analysis (PCA) and uses

    W^PCA = Λ^{-1/2} U^T                                           (9)

(e.g., Friedman 1987). This transformation first rotates the variables using the eigenmatrix U of the covariance Σ, as is done in standard PCA. This results in orthogonal components, but in general with different variances. To achieve whitened data, the rotated variables are then scaled by the square root of the eigenvalues, Λ^{-1/2}. PCA whitening is probably the most widely applied whitening procedure due to its connection with PCA.

It can be seen that the PCA and ZCA whitening transformations are related by a rotation U, so ZCA whitening can be interpreted as rotation followed by scaling followed by the rotation U back to the original coordinate system. The ZCA and the PCA sphering methods both naturally follow the polar decomposition of Equation (4), with Q_1 equal to I and U^T, respectively.

Due to the sign ambiguity of the eigenvectors U, the PCA whitening matrix given by Equation (9) is still not unique. However, adjusting column signs in U such that diag(U) > 0, that is, that all diagonal elements are positive, results in the unique PCA whitening transformation with positive diagonal cross-covariance Φ and cross-correlation Ψ (see Table 1).

Another widely known procedure is Cholesky whitening, which is based on the Cholesky factorization of the precision matrix, L L^T = Σ^{-1}. This leads to the sphering matrix

    W^Chol = L^T,                                                  (10)

where L is the unique lower triangular matrix with positive diagonal values. The same matrix L can also be obtained from a QR decomposition of W^ZCA = (Σ^{1/2} L) L^T.

A further approach is the ZCA-cor whitening transformation, which is used, for example, in the CAT (correlation-adjusted t-score) and CAR (correlation-adjusted marginal correlation) variable importance and variable selection statistics (Zuber and Strimmer 2009; Ahdesmäki and Strimmer 2010; Zuber and Strimmer 2011; Zuber, Duarte Silva, and Strimmer 2012). ZCA-cor whitening employs

    W^ZCA-cor = P^{-1/2} V^{-1/2}                                  (11)

as its sphering matrix. It arises from first standardizing the random variable by multiplication with V^{-1/2} and subsequently employing ZCA whitening based on the correlation rather than the covariance matrix. The resulting whitening matrix W^ZCA-cor differs from W^ZCA, and unlike the latter it is in general asymmetric.

In a similar fashion, PCA-cor whitening is conducted by applying PCA whitening to standardized variables. This approach uses

    W^PCA-cor = Θ^{-1/2} G^T V^{-1/2}                              (12)

as its sphering matrix. Here, the standardized variables are rotated by the eigenmatrix of the correlation matrix, followed by scaling using the correlation eigenvalues. Note that W^PCA-cor differs from W^PCA.

PCA-cor whitening has the same relation to the ZCA-cor transformation as does PCA whitening to the ZCA transformation. Specifically, ZCA-cor whitening can be interpreted as PCA-cor whitening followed by a rotation G back to the frame of the standardized variables. Both the ZCA-cor and the PCA-cor transformation naturally follow the decomposition of Equation (5), with Q_2 equal to I and G^T, respectively.

Similarly as in PCA whitening, the PCA-cor whitening matrix given by Equation (12) is subject to sign ambiguity of the eigenvectors in G. As above, setting diag(G) > 0 leads to the unique PCA-cor whitening transformation with positive diagonal cross-covariance Φ and cross-correlation Ψ (see Table 1).

Finally, we may also apply the Cholesky whitening transformation to standardized variables. However, this does not lead to a new whitening procedure, as the resulting sphering matrix remains identical to W^Chol: the Cholesky factor of the inverse correlation matrix P^{-1} is V^{1/2} L, and therefore W^Chol-cor = (V^{1/2} L)^T V^{-1/2} = L^T = W^Chol.

6. Optimal Whitening

We now demonstrate how an optimal sphering matrix W, and hence an optimal whitening approach, can be identified by evaluating suitable objective functions computed from the cross-covariance Φ and cross-correlation Ψ. Intriguingly, for each of the five natural whitening transforms listed in Table 1, we find a corresponding optimality criterion.

6.1. ZCA-Mahalanobis Whitening

In many applications of whitening, it is desirable to remove correlations with minimal additional adjustment, with the aim that the transformed variable z remains as similar as possible to the original vector x.

One possible implementation of this idea is to find the whitening transformation that minimizes the total squared distance between the original and whitened variables (e.g., Eldar and Oppenheim 2003). Using mean-centered random vectors z_c and x_c with E(z_c) = 0 and E(x_c) = 0, this least-squares objective can be expressed as

    E[(z_c − x_c)^T (z_c − x_c)] = tr(I) − 2 tr(E[z_c x_c^T]) + tr(Σ)
                                 = d − 2 tr(Φ) + tr(V).            (13)

Since the dimension d and the sum of the variances tr(V) = ∑_{i=1}^d σ_i^2 do not depend on the whitening matrix W, minimizing Equation (13) is equivalent to maximizing the trace of the cross-covariance matrix

    tr(Φ) = ∑_{i=1}^d cov(z_i, x_i) = tr(Q_1 Σ^{1/2}) ≡ g_1(Q_1).  (14)

Proposition 1. Maximization of g_1(Q_1) uniquely determines the optimal whitening matrix to be the symmetric sphering matrix W^ZCA.

Proof. g_1(Q_1) = tr(Q_1 U Λ^{1/2} U^T) = tr(Λ^{1/2} U^T Q_1 U) ≡ tr(Λ^{1/2} B) = ∑_i Λ_ii^{1/2} B_ii, since Λ is diagonal. As Q_1 and U are both orthogonal, B ≡ U^T Q_1 U is also orthogonal. This implies diagonal entries B_ii ≤ 1, with equality for all i occurring only if B = I; hence the maximum of g_1(Q_1) is assumed at B = I, or equivalently at Q_1 = I. From Equation (4), it follows that the corresponding optimal sphering matrix is W = Σ^{-1/2} = W^ZCA. □
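The five sphering matrices of Table 1 and the optimality property of Proposition 1 are easy to check numerically. In the sketch below (numpy; the example covariance matrix, the dictionary layout, and the helper eig_desc are ours, not from the paper), all five matrices satisfy the constraint of Equation (3), and ZCA whitening attains the largest tr(Φ) = tr(W Σ):

```python
import numpy as np

# illustrative covariance matrix (not from the paper)
Sigma = np.array([[4.0, 2.0, 1.0],
                  [2.0, 3.0, 0.5],
                  [1.0, 0.5, 2.0]])

V_inv_sqrt = np.diag(np.diag(Sigma) ** -0.5)
P = V_inv_sqrt @ Sigma @ V_inv_sqrt              # correlation matrix

def eig_desc(S):
    """Eigendecomposition with eigenvalues in decreasing order and
    column signs fixed so that diag(U) > 0 (the paper's convention)."""
    lam, U = np.linalg.eigh(S)
    lam, U = lam[::-1], U[:, ::-1]
    return lam, U * np.sign(np.diag(U))

lam, U = eig_desc(Sigma)
theta, G = eig_desc(P)
L = np.linalg.cholesky(np.linalg.inv(Sigma))     # L L^T = Sigma^{-1}

W = {
    "ZCA":      U @ np.diag(lam ** -0.5) @ U.T,                  # Eq. (8)
    "PCA":      np.diag(lam ** -0.5) @ U.T,                      # Eq. (9)
    "Cholesky": L.T,                                             # Eq. (10)
    "ZCA-cor":  G @ np.diag(theta ** -0.5) @ G.T @ V_inv_sqrt,   # Eq. (11)
    "PCA-cor":  np.diag(theta ** -0.5) @ G.T @ V_inv_sqrt,       # Eq. (12)
}

# every W satisfies the whitening constraint (Eq. 3)
for w in W.values():
    assert np.allclose(w @ Sigma @ w.T, np.eye(3))

# Proposition 1: ZCA maximizes tr(Phi) = tr(W Sigma)
tr_phi = {name: np.trace(w @ Sigma) for name, w in W.items()}
assert max(tr_phi, key=tr_phi.get) == "ZCA"
```

Since the eigendecompositions are exact up to floating-point error, the ZCA matrix also comes out symmetric, as the text states.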

For related proofs, see also Johnson (1966), Genizi (1993, p. 412), and Garthwaite et al. (2012, p. 789).

As a result, we find that ZCA-Mahalanobis whitening is the unique procedure that maximizes the average cross-covariance between each component of the whitened vector z and the original vector x. Furthermore, with Q_1 = I it is also the unique whitening procedure with a symmetric cross-covariance matrix Φ.

6.2. ZCA-cor Whitening

In the optimization using Equation (13), the underlying similarity measure is the cross-covariance between the whitened and original random variables. This results in an optimality criterion that depends on the variances and hence on the scale of the original variables. An alternative scale-invariant objective can be constructed by comparing the centered whitened variable z_c with the centered standardized vector V^{-1/2} x_c. This leads to the minimization of

    E[(z_c − V^{-1/2} x_c)^T (z_c − V^{-1/2} x_c)] = 2d − 2 tr(Ψ). (15)

Equivalently, we can maximize instead the trace of the cross-correlation matrix

    tr(Ψ) = ∑_{i=1}^d cor(z_i, x_i) = tr(Q_2 P^{1/2}) ≡ g_2(Q_2).  (16)

Proposition 2. Maximization of g_2(Q_2) uniquely determines the whitening matrix to be the asymmetric sphering matrix W^ZCA-cor.

Proof. Completely analogous to Proposition 1, we can write g_2(Q_2) = tr(Q_2 G Θ^{1/2} G^T) = ∑_i Θ_ii^{1/2} C_ii, where C ≡ G^T Q_2 G is orthogonal. By the same argument as before it follows that Q_2 = I maximizes g_2(Q_2). From Equation (5) it follows that W = P^{-1/2} V^{-1/2} = W^ZCA-cor. □

As a result, we identify ZCA-cor whitening as the unique procedure that ensures that the components of the whitened vector z remain maximally correlated with the corresponding components of the original variables x. In addition, with Q_2 = I it is also the unique whitening transformation exhibiting a symmetric cross-correlation matrix Ψ.

6.3. PCA Whitening

Another frequent aim in whitening is the generation of new uncorrelated variables z that are useful for dimension reduction and data compression. In other words, we would like to construct components z_1, ..., z_d such that the first few components in z represent as much as possible of the variation present in all the original variables x_1, ..., x_d.

One way to formalize this is to use the row sum of squared cross-covariances, φ_i = ∑_{j=1}^d φ_ij^2 = ∑_{j=1}^d cov(z_i, x_j)^2, between each individual z_i and all x_j as a measure of how effectively each z_i integrates, or compresses, the original variables. Note that here, unlike in ZCA-Mahalanobis whitening, the objective function links each component in z simultaneously with all components in x. In vector notation the φ_i can be more elegantly written as

    (φ_1, ..., φ_d)^T = diag(Φ Φ^T) = diag(Q_1 Σ Q_1^T) ≡ h_1(Q_1).   (17)

Our aim is to find a whitened vector z such that the φ_i are maximized with φ_i ≥ φ_{i+1}.

Proposition 3. Maximization of h_1(Q_1) subject to monotonically decreasing φ_i is achieved by the whitening matrix W^PCA.

Proof. The vector h_1(Q_1) can be written as diag(Q_1 Σ Q_1^T) = diag(Q_1 U Λ U^T Q_1^T). Setting Q_1 = U^T we arrive at h_1(U^T) = diag(Λ), that is, for this choice the φ_i corresponding to each component z_i are equal to the corresponding eigenvalues of Σ. As the eigenvalues are already sorted in decreasing order, we find (see Table 1) that whitening with W^PCA leads to a sphered variable z with monotonically decreasing φ_i. For general Q_1, the ith element of h_1(Q_1) is ∑_j Λ_jj D_ij^2, where D ≡ Q_1 U is orthogonal. This is maximized when D = I, or equivalently, Q_1 = U^T. □

As a result, PCA whitening is singled out as the unique sphering procedure that maximizes the integration, or compression, of all components of the original vector x in each component of the sphered vector z based on the cross-covariance Φ as underlying measure. Thus, the fundamental property of PCA that principal components are optimally ordered with respect to dimension reduction (Jolliffe 2002) carries over also to PCA whitening.

6.4. PCA-cor Whitening

For reasons of scale-invariance, we prefer to optimize cross-correlations rather than cross-covariances for whitening with compression in mind. This leads to the row sum of squared cross-correlations, ψ_i = ∑_{j=1}^d ψ_ij^2 = ∑_{j=1}^d cor(z_i, x_j)^2, as measure of integration and compression, and correspondingly to the objective function

    (ψ_1, ..., ψ_d)^T = diag(Ψ Ψ^T) = diag(Q_2 P Q_2^T) ≡ h_2(Q_2).   (18)

Proposition 4. Maximization of h_2(Q_2) subject to monotonically decreasing ψ_i is achieved by using W^PCA-cor as the sphering matrix.

Proof. Analogous to the proof of Proposition 3, we find Q_2 = G^T to yield optimal and decreasing ψ_i, and with Equation (5) we arrive at W = Θ^{-1/2} G^T V^{-1/2} = W^PCA-cor. □

Hence, the PCA-cor whitening transformation is the unique transformation that maximizes the integration, or compression, of all components of the original vector x in each component of the sphered vector z employing the cross-correlation Ψ as underlying measure.

6.5. Cholesky Whitening

Finally, we investigate the connection between Cholesky whitening and corresponding characteristics of the cross-covariance and cross-correlation matrices. Unlike the other four whitening methods listed in Table 1, which result from optimization, Cholesky whitening is due to a symmetry constraint.
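Propositions 3 and 4 can likewise be verified numerically: for PCA whitening, diag(Φ Φ^T) reproduces the eigenvalues of Σ in decreasing order, and for PCA-cor whitening, diag(Ψ Ψ^T) reproduces the eigenvalues of P. A short sketch (numpy; the example covariance matrix is ours, not from the paper):

```python
import numpy as np

# illustrative covariance matrix (not from the paper)
Sigma = np.array([[4.0, 2.0, 1.0],
                  [2.0, 3.0, 0.5],
                  [1.0, 0.5, 2.0]])
V_inv_sqrt = np.diag(np.diag(Sigma) ** -0.5)
P = V_inv_sqrt @ Sigma @ V_inv_sqrt

def eig_desc(S):
    lam, U = np.linalg.eigh(S)
    return lam[::-1], U[:, ::-1]          # decreasing eigenvalues

lam, U = eig_desc(Sigma)
theta, G = eig_desc(P)

# Proposition 3: for PCA whitening, diag(Phi Phi^T) equals the eigenvalues
# of Sigma, already sorted in decreasing order (Eq. 17)
Phi_pca = np.diag(lam ** -0.5) @ U.T @ Sigma       # Phi = W Sigma
phi = np.diag(Phi_pca @ Phi_pca.T)
assert np.allclose(phi, lam)

# Proposition 4: for PCA-cor whitening, diag(Psi Psi^T) equals the
# eigenvalues of P, likewise decreasing (Eq. 18)
W_pcacor = np.diag(theta ** -0.5) @ G.T @ V_inv_sqrt
Psi = W_pcacor @ Sigma @ V_inv_sqrt                # Psi = Phi V^{-1/2}
psi = np.diag(Psi @ Psi.T)
assert np.allclose(psi, theta)
```

Both checks follow the algebra of the proofs: Φ^PCA = Λ^{1/2} U^T gives Φ Φ^T = Λ, and Ψ^PCA-cor = Θ^{1/2} G^T gives Ψ Ψ^T = Θ.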

Specifically, the whitening matrix W^Chol leads to a cross-covariance matrix Φ that is lower triangular with positive diagonal elements, as well as to a cross-correlation matrix Ψ with the same properties. This is a consequence of the Cholesky factorization, with L being subject to the same constraint. Crucially, as L is unique, the converse argument is valid as well, and hence Cholesky whitening is the unique whitening procedure that results from lower-triangular positive diagonal cross-covariance and cross-correlation matrices.

A consequence of using Cholesky factorization for whitening is that we implicitly assume an ordering of the variables. This can be useful specifically in time course analysis to account for auto-correlation (see Pourahmadi 2011, and references therein).

7. Application

7.1. Data-Based Whitening

In the sections above, we have discussed the theoretical background of whitening in terms of random variables x and z and using the population covariance Σ to guide the construction of a suitable sphering matrix W.

In practice, however, we frequently need to whiten data rather than random variables. In this case, we have an n × d data matrix X = (x_ki) whose rows are assumed to be drawn from a distribution with expectation μ and covariance matrix Σ. In this setting, the transformation of Equation (1) from original to whitened data matrix becomes Z = X W^T.

A further complication is that the covariance matrix Σ is often unknown. Accordingly, it needs to be learned from data, either from X or from another suitable dataset, yielding a covariance matrix estimate. Typically, for large sample size n and small dimension d, the standard unbiased empirical covariance S = (s_ij) with s_ij = 1/(n−1) ∑_{k=1}^n (x_ki − x̄_i)(x_kj − x̄_j) is used. In high-dimensional cases with d > n, the empirical estimator breaks down, and the covariance matrix needs to be estimated by a suitable regularized method instead (e.g., Schäfer and Strimmer 2005; Pourahmadi 2011). Finally, from the spectral decomposition of the estimated covariance or corresponding correlation matrix, we then obtain the desired estimated whitening matrix.

7.2. Iris Flower Data Example

For an illustrative comparison of the five natural whitening transforms discussed in this article and listed in Table 1, we applied them to the well-known iris flower dataset of Anderson reported in Fisher (1936), which comprises d = 4 correlated variables (x_1: sepal length, x_2: sepal width, x_3: petal length, x_4: petal width) and n = 150 observations.

The results are shown in Table 2, with all estimates based on the empirical covariance S. For the PCA and PCA-cor whitening transformations, we have set diag(U) > 0 and diag(G) > 0, respectively. The upper half of Table 2 shows the estimated cross-correlations between each component of the whitened and original vector for the five methods, and the lower half the values of the various objective functions discussed above.

Table 2. Whitening transforms applied to the iris flower dataset.

                       ZCA       PCA      Cholesky   ZCA-cor   PCA-cor
    cor(z_1, x_1)       ·         ·          ·          ·         ·
    cor(z_2, x_2)       ·         ·          ·          ·         ·
    cor(z_3, x_3)       ·         ·          ·          ·         ·
    cor(z_4, x_4)       ·         ·          ·          ·         ·
    tr(Φ)             2.9829      ·          ·          ·         ·
    tr(Ψ)               ·         ·          ·        3.1914      ·
    max diag(ΦΦ^T)      ·       4.2282       ·          ·         ·
    max diag(ΨΨ^T)      ·         ·          ·          ·       2.9185

NOTE: Bold font indicates the best whitening transformation, and italic font the second best method, for each considered criterion (lines –).

As expected, the ZCA and the ZCA-cor whitening produce sphered variables that are most correlated to the original data on a component-wise level, with the former achieving the best fit for the covariance-based and the latter for the correlation-based objective.

In contrast, the PCA and PCA-cor methods are best at producing whitened variables that are maximally simultaneously linked with all components of the original variables. Consequently, as can be seen from the top half of Table 2, for PCA and PCA-cor whitening only the first two components z_1 and z_2 are highly correlated with their respective counterparts x_1 and x_2, whereas the subsequent pairs z_3, x_3 and z_4, x_4 are effectively uncorrelated. Furthermore, the last line of Table 2 shows that PCA-cor whitening achieves a higher maximum total squared correlation of the first component z_1 with all components of x than PCA whitening, indicating better compression.

Interestingly, Cholesky whitening always assumes third place in the rankings, either behind ZCA and ZCA-cor whitening, or behind PCA and PCA-cor whitening. Moreover, it is the only approach where by construction one pair (z_4, x_4) perfectly correlates between whitened and original data.

8. Conclusion

In this note we have investigated linear transformations for whitening of random variables. These methods are commonly employed in data analysis for preprocessing and to facilitate subsequent analysis.

In principle, there are infinitely many possible whitening procedures, all satisfying the fundamental constraint of Equation (3) for the underlying whitening matrix. However, as we have demonstrated here, the rotational freedom inherent in whitening can be broken by considering the cross-covariance Φ and cross-correlation Ψ between whitened and original variables.

Specifically, we have studied five natural whitening transforms (see Table 1), all of which can be interpreted as either optimizing a suitable function of Φ or Ψ, or satisfying a symmetry constraint on Φ or Ψ. As a result, this not only leads to a better understanding of the differences among whitening methods, but also enables an informed choice.

In particular, selecting a suitable whitening transformation depends on the context of application, specifically whether minimal adjustment or compression of data is desired. In the former case, the whitened variables remain highly correlated to the original variables, and thus maintain their original interpretation. This is advantageous, for example, in the context of variable selection, where one would like to understand the resulting selected submodel. In contrast, in a compression context the whitened variables by construction bear no interpretable relation to the original data but instead reflect their intrinsic effective dimension.

In general, we advocate using scale-invariant optimality functions and thus recommend using the cross-correlation Ψ as a basis for optimization. Consequently, we particularly endorse two specific whitening approaches. If the aim is to obtain sphered variables that are maximally similar to the original ones, we suggest employing the ZCA-cor whitening procedure of Equation (11). Conversely, if maximal compression is desirable, we recommend the PCA-cor whitening approach of Equation (12).

References

Ahdesmäki, M., and Strimmer, K. (2010), "Feature Selection in Omics Prediction Problems Using Cat Scores and False Non-Discovery Rate Control," Annals of Applied Statistics, 4, 503–519.

Bell, A. J., and Sejnowski, T. J. (1997), "The 'Independent Components' of Natural Scenes are Edge Filters," Vision Research, 37, 3327–3338.

Eldar, Y. C., and Oppenheim, A. V. (2003), "MMSE Whitening and Subspace Whitening," IEEE Transactions on Information Theory, 49, 1846–1851.

Fisher, R. A. (1936), "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics, 7, 179–188.

Friedman, J. H. (1987), "Exploratory Projection Pursuit," Journal of the American Statistical Association, 82, 249–266.

Garthwaite, P. H., Critchley, F., Anaya-Izquierdo, K., and Mubwandarikwa, E. (2012), "Orthogonalization of Vectors with Minimal Adjustment," Biometrika, 99, 787–798.

Genizi, A. (1993), "Decomposition of R² in Multiple Regression with Correlated Regressors," Statistica Sinica, 3, 407–420.

Hao, N., Dong, B., and Fan, J. (2015), "Sparsifying the Fisher Linear Discriminant by Rotation," Journal of the Royal Statistical Society, Series B, 77, 827–851.

Johnson, R. M. (1966), "The Minimal Transformation to Orthonormality," Psychometrika, 31, 61–66.

Jolliffe, I. T. (2002), Principal Component Analysis (2nd ed.), New York: Springer.

Li, G., and Zhang, J. (1998), "Sphering and its Properties," Sankhyā A, 60, 119–133.

Pourahmadi, M. (2011), "Covariance Estimation: The GLM and Regularization Perspectives," Statistical Science, 26, 369–387.

Schäfer, J., and Strimmer, K. (2005), "A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics," Statistical Applications in Genetics and Molecular Biology, 4, Article 32.

Zuber, V., Duarte Silva, A. P., and Strimmer, K. (2012), "A Novel Algorithm for Simultaneous SNP Selection in High-Dimensional Genome-Wide Association Studies," BMC Bioinformatics, 13, Article 284.

Zuber, V., and Strimmer, K. (2009), "Gene Ranking and Biomarker Discovery Under Correlation," Bioinformatics, 25, 2700–2707.

Zuber, V., and Strimmer, K. (2011), "High-Dimensional Regression and Variable Selection Using CAR Scores," Statistical Applications in Genetics and Molecular Biology, 10, Article 34.