
Robust Shrinkage Estimation of High-dimensional Covariance Matrices

Yilun Chen (1), Ami Wiesel (2), Alfred O. Hero, III (1)
(1) Department of EECS, University of Michigan, Ann Arbor, USA
(2) Department of CS, Hebrew University, Jerusalem, Israel
{yilun,hero}@umich.edu, [email protected]

This work was partially supported by AFOSR, grant number FA9550-06-1-0324. The work of A. Wiesel was supported by a Marie Curie Outgoing International Fellowship within the 7th European Community Framework Programme.

Abstract—We address high dimensional covariance estimation for elliptically distributed samples. Specifically, we consider shrinkage methods that are suitable for high dimensional problems with a small number of samples (large p, small n). We start from a classical robust covariance estimator [Tyler (1987)], which is distribution-free within the family of elliptical distributions but inapplicable when n < p. Using a shrinkage coefficient, we regularize Tyler's fixed point iteration. We derive the minimum mean-squared-error shrinkage coefficient in closed form. The closed form expression is a function of the unknown true covariance and cannot be implemented in practice. Instead, we propose a plug-in estimate to approximate it. Simulations demonstrate that the proposed method achieves low estimation error and is robust to heavy-tailed samples.

I. INTRODUCTION

Estimating a covariance matrix (or a dispersion matrix) is a fundamental problem in statistical signal processing. Many techniques for detection and estimation rely on accurate estimation of the true covariance. In recent years, estimating a high dimensional p × p covariance matrix under small sample size n has attracted considerable attention. In these "large p small n" problems, the classical sample covariance suffers from a systematically distorted eigen-structure, and improved estimators are required.

Many efforts have been devoted to high-dimensional covariance estimation, using Steinian shrinkage [1]-[3] or other types of regularized methods such as [4], [5]. However, most of these high-dimensional estimators assume Gaussian distributed samples. This limits their usage in many important applications involving non-Gaussian and heavy-tailed samples.

One exception is the Ledoit-Wolf estimator [2], in which the authors shrink the sample covariance towards a scaled identity matrix and propose a shrinkage coefficient that is asymptotically optimal for any distribution. However, as the Ledoit-Wolf estimator operates on the sample covariance, it is inappropriate for heavy-tailed non-Gaussian distributions. On the other hand, traditional robust covariance estimators [6]-[8] designed for non-Gaussian samples generally require n ≫ p and are not suitable for "large p small n" problems. The goal of our work is therefore to develop a covariance estimator for problems that are both high dimensional and non-Gaussian.

In this paper, we model the samples using the elliptical distribution, a flexible and popular alternative that encompasses a large number of important non-Gaussian distributions in signal processing and related fields, e.g., [9], [13], [16]. A well-studied covariance estimator in this setting is the ML estimator based on normalized samples [7], [16]. The estimator is derived as the solution to a fixed point equation. It is distribution-free within the class of elliptical distributions, and its performance advantages are well known in the n ≫ p regime. However, it is not suitable for the "large p small n" setting: when n < p, the ML estimator does not even exist. To avoid this problem, the authors of [10] proposed an iterative regularized ML estimator that employs diagonal loading, with a heuristic for selecting the regularization parameter, and empirically demonstrated that their algorithm has superior performance in the context of a radar application.

Our approach is similar to [10], but we propose a systematic choice of the regularization parameter. We consider a shrinkage estimator that regularizes the fixed point iterations of the ML estimator. Following Ledoit-Wolf [2], we provide a simple closed-form expression for the minimum mean-squared-error shrinkage coefficient. This clairvoyant coefficient is a function of the unknown true covariance and cannot be implemented in practice. Instead, we develop a "plug-in" estimate to approximate it. Simulation results demonstrate that our estimator achieves superior performance for samples distributed within the elliptical family. Furthermore, for the case that the samples are truly Gaussian, we report very little performance degradation with respect to shrinkage estimators designed specifically for Gaussian samples [3].
The paper is organized as follows. Section II provides a brief review of elliptical distributions and Tyler's covariance estimation method. The regularized estimator is introduced and derived in Section III. We provide simulations in Section IV and conclude the paper in Section V.

Notations: In the following, we denote vectors by lowercase boldface letters and matrices by uppercase boldface letters. (·)^T denotes the transpose operator. Tr(·) and det(·) are the trace and the determinant of a matrix, respectively.

II. ML COVARIANCE ESTIMATION FOR ELLIPTICAL DISTRIBUTIONS

A. Elliptical distribution

Let x be a zero-mean p × 1 random vector generated by the following model:

    x = R u,    (1)

where R is a positive random variable and u is a p × 1 zero-mean, jointly Gaussian random vector with positive definite covariance Σ. We assume that R and u are statistically independent. The resulting random vector x is elliptically distributed.

The elliptical family encompasses many useful distributions in signal processing and related fields, including the Gaussian distribution itself, the K distribution, the Weibull distribution and many others. Elliptically distributed samples are also referred to as Spherically Invariant Random Vectors (SIRV) or compound Gaussian vectors in signal processing, and have been used in various applications such as band-limited speech signal models, radar clutter echo models [9], and wireless fading channels [13].
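Model (1) is straightforward to simulate, which is convenient for reproducing the experiments of Section IV. The following Python sketch is our own illustration, not code from the paper (the function name sample_elliptical_t is ours): it draws multivariate Student-T samples, a member of the elliptical family, by taking R to be an inverse chi-square mixing variable.

```python
import numpy as np

def sample_elliptical_t(n, sigma, df, rng):
    """Draw n samples x = R * u from model (1), with R chosen so that x is
    multivariate Student-T with df degrees of freedom and dispersion sigma."""
    p = sigma.shape[0]
    u = rng.multivariate_normal(np.zeros(p), sigma, size=n)  # u ~ N(0, Sigma)
    w = rng.chisquare(df, size=n)                            # chi-square mixing variable
    return u * np.sqrt(df / w)[:, None]                      # R_i = sqrt(df / w_i) > 0
```

Any other positive law for R (including a constant, which recovers the Gaussian case) fits the same template.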

B. ML estimation

Let {x_i}_{i=1}^n be a set of n independent and identically distributed (i.i.d.) samples drawn according to (1). The problem is to estimate the covariance (dispersion) matrix Σ from {x_i}_{i=1}^n. To remove the scale ambiguity caused by R we further constrain Tr(Σ) = p.

The commonly used sample covariance, defined as

    \hat{S} = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T,    (2)

is known to be a poor estimator of Σ, especially when the samples are high-dimensional and/or heavy-tailed. Tyler's method [7], [16] addresses this problem by working with the normalized samples

    s_i = \frac{x_i}{\| x_i \|_2},    (3)

for which the term R in (1) drops out. The pdf of s_i is given by [12]

    p(s_i; \Sigma) = \frac{\Gamma(p/2)}{2 \pi^{p/2}} \cdot \det(\Sigma)^{-1/2} \cdot \left( s_i^T \Sigma^{-1} s_i \right)^{-p/2},    (4)

and the maximum likelihood estimator based on {s_i}_{i=1}^n is the solution to

    \Sigma = \frac{p}{n} \sum_{i=1}^{n} \frac{s_i s_i^T}{s_i^T \Sigma^{-1} s_i}.    (5)

When n ≥ p, the ML estimator can be found using the following fixed point iterations:

    \hat{\Sigma}_{j+1} = \frac{p}{n} \sum_{i=1}^{n} \frac{s_i s_i^T}{s_i^T \hat{\Sigma}_j^{-1} s_i},    (6)

where the initial \hat{\Sigma}_0 is usually set to the identity matrix. It can be shown [7], [16] that the iteration process in (6) converges and does not depend on the initial setting of \hat{\Sigma}_0. In practice, a final normalization step is needed to ensure that the iteration limit \hat{\Sigma}_\infty satisfies Tr(\hat{\Sigma}_\infty) = p.

The ML estimate corresponds to a Huber-type M-estimator and has many good properties when n ≫ p, such as asymptotic normality and strong consistency. Furthermore, it has been pointed out [7] that the ML estimate is the "most robust" covariance estimator in the class of elliptical distributions, in the sense of minimizing the maximum asymptotic variance.
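For concreteness, a direct NumPy implementation of (3) and (6) might look as follows. This sketch is ours, not code from the paper; it assumes n ≥ p, and it applies the trace normalization at every iteration, which yields the same normalized sequence as a single final normalization because the fixed point map in (6) is homogeneous in its argument.

```python
import numpy as np

def tyler_ml(x, max_iter=200, tol=1e-6):
    """Tyler's ML estimator: normalized samples (3), fixed point iteration (6),
    and the trace constraint Tr(Sigma) = p. x is an n x p matrix with n >= p."""
    n, p = x.shape
    s = x / np.linalg.norm(x, axis=1, keepdims=True)              # eq. (3)
    sigma = np.eye(p)                                             # Sigma_0 = I
    for _ in range(max_iter):
        w = np.einsum('ij,jk,ik->i', s, np.linalg.inv(sigma), s)  # s_i^T Sigma^{-1} s_i
        new = (p / n) * (s / w[:, None]).T @ s                    # eq. (6)
        new *= p / np.trace(new)                                  # trace normalization
        if np.linalg.norm(new - sigma, 'fro') <= tol * np.linalg.norm(sigma, 'fro'):
            return new
        sigma = new
    return sigma
```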
III. ROBUST SHRINKAGE COVARIANCE ESTIMATION

Here we extend Tyler's method to the high dimensional setting using shrinkage. It is easy to see that there is no solution to (5) when n < p: its left-hand side is full rank whereas its right-hand side is rank deficient. This motivates us to develop a regularized covariance estimator for elliptical samples. Following [2], [3], we propose to regularize the fixed point iterations as

    \hat{\Sigma}_{j+1} = (1 - \rho) \frac{p}{n} \sum_{i=1}^{n} \frac{s_i s_i^T}{s_i^T \hat{\Sigma}_j^{-1} s_i} + \rho I,    (7)

    \hat{\Sigma}_{j+1} \leftarrow \frac{\hat{\Sigma}_{j+1}}{\mathrm{Tr}(\hat{\Sigma}_{j+1})/p},    (8)

where ρ is the so-called shrinkage coefficient, a constant between 0 and 1. When ρ = 0 the proposed shrinkage estimator reduces to Tyler's unbiased method in (5), and when ρ = 1 it reduces to the trivial estimator \hat{\Sigma} = I. The term ρI ensures that \hat{\Sigma}_{j+1} is always well-conditioned and thus allows continuation of the iterative process without restarts; the proposed iteration can therefore be applied to high dimensional estimation problems. We note that the normalization (8) is important and necessary for convergence.
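In code, the regularized iteration differs from the plain Tyler iteration only by the convex combination with the identity in (7) and the explicit normalization (8). A minimal sketch under the same assumptions as before (our code, not the authors'):

```python
import numpy as np

def robust_shrinkage(x, rho, max_iter=200, tol=1e-6):
    """Regularized fixed point iterations (7)-(8) for a given 0 <= rho <= 1.
    Unlike (6), the iteration is well defined even when n < p."""
    n, p = x.shape
    s = x / np.linalg.norm(x, axis=1, keepdims=True)              # normalized samples (3)
    sigma = np.eye(p)
    for _ in range(max_iter):
        w = np.einsum('ij,jk,ik->i', s, np.linalg.inv(sigma), s)
        new = (1 - rho) * (p / n) * (s / w[:, None]).T @ s + rho * np.eye(p)  # eq. (7)
        new *= p / np.trace(new)                                  # normalization (8)
        if np.linalg.norm(new - sigma, 'fro') <= tol * np.linalg.norm(sigma, 'fro'):
            return new
        sigma = new
    return sigma
```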
We now turn to the problem of choosing a good, data-dependent shrinkage coefficient ρ. Following Ledoit-Wolf [2], we begin by assuming we "know" the true covariance Σ. The value of ρ that minimizes the mean-squared error is called the "oracle" coefficient and is

    \rho_O = \arg\min_{\rho} E\left\{ \left\| \tilde{\Sigma}(\rho) - \Sigma \right\|_F^2 \right\},    (9)

where \tilde{\Sigma}(\rho) is defined as

    \tilde{\Sigma}(\rho) = (1 - \rho) \frac{p}{n} \sum_{i=1}^{n} \frac{s_i s_i^T}{s_i^T \Sigma^{-1} s_i} + \rho I.    (10)

There is a closed-form solution to the problem (9), which is provided in the following theorem.

Theorem 1. For i.i.d. elliptically distributed samples, the solution to (9) is

    \rho_O = \frac{ \mathrm{Tr}^2(\Sigma) + (1 - 2/p)\,\mathrm{Tr}(\Sigma^2) }{ (1 - n/p - 2n/p^2)\,\mathrm{Tr}^2(\Sigma) + (n + 1 + 2(n-1)/p)\,\mathrm{Tr}(\Sigma^2) }.    (11)

Proof: To ease the notation we define \tilde{C} as

    \tilde{C} = \frac{p}{n} \sum_{i=1}^{n} \frac{s_i s_i^T}{s_i^T \Sigma^{-1} s_i}.    (12)

The shrinkage "estimator" in (10) is then

    \tilde{\Sigma}(\rho) = (1 - \rho)\tilde{C} + \rho I.    (13)

Substituting (13) into (9) yields an objective that is quadratic in ρ; setting its derivative with respect to ρ to zero, we obtain

    \rho_O = \frac{ E\left\{ \mathrm{Tr}\left[ (I - \tilde{C})(\Sigma - \tilde{C}) \right] \right\} }{ E\left\{ \| I - \tilde{C} \|_F^2 \right\} } = \frac{ m_2 - m_{11} - m_{12} + p }{ m_2 - 2 m_{11} + p },    (14)

where

    m_2 = E\{ \mathrm{Tr}(\tilde{C}^2) \},    (15)

    m_{11} = E\{ \mathrm{Tr}(\tilde{C}) \},    (16)

and

    m_{12} = E\{ \mathrm{Tr}(\tilde{C}\Sigma) \}.    (17)

Next, we calculate these moments. We begin by eigen-decomposing Σ as

    \Sigma = U D U^T,    (18)

and denote

    \Lambda = U D^{1/2}.    (19)

Then, we define

    z_i = \frac{\Lambda^{-1} s_i}{\| \Lambda^{-1} s_i \|_2} = \frac{\Lambda^{-1} u_i}{\| \Lambda^{-1} u_i \|_2}.    (20)

Noting that u_i is a Gaussian random vector with covariance Σ, it is easy to see that \|z_i\|_2 = 1 and that z_i and z_j are independent for i ≠ j. Furthermore, z_i is isotropically distributed [3] and satisfies

    E\{ z_i z_i^T \} = \frac{1}{p} I,    (21)

    E\left\{ (z_i^T D z_i)^2 \right\} = \frac{1}{p(p+2)} \left[ 2\mathrm{Tr}(D^2) + \mathrm{Tr}^2(D) \right] = \frac{1}{p(p+2)} \left[ 2\mathrm{Tr}(\Sigma^2) + \mathrm{Tr}^2(\Sigma) \right],    (22)

and

    E\left\{ (z_i^T D z_j)^2 \right\} = \frac{1}{p^2} \mathrm{Tr}(D^2) = \frac{1}{p^2} \mathrm{Tr}(\Sigma^2), \quad i \neq j.    (23)

Expressing \tilde{C} in terms of z_i, we have

    \tilde{C} = \frac{p}{n} \sum_{i=1}^{n} \Lambda z_i z_i^T \Lambda^T.    (24)

Then,

    E\{ \tilde{C} \} = \frac{p}{n} \sum_{i=1}^{n} \Lambda E\{ z_i z_i^T \} \Lambda^T = \Sigma,    (25)

and accordingly we have

    m_{11} = E\{ \mathrm{Tr}(\tilde{C}) \} = \mathrm{Tr}(\Sigma),    (26)

and

    m_{12} = E\{ \mathrm{Tr}(\tilde{C}\Sigma) \} = \mathrm{Tr}(\Sigma^2).    (27)

For m_2 we have

    m_2 = \frac{p^2}{n^2} E\left\{ \mathrm{Tr}\left( \sum_{i=1}^{n} \sum_{j=1}^{n} \Lambda z_i z_i^T \Lambda^T \Lambda z_j z_j^T \Lambda^T \right) \right\}
        = \frac{p^2}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} E\left\{ \mathrm{Tr}\left( z_i z_i^T \Lambda^T \Lambda z_j z_j^T \Lambda^T \Lambda \right) \right\}
        = \frac{p^2}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} E\left\{ (z_i^T D z_j)^2 \right\}.    (28)

Now substituting (22) and (23) into (28) gives

    m_2 = \frac{p^2}{n^2} \left[ \frac{n}{p(p+2)} \left( 2\mathrm{Tr}(\Sigma^2) + \mathrm{Tr}^2(\Sigma) \right) + \frac{n(n-1)}{p^2} \mathrm{Tr}(\Sigma^2) \right]
        = \left( 1 - \frac{1}{n} + \frac{2}{n(1+2/p)} \right) \mathrm{Tr}(\Sigma^2) + \frac{\mathrm{Tr}^2(\Sigma)}{n(1+2/p)}.    (29)

Recalling Tr(Σ) = p, (11) is finally obtained by substituting (26), (27) and (29) into (14). ∎

The oracle cannot be implemented since ρ_O is a function of the unknown true covariance Σ. We propose to use a plug-in estimate for ρ_O, determined as follows:

    \hat{\rho} = \frac{ \mathrm{Tr}^2(\hat{R}) + (1 - 2/p)\,\mathrm{Tr}(\hat{R}^2) }{ (1 - n/p - 2n/p^2)\,\mathrm{Tr}^2(\hat{R}) + (n + 1 + 2(n-1)/p)\,\mathrm{Tr}(\hat{R}^2) },    (30)

where \hat{R} is the normalized sample covariance:

    \hat{R} = \frac{1}{n} \sum_{i=1}^{n} s_i s_i^T.    (31)

Given the plug-in estimate \hat{\rho}, the robust shrinkage estimator is computed using the fixed point iteration in (7) and (8).

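Since (11) and (30) share the same functional form, a single helper can evaluate both: pass the true Σ for the oracle ρ_O, or the normalized sample covariance \hat{R} of (31) for the plug-in \hat{ρ}. The sketch below is ours, not the authors' code; note that the ratio is invariant to the overall scaling of its matrix argument, so the differing trace conventions of Σ and \hat{R} do not matter.

```python
import numpy as np

def shrinkage_coefficient(m, n):
    """Evaluate the common ratio in (11)/(30) for a p x p matrix m and sample size n."""
    p = m.shape[0]
    tr2 = np.trace(m) ** 2                        # Tr^2(m)
    trsq = np.trace(m @ m)                        # Tr(m^2)
    num = tr2 + (1 - 2 / p) * trsq
    den = (1 - n / p - 2 * n / p**2) * tr2 + (n + 1 + 2 * (n - 1) / p) * trsq
    return num / den

def plug_in_rho(x):
    """Plug-in coefficient (30) from the normalized sample covariance (31)."""
    n, p = x.shape
    s = x / np.linalg.norm(x, axis=1, keepdims=True)   # s_i = x_i / ||x_i||_2, eq. (3)
    return shrinkage_coefficient((s.T @ s) / n, n)     # R_hat, eq. (31)
```

The complete estimator is then obtained as robust_shrinkage(x, plug_in_rho(x)), using the iteration sketched in Section III.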
IV. SIMULATIONS

In this section we use simulations to demonstrate the superior performance of the proposed approach. First we show that our estimator outperforms other estimators in the case of heavy-tailed samples generated by a multivariate Student-T distribution, whose degree-of-freedom parameter is set to 3. The dimensionality p is chosen to be 100, and an autoregressive structure is used for Σ. Specifically, we let Σ be the covariance matrix of an AR(1) process,

    \Sigma(i,j) = r^{|i-j|},    (32)

where Σ(i,j) denotes the entry of Σ in row i and column j, and r is set to 0.7 in this simulation. The sample size n varies from 5 to 225 with step size 10. All the simulations are repeated for 100 trials and the results are averaged.

For comparison, we also plot the results of the closed-form oracle in (11), the Ledoit-Wolf estimator [2], and the non-regularized ML solution in (6) when n > p. As the Ledoit-Wolf estimator operates on the sample covariance, which is sensitive to heavy tails, we also compare against the Ledoit-Wolf estimator with known R in (1): the samples x_i are first normalized by the known realizations R_i, yielding truly Gaussian samples, and the Ledoit-Wolf estimator is then applied.

The MSEs of these estimators are plotted in Fig. 1. It can be observed that the proposed method performs significantly better than the Ledoit-Wolf estimator, and that its performance is very close to that of the ideal oracle. Even the Ledoit-Wolf estimator with known R_i does not exceed the proposed estimator for small sample sizes. These results demonstrate the robustness of the proposed approach. When n > p, we also observe a substantial improvement of the proposed method over the ML estimate, which demonstrates the power of Steinian shrinkage in reducing the MSE.

[Fig. 1. Multivariate Student-T samples: comparison of different covariance estimators (Oracle, Proposed, Ledoit-Wolf, ML, Ledoit-Wolf with known R) when p = 100; MSE versus sample size n.]

In order to assess the tradeoff between accuracy and robustness, we investigate the case when the samples are truly Gaussian distributed. We use the same setting as in the previous example; the only difference is that the samples are now generated from a Gaussian distribution. The performance comparison is shown in Fig. 2, where four different methods are included: the oracle estimator derived from the Gaussian setting (Gaussian oracle) [3], the iterative approximation of the Gaussian oracle (Gaussian OAS) [3], the Ledoit-Wolf estimator, and the proposed method. It can be seen that for truly Gaussian samples the proposed method performs very closely to the Gaussian OAS, which is specifically designed for Gaussian distributions. Indeed, for small sample sizes (n < 20), the proposed method performs even better than the Ledoit-Wolf estimator. This indicates that although the proposed method is developed for the entire elliptical family, it sacrifices very little performance when the distribution is known to be Gaussian.

[Fig. 2. Gaussian samples: comparison of different covariance estimators (Gaussian oracle, Proposed, Gaussian OAS, Ledoit-Wolf) when p = 100; MSE versus sample size n.]
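To reproduce the flavor of the first experiment, the sketch below ties together the helper functions introduced in the earlier sections. This is our code rather than the authors'; the trial count is reduced for speed, and the per-entry MSE normalization is our choice, not something specified in the paper.

```python
import numpy as np
# Assumes sample_elliptical_t, robust_shrinkage and plug_in_rho from the
# sketches above are in scope.

p, r, df, trials = 100, 0.7, 3, 20                      # the paper uses 100 trials
idx = np.arange(p)
sigma = r ** np.abs(idx[:, None] - idx[None, :])        # AR(1) covariance, eq. (32)
rng = np.random.default_rng(0)

for n in range(5, 226, 10):                             # n = 5, 15, ..., 225
    err = 0.0
    for _ in range(trials):
        x = sample_elliptical_t(n, sigma, df, rng)      # heavy-tailed samples, model (1)
        est = robust_shrinkage(x, plug_in_rho(x))       # proposed estimator
        err += np.linalg.norm(est - sigma, 'fro') ** 2
    print(f"n = {n:3d}   MSE = {err / (trials * p * p):.5f}")
```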

V. CONCLUSION

In this paper, we proposed a shrinkage covariance estimator which is robust over the class of elliptically distributed samples. The proposed estimator is obtained by fixed point iterations, and we established a systematic approach to choosing the shrinkage coefficient, which was derived in a minimum mean-squared-error framework and has a closed-form expression in terms of the unknown true covariance. This expression can be well approximated by a simple plug-in estimator. Simulations suggest that the proposed estimator is robust to heavy-tailed multivariate Student-T samples. Furthermore, we showed that for the Gaussian case the proposed estimator performs very closely to previous estimators designed expressly for Gaussian samples.

We did not give a proof that the regularized fixed point iteration (7)-(8) converges. The proof of convergence uses ideas from concave Perron-Frobenius theory [18]; further details will be provided in the full length journal version of this paper.

REFERENCES

[1] C. Stein, "Estimation of a covariance matrix," Rietz Lecture, 39th Annual Meeting, IMS, Atlanta, GA, 1975.
[2] O. Ledoit and M. Wolf, "A well-conditioned estimator for large-dimensional covariance matrices," Journal of Multivariate Analysis, vol. 88, no. 2, Feb. 2004.
[3] Y. Chen, A. Wiesel, Y. C. Eldar, and A. O. Hero III, "Shrinkage algorithms for MMSE covariance estimation," to appear in IEEE Trans. on Sig. Process.
[4] J. Friedman, T. Hastie, and R. Tibshirani, "Sparse inverse covariance estimation with the graphical lasso," Biostatistics, 2008.
[5] P. Bickel and E. Levina, "Regularized estimation of large covariance matrices," The Annals of Statistics, vol. 36, pp. 199-227, 2008.
[6] P. J. Huber, Robust Statistics, Wiley, 1981.
[7] D. E. Tyler, "A distribution-free M-estimator of multivariate scatter," The Annals of Statistics, 1987.
[8] P. Rousseeuw, "Multivariate estimation with high breakdown point," Mathematical Statistics and Applications, Reidel, 1985.
[9] F. Gini, "Sub-optimum coherent radar detection in a mixture of K-distributed and Gaussian clutter," IEE Proc. Radar, Sonar and Navigation, vol. 144, no. 1, pp. 39-48, Feb. 1997.
[10] B. A. Johnson and Y. L. Abramovich, "Diagonally loaded normalised sample matrix inversion (LNSMI) for outlier-resistant adaptive filtering," IEEE Intl. Conf. on Acoust., Speech, and Signal Processing, vol. 3, pp. 1105-1108, 2007.
[11] J. B. Billingsley, "Ground clutter measurements for surface-sited radar," Technical Report 780, MIT, Feb. 1993.
[12] G. Frahm, "Generalized elliptical distributions: theory and applications," Dissertation, 2004.
[13] K. Yao, M. K. Simon, and E. Biglieri, "A unified theory on wireless communication fading statistics based on SIRV," Fifth IEEE Workshop on SP Advances in Wireless Communications, 2004.
[14] A. L. Yuille and A. Rangarajan, "The concave-convex procedure," NIPS, 2003.
[15] J. Wang, A. Dogandzic, and A. Nehorai, "Maximum likelihood estimation of compound-Gaussian clutter and target parameters," IEEE Trans. on Sig. Process., vol. 54, no. 10, 2006.
[16] F. Pascal, Y. Chitour, J.-P. Ovarlez, P. Forster, and P. Larzabal, "Covariance structure maximum-likelihood estimates in compound Gaussian noise: existence and algorithm analysis," IEEE Trans. on Sig. Process., vol. 56, no. 1, Jan. 2008.
[17] Y. Chitour and F. Pascal, "Exact maximum likelihood estimates for SIRV covariance matrix: existence and algorithm analysis," IEEE Trans. on Sig. Process., vol. 56, no. 10, Oct. 2008.
[18] U. Krause, "Concave Perron-Frobenius theory and applications," Nonlinear Anal., vol. 47, no. 3, pp. 1457-1466, 2001.