
Estimation of Covariance Matrix


Estimation of population covariance matrices from samples of multivariate data is important in many applications, including:

(1) Estimation of principal components and eigenvalues.

(2) Construction of linear discriminant functions.

(3) Establishing independence and conditional independence.

(4) Setting confidence intervals on linear functions.

Suppose we observe $p$-dimensional multivariate samples $X_1, X_2, \cdots, X_n$ i.i.d. with mean $0$ and covariance matrix $\Sigma_p$, and write

$$X_i = (X_{i1}, X_{i2}, \cdots, X_{ip})^T.$$

Our goal is to estimate $\Sigma_p$. For simplicity, we first consider the Gaussian case, where

$$X_i \sim N(0, \Sigma_p).$$

When $p$ is fixed and does not depend on $n$, the empirical covariance matrix is a good estimator, i.e.
$$\hat{\Sigma}_p = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})^T.$$
In this case, $\hat{\Sigma}_p$ is a consistent estimator and the rate of convergence is $n^{-1/2}$, which is optimal.

But when $p$ is large (larger than $n$, or growing with $n$), this estimator can perform very badly.

If $p/n \to c \in (0, 1)$ and the covariance matrix $\Sigma_p = I$, then the empirical distribution of the eigenvalues of $\hat{\Sigma}_p$ is supported on $((1 - \sqrt{c})^2, (1 + \sqrt{c})^2)$. Thus the larger $p/n$ is, the more spread out the eigenvalues become, and in terms of the $\| \cdot \|_2$ norm, $\hat{\Sigma}_p$ is not consistent.
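This eigenvalue spreading is easy to see in simulation. The following minimal sketch (our own illustration, not part of the original notes; all variable names are ours) draws Gaussian data with $\Sigma_p = I$ and compares the extreme sample eigenvalues with the predicted support endpoints.

```python
import numpy as np

# Illustration: with Sigma_p = I and p/n = c, the eigenvalues of the
# empirical covariance matrix spread over ((1 - sqrt(c))^2, (1 + sqrt(c))^2)
# instead of concentrating near 1.
rng = np.random.default_rng(0)
n, p = 2000, 1000                          # c = p/n = 0.5
X = rng.standard_normal((n, p))            # X_i ~ N(0, I_p)
S = X.T @ X / n                            # empirical covariance (mean known to be 0)
eigs = np.linalg.eigvalsh(S)
c = p / n
print(eigs.min(), eigs.max())              # close to the endpoints below
print((1 - c**0.5)**2, (1 + c**0.5)**2)    # 0.0858..., 2.9142...
```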

In fact, when $p > n$ we need to estimate $p \times p$ parameters based on only $n \times p$ observations. So to get a good estimator, the covariance matrix $\Sigma_p$ must have some special form. In practice, one of the most important classes of covariance matrices is

$$F(\epsilon, \alpha, C) = \left\{\Sigma = (\sigma_{ij}) : \max_j \sum_{i : |i-j| > k} |\sigma_{ij}| \le Ck^{-\alpha} \text{ for all } k > 0, \text{ and } 0 < \epsilon \le \text{eigenvalues}(\Sigma) \le 1/\epsilon \right\}.$$

This is the class of matrices whose entries decay away from the diagonal. For example, the AR(1) covariance $\sigma_{ij} = \rho^{|i-j|}$ with $|\rho| < 1$ decays geometrically off the diagonal and so satisfies the condition for a suitable $C$. There are several methods for estimating covariance matrices of this type. We first introduce the banding method.

1 Banding methods

To evaluate the performance of an estimator, we will use the matrix $\ell_2$ norm. Let us first introduce the estimation procedures.

1.1 Banding the covariance matrix

For any matrix $M = (m_{ij})_{p \times p}$ and any $0 \le k < p$, define

$$B_k(M) = (m_{ij} I(|i - j| \le k)).$$

Then we can estimate the covariance matrix by $\hat{\Sigma}_{k,p} = B_k(\hat{\Sigma}_p)$ for some $k$.
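As a concrete sketch of this estimator (our own code; the function names band and banded_cov are illustrative, not from any particular library):

```python
import numpy as np

# B_k(M): keep the entries within distance k of the diagonal, zero the rest.
def band(M: np.ndarray, k: int) -> np.ndarray:
    i, j = np.indices(M.shape)
    return np.where(np.abs(i - j) <= k, M, 0.0)

# The banded covariance estimator Sigma_hat_{k,p} = B_k(Sigma_hat_p),
# where Sigma_hat_p is the centered empirical covariance matrix.
def banded_cov(X: np.ndarray, k: int) -> np.ndarray:
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    return band(Xc.T @ Xc / n, k)
```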

Theorem 1 If $k \propto (n^{-1} \log p)^{-1/(2(\alpha+1))}$, then
$$\|\hat{\Sigma}_{k,p} - \Sigma_p\|_2 = O_p\left(\left(\frac{\log p}{n}\right)^{\alpha/(2(\alpha+1))}\right) = \|\hat{\Sigma}_{k,p}^{-1} - \Sigma_p^{-1}\|_2 \qquad (1)$$
uniformly on $F(\epsilon, \alpha, C)$.

Proof. For any matrix $M$, let $|M|$ denote the maximum absolute entry of $M$, and let $\hat{\Sigma}_p^0 = \frac{1}{n} \sum_{i=1}^n X_i X_i^T$; then
$$\|B_k(\hat{\Sigma}_p^0) - B_k(\Sigma_p)\|_2 = O_p(\|B_k(\hat{\Sigma}_p^0) - B_k(\Sigma_p)\|_\infty) = O_p(k |B_k(\hat{\Sigma}_p^0) - B_k(\Sigma_p)|).$$

Then from Lemma 1 below we know that

$$P(|B_k(\hat{\Sigma}_p^0) - B_k(\Sigma_p)| \ge t) \le (2k+1) p \exp\{-n t^2 \gamma(\epsilon, \lambda)\}, \quad \text{for } |t| \le \lambda(\epsilon).$$
By choosing $t = M(\log(pk)/n)^{1/2}$ we conclude that, uniformly on $F$,

$$|B_k(\hat{\Sigma}_p^0) - B_k(\Sigma_p)| = O_p((n^{-1} \log p)^{1/2}).$$

On the other hand,

$$\|B_k(\Sigma_p) - \Sigma_p\|_2 \le Ck^{-\alpha}$$
for $\Sigma_p \in F$. And

$$\|B_k(\hat{\Sigma}_p^0) - B_k(\hat{\Sigma}_p)\|_2 \le \|B_k(\bar{X} \bar{X}^T)\|_2 \le (2k+1) \max_j |\bar{X}_j|^2 = O_p(k \log p / n),$$

where $\bar{X} = (\bar{X}_1, \cdots, \bar{X}_p)^T$. Putting these together, we have the result.

Lemma 1 Let $Z_i$ be i.i.d. $N(0, \Sigma_p)$ and $\lambda_{\max}(\Sigma_p) \le \epsilon^{-1} < \infty$. Then

$$P\left(\left|\sum_{i=1}^{n} (Z_{ij} Z_{ik} - \sigma_{jk})\right| \ge nv\right) \le C_1 \exp(-C_2 n v^2)$$
for $|v| \le \delta$, where $C_1$, $C_2$ and $\delta$ depend on $\epsilon$ only.

Proof. Let $W_i = Z_{ij} Z_{ik} - \sigma_{jk}$; then the $W_i$ are i.i.d. random variables with $E(W_i) = 0$ and $\mathrm{Var}(W_i) = \sigma_{jj}\sigma_{kk} + \sigma_{jk}^2$. The lemma then follows from a general large deviation result.

1.2 Banding the inverse

In the previous section, we estimated the covariance matrix by banding the empirical covariance matrix. This estimator has some nice properties, but it is not guaranteed to be a positive definite matrix and hence may not be the ideal estimator in some applications.

In this part, we will introduce a procedure that gives us a positive definite estimator.

This procedure is based on the modified Cholesky decomposition of the covariance matrix.

Suppose $Z = (Z_1, Z_2, \cdots, Z_p)^T \sim N(0, \Sigma_p)$. Then for any $1 < j \le p$ we have

$$E(Z_j | Z_1, Z_2, \cdots, Z_{j-1}) = \sum_{i=1}^{j-1} a_{j,i} Z_i,$$
where the coefficients can be computed as

$$a_j = (a_{j,1}, \cdots, a_{j,j-1})^T = (\mathrm{Var}(Z_1, Z_2, \cdots, Z_{j-1}))^{-1} (\mathrm{Cov}(Z_1, Z_j), \mathrm{Cov}(Z_2, Z_j), \cdots, \mathrm{Cov}(Z_{j-1}, Z_j))^T.$$

Let the lower triangular matrix $A$ with zeros on the diagonal contain the coefficients $a_j$ arranged in its rows. Let $\epsilon_j = Z_j - E(Z_j | Z_1, Z_2, \cdots, Z_{j-1})$ (with $\epsilon_1 = Z_1$), let $d_j^2 = \mathrm{Var}(\epsilon_j)$, and let $D = \mathrm{diag}(d_1^2, \cdots, d_p^2)$. Then by calculating $\mathrm{Cov}[(I - A)Z]$, we know that

$$\Sigma_p = (I - A)^{-1} D [(I - A)^{-1}]^T,$$
$$\Sigma_p^{-1} = (I - A)^T D^{-1} (I - A).$$
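The decomposition can be checked numerically; the sketch below (our own code, with illustrative names) builds $A$ and $D$ from the population regressions above and verifies both identities.

```python
import numpy as np

# Build A (rows of regression coefficients a_j, zero diagonal) and
# D = diag(d_1^2, ..., d_p^2) from a covariance matrix Sigma, then verify
# Sigma = (I - A)^{-1} D [(I - A)^{-1}]^T and
# Sigma^{-1} = (I - A)^T D^{-1} (I - A).
def regression_decomposition(Sigma: np.ndarray):
    p = Sigma.shape[0]
    A = np.zeros((p, p))
    d2 = np.empty(p)
    d2[0] = Sigma[0, 0]                      # eps_1 = Z_1
    for j in range(1, p):
        V = Sigma[:j, :j]                    # Var(Z_1, ..., Z_{j-1})
        c = Sigma[:j, j]                     # Cov(Z_i, Z_j) for i < j
        A[j, :j] = np.linalg.solve(V, c)     # a_j
        d2[j] = Sigma[j, j] - c @ A[j, :j]   # d_j^2 = Var(eps_j)
    return A, np.diag(d2)

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
Sigma = B @ B.T + 5 * np.eye(5)              # some positive definite covariance
A, D = regression_decomposition(Sigma)
M = np.linalg.inv(np.eye(5) - A)
print(np.allclose(M @ D @ M.T, Sigma))       # True
print(np.allclose((np.eye(5) - A).T @ np.linalg.inv(D) @ (np.eye(5) - A),
                  np.linalg.inv(Sigma)))     # True
```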

Now we can approximate the covariance matrix $\Sigma_p$ by banding $A$ and modifying $D$. For $k < p$, instead of calculating $E(Z_j | Z_1, Z_2, \cdots, Z_{j-1})$, we calculate $E(Z_j | Z_{\max(j-k,1)}, \cdots, Z_{j-1})$, i.e. we project $Z_j$ onto the space generated by its $k$ closest predecessors. Let $a_j^k$ denote the resulting coefficients, and let $D_k = \mathrm{diag}(d_{j,k}^2)$, where $d_{j,k}^2 = \mathrm{Var}(Z_j - E(Z_j | Z_{\max(j-k,1)}, \cdots, Z_{j-1}))$. Then the $k$-banded approximations $\Sigma_{p,k}$ and $\Sigma_{p,k}^{-1}$ are obtained by replacing $A$, $D$ in the equations above by $A_k$ and $D_k$, where $A_k$ contains the coefficients $a_j^k$.

For given samples $X_i = (X_{i1}, \cdots, X_{ip})$, $i = 1, 2, \cdots, n$, the estimates of $A_k$ and $D_k$ are obtained by ordinary least squares. Plugging them into the previous equations gives the final estimates of $\Sigma_p$ and $\Sigma_p^{-1}$, which we refer to as $\tilde{\Sigma}_{p,k}$ and $\tilde{\Sigma}_{p,k}^{-1}$.

Note that both estimators are positive definite. $\tilde{\Sigma}_{p,k}^{-1}$ is $k$-banded but $\tilde{\Sigma}_{p,k}$ is generally not banded.
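A sample version is sketched below (our own code, assuming centered data as above): each column is regressed on its $k$ closest predecessors by ordinary least squares, giving the estimated $A_k$ and $D_k$, and the estimate of $\Sigma_p^{-1}$ is assembled from them; it is positive definite by construction.

```python
import numpy as np

# Banded-inverse estimator: OLS of each X_j on its k closest predecessors.
def banded_inverse_estimate(X: np.ndarray, k: int) -> np.ndarray:
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    A = np.zeros((p, p))
    d2 = np.empty(p)
    d2[0] = (Xc[:, 0] ** 2).mean()               # eps_1 = Z_1
    for j in range(1, p):
        lo = max(j - k, 0)
        Z = Xc[:, lo:j]                          # the k closest predecessors
        a, *_ = np.linalg.lstsq(Z, Xc[:, j], rcond=None)
        A[j, lo:j] = a
        d2[j] = ((Xc[:, j] - Z @ a) ** 2).mean() # estimated d_{j,k}^2
    I_A = np.eye(p) - A
    return I_A.T @ np.diag(1.0 / d2) @ I_A       # (I - A_k)^T D_k^{-1} (I - A_k)
```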

Now for any covariance matrix $\Sigma$, if $\Sigma^{-1} = T(\Sigma)^T D^{-1} T(\Sigma)$ with $T(\Sigma) = [t_{ij}(\Sigma)]$ lower triangular, let
$$F^{-1}(\epsilon, \alpha, C) = \left\{\Sigma : \max_i \sum_{j < i-k} |t_{ij}(\Sigma)| \le Ck^{-\alpha} \text{ for all } 0 < k < p, \text{ and } 0 < \epsilon \le \text{eigenvalues}(\Sigma) \le 1/\epsilon \right\}.$$

We have the following result.

Theorem 2 Uniformly on $F^{-1}(\epsilon, \alpha, C)$, if $k \propto (n^{-1} \log p)^{-1/(2(\alpha+1))}$ and $n^{-1} \log p = o(1)$, then
$$\|\tilde{\Sigma}_{p,k}^{-1} - \Sigma_p^{-1}\|_2 = O_p\left(\left(\frac{\log p}{n}\right)^{\alpha/(2(\alpha+1))}\right).$$

Proof. The proof is similar to that of the previous theorem. We only need to check that
$$\|\tilde{\Sigma}_{p,k}^{-1} - \Sigma_{p,k}^{-1}\|_\infty = O_p\left(\left(\frac{\log p}{n}\right)^{1/2}\right),$$
and
$$\|\Sigma_{p,k}^{-1} - B_k(\Sigma_p^{-1})\|_2 = O(k^{-\alpha}).$$

Lemma 2 Under the conditions of Theorem 2, uniformly on $F^{-1}$,
$$\max\{\|\tilde{a}_j^k - a_j^k\|_\infty : 1 \le j \le p\} = O_p\left(\left(\frac{\log p}{n}\right)^{1/2}\right),$$
$$\max\{|\tilde{d}_{j,k}^2 - d_{j,k}^2| : 1 \le j \le p\} = O_p\left(\left(\frac{\log p}{n}\right)^{\alpha/(2(\alpha+1))}\right),$$
and
$$\|A_k\|_2 = \|D_k^{-1}\|_2 = O(1).$$

1.3 Choice of k

An intuitive way to choose the banding parameter k is to minimize the risk

$$R(k) = E\|\hat{\Sigma}_k - \Sigma\|_1,$$
with the oracle $k$ being the minimizer of $R(k)$.

In practice, this can be approximately achieved by cross validation. Randomly split the sample into two groups and use the sample covariance matrix of one group (with sample size about $2n/3$) as the target to choose $k$.
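A sketch of this scheme (our own code; it reuses the banded_cov function from Section 1.1, and the number of random splits is our choice):

```python
import numpy as np

# Choose k by random-split cross validation: band the covariance of one
# group and compare it, in matrix l1 norm (max absolute column sum), with
# the plain sample covariance of the other group, which has about 2n/3
# observations and serves as the target.
def choose_k(X: np.ndarray, k_values, n_splits: int = 50) -> int:
    n = X.shape[0]
    n1 = n // 3                                   # estimation group
    risks = {k: 0.0 for k in k_values}
    rng = np.random.default_rng(0)
    for _ in range(n_splits):
        idx = rng.permutation(n)
        X1, X2 = X[idx[:n1]], X[idx[n1:]]         # X2 has about 2n/3 rows
        target = np.cov(X2, rowvar=False)         # target covariance
        for k in k_values:
            diff = banded_cov(X1, k) - target
            risks[k] += np.abs(diff).sum(axis=0).max()
    return min(risks, key=risks.get)
```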
