
Estimation of Covariance Matrix


Estimation of population covariance matrices from samples of multivariate data is important in many applications, including:

(1) Estimation of principal components and eigenvalues.

(2) Construction of linear discriminant functions.

(3) Establishing independence and conditional independence.

(4) Setting confidence intervals on linear functions.

Suppose we observe $p$-dimensional multivariate samples $X_1, X_2, \cdots, X_n$ i.i.d. with mean $0$ and covariance matrix $\Sigma_p$, and write

$$X_i = (X_{i1}, X_{i2}, \cdots, X_{ip})^T.$$

Our goal is to estimate $\Sigma_p$. For simplicity, we first consider the Gaussian case, where

$$X_i \sim N(0, \Sigma_p).$$

When $p$ is fixed and does not depend on $n$, the empirical covariance matrix is a good estimator, i.e.
$$\hat{\Sigma}_p = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})^T.$$
In this case, $\hat{\Sigma}_p$ is a consistent estimator and the rate of convergence is $n^{-1/2}$, which is optimal.

But when $p$ is large (larger than $n$, or growing with $n$), this estimator can perform very badly.

If $p/n \to c \in (0, 1)$ and the covariance matrix $\Sigma_p = I$, then the empirical distribution of the eigenvalues of $\hat{\Sigma}_p$ is supported on $((1 - \sqrt{c})^2, (1 + \sqrt{c})^2)$. Thus the larger $p/n$ is, the more spread out the eigenvalues become, and in terms of the $\| \cdot \|_2$ norm, $\hat{\Sigma}_p$ is not consistent.
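This eigenvalue spreading is easy to see in simulation. The following minimal sketch (our own illustration, not part of the original notes; all variable names are ours) draws Gaussian data with $\Sigma_p = I$ and compares the extreme sample eigenvalues with the predicted support endpoints.

```python
import numpy as np

# Illustration: with Sigma_p = I and p/n = c, the eigenvalues of the
# empirical covariance matrix spread over ((1 - sqrt(c))^2, (1 + sqrt(c))^2)
# instead of concentrating near 1.
rng = np.random.default_rng(0)
n, p = 2000, 1000                          # c = p/n = 0.5
X = rng.standard_normal((n, p))            # X_i ~ N(0, I_p)
S = X.T @ X / n                            # empirical covariance (mean known to be 0)
eigs = np.linalg.eigvalsh(S)
c = p / n
print(eigs.min(), eigs.max())              # close to the endpoints below
print((1 - c**0.5)**2, (1 + c**0.5)**2)    # 0.0858..., 2.9142...
```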

In fact, when $p > n$ we need to estimate $p \times p$ parameters based on only $n \times p$ observations. So to get a good estimator, the covariance matrix $\Sigma_p$ must have some special form. In practice, one of the most important classes of covariance matrices is

$$F(\epsilon, \alpha, C) = \left\{\Sigma = (\sigma_{ij}) : \max_j \sum_{i : |i-j| > k} |\sigma_{ij}| \le Ck^{-\alpha} \text{ for all } k > 0, \text{ and } 0 < \epsilon \le \text{eigenvalues}(\Sigma) \le 1/\epsilon \right\}.$$

This is the class of matrices whose entries decay away from the diagonal. For example, the AR(1) covariance $\sigma_{ij} = \rho^{|i-j|}$ with $|\rho| < 1$ decays geometrically off the diagonal and so satisfies the condition for a suitable $C$. There are several methods for estimating covariance matrices of this type. We first introduce the banding method.

1 Banding methods

To evaluate the performance of an estimator, we will use the matrix $\ell_2$ norm. Let us first introduce the estimation procedures.

1.1 Banding the covariance matrix

For any matrix $M = (m_{ij})_{p \times p}$ and any $0 \le k < p$, define

$$B_k(M) = (m_{ij} I(|i - j| \le k)).$$

Then we can estimate the covariance matrix by $\hat{\Sigma}_{k,p} = B_k(\hat{\Sigma}_p)$ for some $k$.
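As a concrete sketch of this estimator (our own code; the function names band and banded_cov are illustrative, not from any particular library):

```python
import numpy as np

# B_k(M): keep the entries within distance k of the diagonal, zero the rest.
def band(M: np.ndarray, k: int) -> np.ndarray:
    i, j = np.indices(M.shape)
    return np.where(np.abs(i - j) <= k, M, 0.0)

# The banded covariance estimator Sigma_hat_{k,p} = B_k(Sigma_hat_p),
# where Sigma_hat_p is the centered empirical covariance matrix.
def banded_cov(X: np.ndarray, k: int) -> np.ndarray:
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    return band(Xc.T @ Xc / n, k)
```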

Theorem 1 If $k \propto (n^{-1} \log p)^{-1/(2(\alpha+1))}$, then
$$\|\hat{\Sigma}_{k,p} - \Sigma_p\|_2 = O_p\left(\left(\frac{\log p}{n}\right)^{\alpha/(2(\alpha+1))}\right) = \|\hat{\Sigma}_{k,p}^{-1} - \Sigma_p^{-1}\|_2 \qquad (1)$$
uniformly on $F(\epsilon, \alpha, C)$.

Proof. For any matrix $M$, let $|M|$ denote the maximum absolute entry of $M$, and let $\hat{\Sigma}_p^0 = \frac{1}{n} \sum_{i=1}^n X_i X_i^T$; then
$$\|B_k(\hat{\Sigma}_p^0) - B_k(\Sigma_p)\|_2 = O_p(\|B_k(\hat{\Sigma}_p^0) - B_k(\Sigma_p)\|_\infty) = O_p(k |B_k(\hat{\Sigma}_p^0) - B_k(\Sigma_p)|).$$

Then from Lemma 1 below we know that

$$P(|B_k(\hat{\Sigma}_p^0) - B_k(\Sigma_p)| \ge t) \le (2k+1) p \exp\{-n t^2 \gamma(\epsilon, \lambda)\}, \quad \text{for } |t| \le \lambda(\epsilon).$$
By choosing $t = M(\log(pk)/n)^{1/2}$ we conclude that, uniformly on $F$,

$$|B_k(\hat{\Sigma}_p^0) - B_k(\Sigma_p)| = O_p((n^{-1} \log p)^{1/2}).$$

On the other hand,

$$\|B_k(\Sigma_p) - \Sigma_p\|_2 \le Ck^{-\alpha}$$
for $\Sigma_p \in F$. And

$$\|B_k(\hat{\Sigma}_p^0) - B_k(\hat{\Sigma}_p)\|_2 \le \|B_k(\bar{X} \bar{X}^T)\|_2 \le (2k+1) \max_j |\bar{X}_j|^2 = O_p(k \log p / n),$$

where $\bar{X} = (\bar{X}_1, \cdots, \bar{X}_p)^T$. Putting these together, we have the result.

Lemma 1 Let $Z_i$ be i.i.d. $N(0, \Sigma_p)$ and $\lambda_{\max}(\Sigma_p) \le \epsilon^{-1} < \infty$. Then

$$P\left(\left|\sum_{i=1}^{n} (Z_{ij} Z_{ik} - \sigma_{jk})\right| \ge nv\right) \le C_1 \exp(-C_2 n v^2)$$
for $|v| \le \delta$, where $C_1$, $C_2$ and $\delta$ depend on $\epsilon$ only.

Proof. Let $W_i = Z_{ij} Z_{ik} - \sigma_{jk}$; then the $W_i$ are i.i.d. random variables with $E(W_i) = 0$ and $\mathrm{Var}(W_i) = \sigma_{jj}\sigma_{kk} + \sigma_{jk}^2$. The lemma then follows from a general large deviation result.

1.2 Banding the inverse

In the previous section, we estimated the covariance matrix by banding the empirical covariance matrix. This estimator has some nice properties, but it is not guaranteed to be a positive definite matrix and hence may not be the ideal estimator in some applications.

In this part, we will introduce a procedure that gives us a positive definite estimator.

This procedure is based on the modified Cholesky decomposition of the covariance matrix.

Suppose $Z = (Z_1, Z_2, \cdots, Z_p)^T \sim N(0, \Sigma_p)$. Then for any $1 < j \le p$ we have

$$E(Z_j | Z_1, Z_2, \cdots, Z_{j-1}) = \sum_{i=1}^{j-1} a_{j,i} Z_i,$$
where the coefficients can be computed as

$$a_j = (a_{j,1}, \cdots, a_{j,j-1})^T = (\mathrm{Var}(Z_1, Z_2, \cdots, Z_{j-1}))^{-1} (\mathrm{Cov}(Z_1, Z_j), \mathrm{Cov}(Z_2, Z_j), \cdots, \mathrm{Cov}(Z_{j-1}, Z_j))^T.$$

Let the lower triangular matrix $A$ with zeros on the diagonal contain the coefficients $a_j$ arranged in its rows. Let $\epsilon_j = Z_j - E(Z_j | Z_1, Z_2, \cdots, Z_{j-1})$ (with $\epsilon_1 = Z_1$), let $d_j^2 = \mathrm{Var}(\epsilon_j)$, and let $D = \mathrm{diag}(d_1^2, \cdots, d_p^2)$. Then by calculating $\mathrm{Cov}[(I - A)Z]$, we know that

$$\Sigma_p = (I - A)^{-1} D [(I - A)^{-1}]^T,$$
$$\Sigma_p^{-1} = (I - A)^T D^{-1} (I - A).$$
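The decomposition can be checked numerically; the sketch below (our own code, with illustrative names) builds $A$ and $D$ from the population regressions above and verifies both identities.

```python
import numpy as np

# Build A (rows of regression coefficients a_j, zero diagonal) and
# D = diag(d_1^2, ..., d_p^2) from a covariance matrix Sigma, then verify
# Sigma = (I - A)^{-1} D [(I - A)^{-1}]^T and
# Sigma^{-1} = (I - A)^T D^{-1} (I - A).
def regression_decomposition(Sigma: np.ndarray):
    p = Sigma.shape[0]
    A = np.zeros((p, p))
    d2 = np.empty(p)
    d2[0] = Sigma[0, 0]                      # eps_1 = Z_1
    for j in range(1, p):
        V = Sigma[:j, :j]                    # Var(Z_1, ..., Z_{j-1})
        c = Sigma[:j, j]                     # Cov(Z_i, Z_j) for i < j
        A[j, :j] = np.linalg.solve(V, c)     # a_j
        d2[j] = Sigma[j, j] - c @ A[j, :j]   # d_j^2 = Var(eps_j)
    return A, np.diag(d2)

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
Sigma = B @ B.T + 5 * np.eye(5)              # some positive definite covariance
A, D = regression_decomposition(Sigma)
M = np.linalg.inv(np.eye(5) - A)
print(np.allclose(M @ D @ M.T, Sigma))       # True
print(np.allclose((np.eye(5) - A).T @ np.linalg.inv(D) @ (np.eye(5) - A),
                  np.linalg.inv(Sigma)))     # True
```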

Now we can approximate the covariance matrix $\Sigma_p$ by banding $A$ and modifying $D$. For $k < p$, instead of calculating $E(Z_j | Z_1, Z_2, \cdots, Z_{j-1})$, we calculate $E(Z_j | Z_{\max(j-k,1)}, \cdots, Z_{j-1})$, i.e. we project $Z_j$ onto the space generated by its $k$ closest predecessors. Let $a_j^k$ denote the resulting coefficients, and let $D_k = \mathrm{diag}(d_{j,k}^2)$, where $d_{j,k}^2 = \mathrm{Var}(Z_j - E(Z_j | Z_{\max(j-k,1)}, \cdots, Z_{j-1}))$. Then the $k$-banded approximations $\Sigma_{p,k}$ and $\Sigma_{p,k}^{-1}$ are obtained by replacing $A$, $D$ in the equations above by $A_k$ and $D_k$, where $A_k$ contains the coefficients $a_j^k$.

For given samples $X_i = (X_{i1}, \cdots, X_{ip})$, $i = 1, 2, \cdots, n$, the estimates of $A_k$ and $D_k$ are obtained by ordinary least squares. Plugging them into the previous equations gives the final estimates of $\Sigma_p$ and $\Sigma_p^{-1}$, which we refer to as $\tilde{\Sigma}_{p,k}$ and $\tilde{\Sigma}_{p,k}^{-1}$.

Note that both estimators are positive definite. $\tilde{\Sigma}_{p,k}^{-1}$ is $k$-banded but $\tilde{\Sigma}_{p,k}$ is generally not banded.
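A sample version is sketched below (our own code, assuming centered data as above): each column is regressed on its $k$ closest predecessors by ordinary least squares, giving the estimated $A_k$ and $D_k$, and the estimate of $\Sigma_p^{-1}$ is assembled from them; it is positive definite by construction.

```python
import numpy as np

# Banded-inverse estimator: OLS of each X_j on its k closest predecessors.
def banded_inverse_estimate(X: np.ndarray, k: int) -> np.ndarray:
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    A = np.zeros((p, p))
    d2 = np.empty(p)
    d2[0] = (Xc[:, 0] ** 2).mean()               # eps_1 = Z_1
    for j in range(1, p):
        lo = max(j - k, 0)
        Z = Xc[:, lo:j]                          # the k closest predecessors
        a, *_ = np.linalg.lstsq(Z, Xc[:, j], rcond=None)
        A[j, lo:j] = a
        d2[j] = ((Xc[:, j] - Z @ a) ** 2).mean() # estimated d_{j,k}^2
    I_A = np.eye(p) - A
    return I_A.T @ np.diag(1.0 / d2) @ I_A       # (I - A_k)^T D_k^{-1} (I - A_k)
```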

Now for any covariance matrix $\Sigma$, if $\Sigma^{-1} = T(\Sigma)^T D^{-1} T(\Sigma)$ with $T(\Sigma) = [t_{ij}(\Sigma)]$ lower triangular, let
$$F^{-1}(\epsilon, \alpha, C) = \left\{\Sigma : \max_i \sum_{j < i-k} |t_{ij}(\Sigma)| \le Ck^{-\alpha} \text{ for all } 0 < k < p, \text{ and } 0 < \epsilon \le \text{eigenvalues}(\Sigma) \le 1/\epsilon \right\}.$$

We have the following result.

Theorem 2 Uniformly on $F^{-1}(\epsilon, \alpha, C)$, if $k \propto (n^{-1} \log p)^{-1/(2(\alpha+1))}$ and $n^{-1} \log p = o(1)$, then
$$\|\tilde{\Sigma}_{p,k}^{-1} - \Sigma_p^{-1}\|_2 = O_p\left(\left(\frac{\log p}{n}\right)^{\alpha/(2(\alpha+1))}\right).$$

Proof. The proof is similar to that of the previous theorem. We only need to check that
$$\|\tilde{\Sigma}_{p,k}^{-1} - \Sigma_{p,k}^{-1}\|_\infty = O_p\left(\left(\frac{\log p}{n}\right)^{1/2}\right),$$
and
$$\|\Sigma_{p,k}^{-1} - B_k(\Sigma_p^{-1})\|_2 = O(k^{-\alpha}).$$

Lemma 2 Under the conditions of Theorem 2, uniformly on $F^{-1}$,
$$\max\{\|\tilde{a}_j^k - a_j^k\|_\infty : 1 \le j \le p\} = O_p\left(\left(\frac{\log p}{n}\right)^{1/2}\right),$$
$$\max\{|\tilde{d}_{j,k}^2 - d_{j,k}^2| : 1 \le j \le p\} = O_p\left(\left(\frac{\log p}{n}\right)^{\alpha/(2(\alpha+1))}\right),$$
and
$$\|A_k\|_2 = \|D_k^{-1}\|_2 = O(1).$$

1.3 Choice of k

An intuitive way to choose the banding parameter k is to minimize the risk

$$R(k) = E\|\hat{\Sigma}_k - \Sigma\|_1,$$
with the oracle $k$ being the minimizer of $R(k)$.

In practice, this can be approximately achieved by cross validation. Randomly split the sample into two groups and use the sample covariance matrix of one group (with sample size about $2n/3$) as the target to choose $k$.
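A sketch of this scheme (our own code; it reuses the banded_cov function from Section 1.1, and the number of random splits is our choice):

```python
import numpy as np

# Choose k by random-split cross validation: band the covariance of one
# group and compare it, in matrix l1 norm (max absolute column sum), with
# the plain sample covariance of the other group, which has about 2n/3
# observations and serves as the target.
def choose_k(X: np.ndarray, k_values, n_splits: int = 50) -> int:
    n = X.shape[0]
    n1 = n // 3                                   # estimation group
    risks = {k: 0.0 for k in k_values}
    rng = np.random.default_rng(0)
    for _ in range(n_splits):
        idx = rng.permutation(n)
        X1, X2 = X[idx[:n1]], X[idx[n1:]]         # X2 has about 2n/3 rows
        target = np.cov(X2, rowvar=False)         # target covariance
        for k in k_values:
            diff = banded_cov(X1, k) - target
            risks[k] += np.abs(diff).sum(axis=0).max()
    return min(risks, key=risks.get)
```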
