Asymptotic Distribution of the Largest Off-Diagonal Entry of Correlation Matrices
Transactions of the American Mathematical Society, Volume 359, Number 11, November 2007, Pages 5345–5363
S 0002-9947(07)04192-X; article electronically published on May 11, 2007

WANG ZHOU

Abstract. Suppose that we have $n$ observations from a $p$-dimensional population. We are interested in testing that the $p$ variates of the population are independent in the situation where $p$ goes to infinity as $n \to \infty$. The test statistic is chosen to be $L_n = \max_{1\le i<j\le p}|\rho_{ij}|$, where $\rho_{ij}$ is the sample correlation coefficient between the $i$-th and the $j$-th coordinates of the population. Under the independence hypothesis, we prove that the asymptotic distribution of $L_n$ is an extreme-value distribution of type $G_1$, by using the Chen-Stein Poisson approximation method and moderate deviations for sample correlation coefficients. As a statistically more relevant result, a limit distribution for $l_n = \max_{1\le i<j\le p}|r_{ij}|$, where $r_{ij}$ is Spearman's rank correlation coefficient between the $i$-th and the $j$-th coordinates of the population, is derived.

Received by the editors April 25, 2005 and, in revised form, September 5, 2005.
2000 Mathematics Subject Classification. Primary 60F05, 62G20, 62H10.
Key words and phrases. Sample correlation matrices, Spearman's rank correlation matrices, Chen-Stein method, moderate deviations.
The author was supported in part by grants R-155-000-035-112 and R-155-050-055-133/101 at the National University of Singapore.

1. Introduction

Suppose $(X, Y), (X_i, Y_i), i = 1, \dots, n$ are independent and identically distributed (i.i.d.) random vectors from a bivariate distribution $F(x, y)$. To measure the closeness of two random variables $X$ and $Y$, we use Pearson's sample correlation coefficient, which is defined by
$$
\rho_{XY} = \frac{\sum_{i=1}^{n}(X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum_{i=1}^{n}(X_i - \bar X)^2 \, \sum_{i=1}^{n}(Y_i - \bar Y)^2}},
\tag{1.1}
$$
where $\bar X = \sum_{i=1}^{n} X_i/n$ and $\bar Y = \sum_{i=1}^{n} Y_i/n$. Obviously, $\rho_{XY}$ is a natural estimator of the population correlation coefficient,
$$
\rho = \frac{E(X - EX)(Y - EY)}{\sqrt{E(X - EX)^2 \, E(Y - EY)^2}}.
$$
In the case of a bivariate normal distribution, $\rho_{XY}$ is the maximum likelihood estimator of $\rho$. In addition, it is a commonly used statistic for testing the null hypothesis that the two variables $X$ and $Y$ are independent, i.e.,
$$
H_0 : \rho = 0 \quad \text{versus} \quad H_1 : \rho \neq 0.
$$
Although many authors have studied the distribution of $\rho_{XY}$ (see Hotelling [6] and the references therein), the best present-day usage in dealing with sample correlation coefficients is based on R. A. Fisher's work published in 1915; see [3].
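As a quick computational companion to (1.1), here is a minimal sketch in Python with NumPy (ours, not from the paper; the function name `pearson_rho` is hypothetical) that evaluates the sample correlation coefficient exactly as defined above.

```python
import numpy as np

def pearson_rho(x, y):
    """Pearson's sample correlation coefficient rho_XY as in (1.1)."""
    xc = x - x.mean()                      # X_i - X bar
    yc = y - y.mean()                      # Y_i - Y bar
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
y = rng.standard_normal(200)               # independent of x, so rho_XY should be near 0
print(pearson_rho(x, y))
```

For independent samples the value hovers around 0 at the $O(n^{-1/2})$ scale, which is what makes the behavior of the maximum over many pairs, studied below, delicate.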
Now suppose we have a $p$-dimensional distribution with mean $\mu$ and covariance matrix $\Sigma$. In the last three or four decades, owing to the fast development of modern technology, in many research areas, including signal processing, network security, image processing, genetics, the stock market and other economic problems, we are faced with the task of analyzing data of ever increasing dimension $p$. Naturally, one may ask how to test independence among the $p$ components of the population. For example, air pollutant levels are influenced by several meteorological conditions, such as wind speed, temperature, humidity, insolation, etc. The relationship of air pollutant levels to meteorological conditions can be described by regression models. But an essential part of constructing regression models is to check that all the meteorological conditions of interest are independent or uncorrelated.

Under the additional normality assumption, Johnstone [9] derived the asymptotic distribution of the largest eigenvalue of the sample covariance matrix in order to test the hypothesis $H_0 : \Sigma = I$, assuming $\mu = 0$. On the other hand, Jiang [7] proposed to use a more intuitive independence test statistic, the largest off-diagonal entry of the sample correlation matrix, to test $H_0$, assuming the population has at least a 30th moment.

Before proceeding further, let us introduce some notation. Let $(X_{k1}, \dots, X_{kp})$, $k = 1, \dots, n$, be a sample of $n$ independent observations taken from a nondegenerate $p$-variate population. Let $X_i = (X_{1i}, \dots, X_{ni})^T$, $i = 1, \dots, p$, be the columns of the $n \times p$ matrix $\mathbf{X}_n = (X_{ki})$, $k = 1, \dots, n$, $i = 1, \dots, p$. Hence $\mathbf{X}_n = (X_1, X_2, \dots, X_p)$. Introduce the average $\bar X_i = \sum_{k=1}^{n} X_{ki}/n$. We write $X_i - \bar X_i$ for $X_i - \bar X_i e$, where $e = (1, 1, \dots, 1)^T \in \mathbb{R}^n$. Then $\rho_{ij}$, the sample correlation coefficient between $X_i$ and $X_j$, is defined by
$$
\rho_{ij} = \frac{(X_i - \bar X_i)^T (X_j - \bar X_j)}{\|X_i - \bar X_i\| \, \|X_j - \bar X_j\|}, \qquad 1 \le i < j \le p,
\tag{1.2}
$$
where $\|\cdot\|$ is the Euclidean norm. If the columns of $\mathbf{X}_n$ are independent, all the $\rho_{ij}$ should be close to 0. In other words, the biggest off-diagonal entry of the sample correlation matrix $\Gamma_n = (\rho_{ij})$,
$$
L_n = \max_{1 \le i < j \le p} |\rho_{ij}|,
$$
has to be small. So we can use $L_n$ as an independence test statistic. Under the i.i.d. assumption, Jiang [7] found an asymptotic distribution of $L_n$.

Theorem A (Jiang [7]). Suppose that $E|X|^{30+\varepsilon} < \infty$ for some $\varepsilon > 0$. If $n/p \to \gamma \in (0, \infty)$, then as $n \to \infty$, for any $y \in \mathbb{R}$,
$$
P(nL_n^2 - a_n \le y) \to \exp\bigl\{-(\gamma^2\sqrt{8\pi})^{-1} e^{-y/2}\bigr\},
\tag{1.3}
$$
where $a_n = 4\log n - \log\log n$.

Jiang's result makes it possible to test independence of the components of vectors in large dimensional data problems when strong structural assumptions on the population, such as normality, are not present. But the conditions of his theorem are too restrictive. For example, a $(30+\varepsilon)$th moment is needed to obtain the asymptotic distribution of $L_n$. So is it possible to reduce the high-order moment requirements? On the other hand, in practice we have no guarantee that the ratio of the sample size to the dimension converges. Sometimes the dimension may increase much more slowly than the sample size. Clearly, one may ask what can be done in those situations. In this paper, we try to answer the above questions through the following theorem.

Theorem 1.1. Assume that the entries of the matrix $\mathbf{X}_n = (X_{ki})$ are i.i.d. Let $p = p_n$ satisfy
$$
\lim_{n\to\infty} p = \infty, \qquad \limsup_{n\to\infty} p/n < \infty.
\tag{1.4}
$$
Write $b_n = 4\log p - \log\log p$. Then
$$
\lim_{n\to\infty} P(nL_n^2 - b_n \le y) = \exp\bigl\{-K e^{-y/2}\bigr\}, \qquad K = (8\pi)^{-1/2},
\tag{1.5}
$$
provided that
$$
\lim_{x\to\infty} x^6\, P\bigl(|X\tilde X| > x\bigr) = 0,
\tag{1.6}
$$
where $X$ and $\tilde X$ are independent copies of $X_{11}$. In particular, (1.6) holds if
$$
EX^6 < \infty.
\tag{1.7}
$$
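To see how Theorem 1.1 could be used as a test, the following sketch (ours, not the paper's; `largest_offdiag_test` is a hypothetical name) computes $L_n$ from an $n \times p$ data matrix, centers $nL_n^2$ by $b_n$, and converts the result into an approximate p-value by treating the limit law $\exp\{-K e^{-y/2}\}$ as the null distribution. The paper itself only states the distributional limit; the p-value recipe is one natural way to apply it.

```python
import numpy as np

K = 1.0 / np.sqrt(8.0 * np.pi)               # K = (8*pi)^(-1/2) from (1.5)

def largest_offdiag_test(X):
    """Approximate independence test based on Theorem 1.1.

    X: n-by-p data matrix whose entries are i.i.d. under H0.
    Returns L_n and an approximate p-value from the limit law in (1.5).
    """
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)         # p x p sample correlation matrix
    L_n = np.abs(R - np.eye(p)).max()        # largest off-diagonal |rho_ij|
    b_n = 4.0 * np.log(p) - np.log(np.log(p))
    y = n * L_n**2 - b_n                     # centered test statistic
    p_value = 1.0 - np.exp(-K * np.exp(-y / 2.0))
    return L_n, p_value

rng = np.random.default_rng(1)
X = rng.standard_normal((400, 50))           # H0 true: i.i.d. N(0, 1) entries
print(largest_offdiag_test(X))               # p-value should not be small
```

Note that $b_n$ depends only on $p$, so no unknown parameter enters the calibration; this is exactly the advantage of the centering constants discussed in the remarks below.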
We shall make the following remarks.

(1) Using the Chebyshev inequality, we can see that the condition $EX^6 < \infty$ is stronger than the weak moment assumption (1.6) on the product $X\tilde X$. Through Example 1 in Section 4, we explain that the weak moment condition (1.6) seems to be optimal.

(2) The limit distributions in (1.5) and (1.3) are different. This is due to the different choices of centering constants. Our choice of centering constants has the advantage that the limit distribution does not depend on unknown parameters, which is really useful in statistical applications. Of course, under Jiang's conditions, the relations (1.5) and (1.3) are mathematically equivalent, since in this case $\lim_{n\to\infty}(a_n - b_n) = 4\log\gamma$.

(3) To prove Theorem 1.1 we use the Poisson approximation technique, which was suggested by Barbour and Eagleson [2], and self-normalized moderate deviations studied by Shao [10], Wang and Jing [14], Shao [11], and Jing, Shao and Wang [8].

Now we can see that the sample correlation coefficient is still useful in large dimensional statistical inference even if high-order moment assumptions are violated. But if strong structural assumptions on the population, such as normality, are not present, the sample correlation coefficient does not retain much interest, because $\rho$ is not invariant under rearrangement of the values of the independent random variables. More precisely, if $Z = \psi(X)$, where $\psi$ is one-to-one and measurable, the correlation coefficient between $X$ and $Y$ may be very different from that between $Z$ and $Y$. Another disadvantage of the sample correlation coefficient is that $\rho = 0$ does not necessarily imply independence. Therefore other measures of independence can be of interest. For independence test problems, it was Spearman [13] who introduced Spearman's rank correlation coefficient into statistics.

For a column $X_i = (X_{1i}, \dots, X_{ni})^T$, we order the elements $X_{1i}, \dots, X_{ni}$ of $X_i$ from least to greatest, and let $Q_{ki}$ denote the rank of $X_{ki}$. Replacing the matrix $\mathbf{X}_n = (X_{ki})$ by the matrix $(Q_{ki})$, similarly to (1.2) we can define the so-called Spearman rank correlation coefficient between $X_i$ and $X_j$ as
$$
r_{ij} = \frac{\sum_{k=1}^{n} Q_{ki}Q_{kj} - \frac{1}{n}\bigl(\sum_{k=1}^{n} Q_{ki}\bigr)\bigl(\sum_{k=1}^{n} Q_{kj}\bigr)}{\sqrt{\sum_{k=1}^{n}(Q_{ki} - \bar Q_i)^2 \, \sum_{k=1}^{n}(Q_{kj} - \bar Q_j)^2}}
= \frac{12}{n(n^2-1)} \sum_{k=1}^{n} Q_{ki}Q_{kj} - \frac{3(n+1)}{n-1},
\tag{1.8}
$$
where $\bar Q_i = \bar Q_j = (n+1)/2$. All the $r_{ij}$ constitute Spearman's rank correlation matrix $R_n = (r_{ij})_{1\le i,j\le p}$. The most important point is that the distribution of $r_{ij}$ does not depend on nuisance parameters. Much information on $r_{ij}$ can be found in Hájek, Šidák and Sen [4].
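To connect (1.8) with computation, here is a small sketch (ours, with hypothetical names; ties are not handled, matching the continuous-population setting in which ranks are almost surely distinct). It forms the rank matrix $(Q_{ki})$, evaluates $r_{ij}$ through the closed form on the right-hand side of (1.8), and returns $l_n = \max_{1\le i<j\le p} |r_{ij}|$.

```python
import numpy as np

def ranks(col):
    """1-based ranks Q_ki of a column; assumes no ties (continuous population)."""
    r = np.empty(col.size)
    r[np.argsort(col)] = np.arange(1, col.size + 1)
    return r

def spearman_l_n(X):
    """l_n = max_{i<j} |r_ij|, with r_ij computed via the closed form in (1.8)."""
    n, p = X.shape
    Q = np.column_stack([ranks(X[:, i]) for i in range(p)])
    # (Q.T @ Q)[i, j] = sum_k Q_ki Q_kj, so S is Spearman's matrix R_n
    S = 12.0 / (n * (n**2 - 1)) * (Q.T @ Q) - 3.0 * (n + 1) / (n - 1)
    return np.abs(S - np.diag(np.diag(S))).max()  # zero out the unit diagonal

rng = np.random.default_rng(2)
X = rng.standard_normal((300, 40))
print(spearman_l_n(X))
```

Because the ranks are a fixed permutation of $1, \dots, n$ under any continuous marginal, the value of $l_n$ computed this way has the same null distribution regardless of the underlying population, which is precisely the distribution-free property emphasized above.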