Robust Scatter Matrix Estimation for High Dimensional Distributions with Heavy Tail Junwei Lu, Fang Han, and Han Liu
Total Page:16
File Type:pdf, Size:1020Kb
IEEE TRANSACTIONS ON INFORMATION THEORY 1 Robust Scatter Matrix Estimation for High Dimensional Distributions with Heavy Tail Junwei Lu, Fang Han, and Han Liu Abstract—This paper studies large scatter matrix estimation distribution family, the pair-elliptical. The pair-elliptical family for heavy tailed distributions. The contributions of this paper is strictly larger and requires less symmetry structure than the are twofold. First, we propose and advocate to use a new elliptical. We provide detailed studies on the relation between distribution family, the pair-elliptical, for modeling the high dimensional data. The pair-elliptical is more flexible and easier the pair-elliptical and several heavy tailed distribution families, to check the goodness of fit compared to the elliptical. Secondly, including the nonparanormal, elliptical, and transelliptical. built on the pair-elliptical family, we advocate using quantile- Moreover, it is easier to test the goodness of fit for the based statistics for estimating the scatter matrix. For this, we pair-elliptical. For conducting such a test, we combine the provide a family of quantile-based statistics. They outperform the existing results in low dimensions (20; 21; 22; 23; 24) with existing ones for better balancing the efficiency and robustness. In particular, we show that the propose estimators have compa- the familywise error rate controlling techinques including the rable performance to the moment-based counterparts under the Bonferonni’s correction, the Holm’s step-down procedure (25), Gaussian assumption. The method is also tuning-free compared and the higher criticism method (26; 27). to Catoni’s M-estimator for covariance matrix estimation. We Secondly, built on the pair-elliptical family, we propose further apply the method to conduct a variety of statistical a new set of quantile-based statistics for estimating scat- methods. The corresponding theoretical properties as well as 1 numerical performances are provided. ter/covariance matrices . We also provide the theoretical prop- erties of the proposed methods. In particular, we show that Index Terms—Heavy-tailed distribution; Pair-Elliptical Distri- the proposed quantile-based methods outperform the existing bution; Quantile-based statistics; Scatter matrix. ones for better balancing the robustness and efficiency. As applications, we exploit the proposed estimators for conduct- I. INTRODUCTION ing several high dimensional statistical methods, and show Large covariance matrix estimation is a core problem in the advantages of using the quantile-based statistics both multivariate statistics. Pearson’s sample covariance matrix is theoretically and empirically. widely used for estimation and proves to enjoy certain optimal- ity under the subgaussian assumption (1;2;3;4;5). However, A. Other Related Works this assumption is not realistic in many real applications where data are heavy-tailed (6;7;8). The quantile-based statistics, such as the median absolute To handle heavy-tailed data, rank-based statistics are pro- deviation (29) and the Qn estimators (30; 31), have been used posed. Compared to Pearson’s sample covariance, rank-based in estimating marginal standard deviations. Their properties estimators achieve extra efficiency via exploiting the dataset’s in parameter estimation and robustness to outliers are further geometric structures. Such structures, like symmetry, are nat- studied in low dimensions (32). Moreover, these estimators urally involved in the data generating scheme and allow for have been generalized to estimate the dispersions between both efficient and robust inference. Conducting rank-based random variables (33; 34; 35; 36). covariance matrix estimation includes two steps. The first step Given these results, we mainly make three contributions: is to estimate the (latent) correlation matrix. For this, (9), (10), (i) Methodologically, we propose new quantile-based scatter (11), (12), (13), (14), and (15) exploit Spearman’s rho and matrix estimators that generalize the existing MAD and Qn Kendall’s tau estimators. They work under the nonparanormal estimators for better balancing the efficiency and robustness. or the transelliptical distribution family. The second step (ii) Theoretically, we provide more understandings on the is to estimate marginal variances. For this, (16), (17), and quantile-based methods. They confirm that the quantile-based (18) exploit Catoni’s M-estimator (19). However, Catoni’s estimators are also good alternatives to the prevailing moment- estimator requires to tune parameters. Moreover, it is sensitive based estimators in high dimensions. (iii) We propose a projec- to outliers and accordingly is not a robust estimator. tion method for overcoming the lack of positive semidefinite- In this paper, we strengthen the results in the literature in ness, which is typical in the robust scatter matrix estimation. two directions. First, we propose and advocate to use a new This approach maintains the efficiency as well as robustness to data contamination, while the prevailing SVD decomposition Junwei Lu is with the Department of Biostatistics, Harvard T.H. Chan approach (36) cannot. School of Public Health, Boston, MA (e-mail: [email protected]) Of note, the effectiveness of quantile-base methods is being Fang Han is with Department of Statistics, University of Washington, Seattle, WA (e-mail: [email protected]) realized in other fields in high dimensional statistics. For Han Liu is with the Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL (e-mail: han- 1The scatter matrix is any matrix proportional to the covariance matrix. See [email protected]). (28) for more details. IEEE TRANSACTIONS ON INFORMATION THEORY 2 example, (37), (38), and (39) provide analysis on the penalized The random variable X is heavy tailed if there exists some quantile regression and show that it can handle the case that p > 0 such that the noise term is very heavy-tailed. Our method, although very jjXjj ≤ K ≤ 1 for q ≤ p; and jjXjj = 1: different from theirs, shares similar properties. Lq Lp+1 The heavy tailedness of X is measured by how large p could B. Notation System be such that the p-th moment exists. d×d In the following we first provide an upper bound of the Let M = [Mjk] 2 R be a matrix and v = T d sample mean. It illustrates the “optimal rate but sub-optimal (v1; :::; vd) 2 R be a vector. We denote vI to be the subvector of v whose entries are indexed by a set I ⊂ scaling” phenomenon. f1; : : : ; dg. We denote M to be the submatrix of M T d I;J Theorem II.1. Suppose X = (X1;:::;Xd) 2 R is whose rows and columns are indexed by I and J. For a random vector with the population mean µ. Assume X γ 0 < q < 1, we define the `0, `q, and `1 vector (pseudo- satisfies jjXjjjL ≤ K, where we assume d = O(n ) and Pd p )norms to be jjvjj0 := j=1 I(vj 6= 0); jjvjjq := p = 2+2γ+δ. Letting µ¯ be the sample mean of n independent Pd q 1=q ( j=1 jvjj ) ; and jjvjj1 := max1≤j≤d jvjj: Here I(·) observations of X, we then have, with probability no smaller represents the indicator function. For a matrix M, we than 1 − 2d−2:5 − (log d)p=2n−δ=2, denote the matrix `q, `max, and Frobenius `F-norms of r log d M to be jjMjjq := maxkvkq =1 kMvkq; jjMjjmax := jjµ¯ − µjj1 ≤ 12K · : P 2 1=2 n maxjk jMjkj; and jjMjjF := ( j;k jMjkj ) : For any d×d p matrix M 2 R , we denote diag(M) to be the diagonal Theorem II.1 shows that, for preserving the OP ( log d=n) d×d matrix with the same diagonal entries as M, and Id 2 R to rate of convergence, p determines how large the dimension d be the d by d identity matrix. Let λj(M) and uj(M) represent can be compared to n. For example, when at most (4 + )-th the j-th largest eigenvalue and the corresponding eigenvector moment exists for X for some > 0, the sample mean attains T p of M, and hM1; M2i := Tr(M1 M2) be the inner product of the optimal rate OP ( log d=n) under the suboptimal scaling M1 and M2. For any two random vectors X and Y , we write d = O(n). X =D Y if and only if X and Y are identically distributed. The results in Theorem II.1 cannot be improved without Throughout the paper, we let c; C be two generic absolute adding more assumptions. Via a worst case analysis, the next constants, whose values may vary at different locations. theorem characterizes the sharpness of Theorem II.1. Theorem II.2. For any fixed constant C, p = 2 + 2γ with γ+δ0 C. Paper Organization γ > 0, and d = n for some δ0 > 0, there exists certain The rest of the paper is organized as follows. In the next random vector X, satisfying section we provide the theoretical evaluation of the impacts jjX jj < K; for some absolute constant K > 0 of heavy tails on the moment-based estimators. This motivates i Lq our work. Section III proposes the pair-elliptical family, and and all q ≤ p, such that, with probability tending to 1, we reveals the connection among the Gaussian, elliptical, non- have r paranormal, transelliptical, and pair-elliptical. In SectionIV, C log d jjµ¯ − µjj1 ≥ : we introduce the generalized MAD and Qn estimators for n estimating scatter/covariance matrices. SectionV provides the Theorems II.1 and Theorem II.2 together illustrate the con- theoretical results. SectionVI discusses parameter selection. In straints of applying moment-based estimators to study heavy Section VII, we apply the proposed estimators to conduct mul- tailed distributions. This motivates us to consider alternative tiple multivariate methods. We put experiments on synthetic methods that are more efficient in handling heavy tailedness. and real data in Section VIII, more discussions in SectionIX, and technical proofs in the appendix.