
Robust Scatter Matrix Estimation for High Dimensional Distributions with Heavy Tail

Junwei Lu, Fang Han, and Han Liu

Junwei Lu is with the Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA (e-mail: [email protected]). Fang Han is with the Department of Statistics, University of Washington, Seattle, WA (e-mail: [email protected]). Han Liu is with the Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL (e-mail: [email protected]).

Abstract—This paper studies large scatter matrix estimation for heavy tailed distributions. The contributions of this paper are twofold. First, we propose and advocate using a new distribution family, the pair-elliptical, for modeling high dimensional data. The pair-elliptical is more flexible, and its goodness of fit is easier to check, compared to the elliptical. Secondly, built on the pair-elliptical family, we advocate using quantile-based statistics for estimating the scatter matrix. For this, we provide a family of quantile-based statistics. They outperform the existing ones by better balancing efficiency and robustness. In particular, we show that the proposed estimators have performance comparable to the moment-based counterparts under the Gaussian assumption. The method is also tuning-free compared to Catoni's M-estimator. We further apply the method to conduct a variety of statistical methods. The corresponding theoretical properties as well as numerical performances are provided.

Index Terms—Heavy-tailed distribution; Pair-elliptical distribution; Quantile-based statistics; Scatter matrix.

I. INTRODUCTION

Large covariance matrix estimation is a core problem in multivariate statistics. Pearson's sample covariance matrix is widely used for estimation and proves to enjoy certain optimality under the subgaussian assumption (1; 2; 3; 4; 5). However, this assumption is not realistic in many real applications where data are heavy-tailed (6; 7; 8).

To handle heavy-tailed data, rank-based statistics have been proposed. Compared to Pearson's sample covariance, rank-based estimators achieve extra efficiency via exploiting the dataset's geometric structures. Such structures, like symmetry, are naturally involved in the data generating scheme and allow for both efficient and robust inference. Conducting rank-based covariance matrix estimation involves two steps. The first step is to estimate the (latent) correlation matrix. For this, (9), (10), (11), (12), (13), (14), and (15) exploit Spearman's rho and Kendall's tau estimators. They work under the nonparanormal or the transelliptical distribution family. The second step is to estimate the marginal variances. For this, (16), (17), and (18) exploit Catoni's M-estimator (19). However, Catoni's estimator requires tuning parameters. Moreover, it is sensitive to outliers and accordingly is not a robust estimator.

In this paper, we strengthen the results in the literature in two directions. First, we propose and advocate using a new distribution family, the pair-elliptical. The pair-elliptical family is strictly larger and requires less symmetry structure than the elliptical. We provide detailed studies on the relation between the pair-elliptical and several heavy tailed distribution families, including the nonparanormal, elliptical, and transelliptical. Moreover, it is easier to test the goodness of fit for the pair-elliptical. For conducting such a test, we combine the existing results in low dimensions (20; 21; 22; 23; 24) with familywise error rate controlling techniques including Bonferroni's correction, Holm's step-down procedure (25), and the higher criticism method (26; 27).

Secondly, built on the pair-elliptical family, we propose a new set of quantile-based statistics for estimating scatter/covariance matrices^1. We also provide the theoretical properties of the proposed methods. In particular, we show that the proposed quantile-based methods outperform the existing ones by better balancing robustness and efficiency. As applications, we exploit the proposed estimators for conducting several high dimensional statistical methods, and show the advantages of using the quantile-based statistics both theoretically and empirically.

^1 The scatter matrix is any matrix proportional to the covariance matrix. See (28) for more details.

A. Other Related Works

The quantile-based statistics, such as the median absolute deviation (29) and the Qn estimator (30; 31), have been used in estimating marginal standard deviations. Their properties in parameter estimation and robustness to outliers have been further studied in low dimensions (32). Moreover, these estimators have been generalized to estimate the dispersions between random variables (33; 34; 35; 36).

Given these results, we mainly make three contributions: (i) Methodologically, we propose new quantile-based scatter matrix estimators that generalize the existing MAD and Qn estimators for better balancing efficiency and robustness. (ii) Theoretically, we provide more understanding of the quantile-based methods. The results confirm that the quantile-based estimators are also good alternatives to the prevailing moment-based estimators in high dimensions. (iii) We propose a projection method for overcoming the lack of positive semidefiniteness, which is typical in robust scatter matrix estimation. This approach maintains efficiency as well as robustness to data contamination, while the prevailing SVD decomposition approach (36) cannot.

Of note, the effectiveness of quantile-based methods is being realized in other fields of high dimensional statistics. For example, (37), (38), and (39) provide analysis of the penalized quantile regression and show that it can handle the case that the noise term is very heavy-tailed. Our method, although very different from theirs, shares similar properties.

B. Notation System

Let M = [M_{jk}] ∈ R^{d×d} be a matrix and v = (v_1, ..., v_d)^T ∈ R^d be a vector. We denote v_I to be the subvector of v whose entries are indexed by a set I ⊂ {1, ..., d}. We denote M_{I,J} to be the submatrix of M whose rows and columns are indexed by I and J. For 0 < q < ∞, we define the ℓ_0, ℓ_q, and ℓ_∞ vector (pseudo-)norms to be

||v||_0 := Σ_{j=1}^d I(v_j ≠ 0), ||v||_q := (Σ_{j=1}^d |v_j|^q)^{1/q}, and ||v||_∞ := max_{1≤j≤d} |v_j|.

Here I(·) represents the indicator function. For a matrix M, we denote the matrix ℓ_q, ℓ_max, and Frobenius ℓ_F norms of M to be

||M||_q := max_{||v||_q=1} ||Mv||_q, ||M||_max := max_{j,k} |M_{jk}|, and ||M||_F := (Σ_{j,k} |M_{jk}|^2)^{1/2}.

For any matrix M ∈ R^{d×d}, we denote diag(M) to be the diagonal matrix with the same diagonal entries as M, and I_d ∈ R^{d×d} to be the d by d identity matrix. Let λ_j(M) and u_j(M) represent the j-th largest eigenvalue and the corresponding eigenvector of M, and let ⟨M_1, M_2⟩ := Tr(M_1^T M_2) be the inner product of M_1 and M_2. For any two random vectors X and Y, we write X =^D Y if and only if X and Y are identically distributed. Throughout the paper, we let c, C be two generic absolute constants, whose values may vary at different locations.

C. Paper Organization

The rest of the paper is organized as follows. In the next section we provide the theoretical evaluation of the impacts of heavy tails on the moment-based estimators. This motivates our work. Section III proposes the pair-elliptical family and reveals the connection among the Gaussian, elliptical, nonparanormal, transelliptical, and pair-elliptical. In Section IV, we introduce the generalized MAD and Qn estimators for estimating scatter/covariance matrices. Section V provides the theoretical results. Section VI discusses parameter selection. In Section VII, we apply the proposed estimators to conduct multiple multivariate methods. We put experiments on synthetic and real data in Section VIII, more discussions in Section IX, and technical proofs in the appendix.

II. IMPACTS OF HEAVY TAILS ON MOMENT-BASED ESTIMATORS

This section illustrates the motivation of quantile-based estimators. In particular, we show how moment-based estimators fail for heavy tailed data. These estimators include the sample mean and sample covariance matrix. Such estimators are known to be efficient under stringent moment assumptions (5). However, their performance drops when such assumptions are violated (40; 9).

We characterize the heavy tailedness by the L_p norm. In detail, for any random variable X ∈ R and integer p ≥ 1, we define the L_p norm of X as

||X||_{L_p} := (E|X|^p)^{1/p}.

The random variable X is heavy tailed if there exists some p > 0 such that

||X||_{L_q} ≤ K < ∞ for q ≤ p, and ||X||_{L_{p+1}} = ∞.

The heavy tailedness of X is measured by how large p can be such that the p-th moment exists.

In the following we first provide an upper bound for the sample mean. It illustrates the "optimal rate but sub-optimal scaling" phenomenon.

Theorem II.1. Suppose X = (X_1, ..., X_d)^T ∈ R^d is a random vector with population mean µ. Assume X satisfies ||X_j||_{L_p} ≤ K, where we assume d = O(n^γ) and p = 2 + 2γ + δ. Letting µ̄ be the sample mean of n independent observations of X, we then have, with probability no smaller than 1 − 2d^{−2.5} − (log d)^{p/2} n^{−δ/2},

||µ̄ − µ||_∞ ≤ 12K · √(log d / n).

Theorem II.1 shows that, for preserving the O_P(√(log d/n)) rate of convergence, p determines how large the dimension d can be compared to n. For example, when at most the (4 + ε)-th moment exists for X for some ε > 0, the sample mean attains the optimal rate O_P(√(log d/n)) under the suboptimal scaling d = O(n).

The results in Theorem II.1 cannot be improved without adding more assumptions. Via a worst case analysis, the next theorem characterizes the sharpness of Theorem II.1.

Theorem II.2. For any fixed constant C, p = 2 + 2γ with γ > 0, and d = n^{γ+δ_0} for some δ_0 > 0, there exists a random vector X satisfying

||X_i||_{L_q} < K, for some absolute constant K > 0

and all q ≤ p, such that, with probability tending to 1, we have

||µ̄ − µ||_∞ ≥ C √(log d / n).

Theorems II.1 and II.2 together illustrate the constraints of applying moment-based estimators to study heavy tailed distributions. This motivates us to consider alternative methods that are more efficient in handling heavy tailedness.

III. PAIR-ELLIPTICAL DISTRIBUTION

In this section, we introduce the pair-elliptical distribution family. We first briefly review several existing distribution families: Gaussian, elliptical, nonparanormal, and transelliptical. Then we elaborate the relations between the pair-elliptical and the aforementioned families.

A. Multivariate Distribution Families

We start by first introducing the elliptical distribution. The elliptical family contains symmetric but possibly very heavy tailed distributions.

Definition III.1 (Elliptical distribution, (41)). A d-dimensional random vector X is said to follow an elliptical distribution if and only if there exist a vector µ ∈ R^d, a nonnegative random variable ξ ∈ R, a matrix A ∈ R^{d×q} (q ≤ d) of rank q, and a random vector U ∈ R^q uniformly distributed on the q-dimensional sphere S^{q−1} and independent of ξ, such that

X =^D µ + ξAU.

In this case, we represent X ∼ EC_d(µ, S, ξ), where S := AA^T is of rank q.

Remark III.2. An equivalent definition of the elliptical distribution is: Any random vector X ∼ EC_d(µ, S, ξ) is elliptically distributed if and only if the characteristic function of X is of the form exp(it^T µ)φ(t^T S t), where i is the imaginary number satisfying i² = −1, φ is a properly defined characteristic function, and there exists a one to one map between ξ and φ. In this case, we represent X ∼ EC_d(µ, S, φ). Moreover, when the elliptical distribution is absolutely continuous, the density function is of the form g((x − µ)^T S^{−1}(x − µ)) for some nonnegative function g(·). In this case, we represent X ∼ EC_d(µ, S, g).

Although elliptical distributions have been extensively explored in modeling many real world data, including financial (42; 43; 44; 45) and imaging data (46; 47), they can be quite restrictive due to the symmetry constraint (48). One way to handle asymmetric data is to exploit the copula technique. This results in the transelliptical family (meta-elliptical family) proposed and discussed in (49) and (8). Below we give the formal definition of the transelliptical distribution in (8).

Definition III.3 (Transelliptical distribution, (8)). A continuous random vector X = (X_1, ..., X_d)^T follows a transelliptical distribution, denoted by X ∼ TE_d(Σ^0, ξ; f_1, ..., f_d), if there exist univariate strictly increasing functions f_1, ..., f_d such that

(f_1(X_1), ..., f_d(X_d))^T ∼ EC_d(0, Σ^0, ξ),   (III.1)

where diag(Σ^0) = I_d and P(ξ = 0) = 0. In particular, when

(f_1(X_1), ..., f_d(X_d))^T ∼ N_d(0, Σ^0), where diag(Σ^0) = I_d,

X follows a nonparanormal distribution (50; 9). Here Σ^0 is called the latent generalized correlation matrix.

B. Pair-Elliptical Distribution

In this section we propose a new distribution family, the pair-elliptical. Compared to the elliptical and transelliptical, the pair-elliptical distribution is of more interest to us. Specifically, it balances the modeling flexibility and interpretability in covariance/scatter matrix estimation.

Definition III.4. A continuous random vector X = (X_1, ..., X_d)^T is said to follow a pair-elliptical distribution, denoted by X ∼ PE_d(µ, S, ξ), if and only if any pair of random variables (X_j, X_k)^T of X is elliptically distributed. In other words, we have

(X_j, X_k)^T ∼ EC_2(µ_{{j,k}}, S_{{j,k},{j,k}}, ξ) for all j ≠ k ∈ {1, ..., d}, and P(ξ = 0) = 0.

As a special example, a distribution is said to be pair-normal, written as PN_d(µ, S), if any pair of X is bivariate Gaussian distributed.

It is obvious that the pair-elliptical family contains the elliptical distribution family. Moreover, the elliptical is a strict subfamily of the pair-elliptical, as the following example shows.

Example III.5. Let f(X_1, X_2, X_3) be the density function of a three dimensional standard Gaussian distribution with median 0 and covariance matrix I_3, and let X = (X_1, X_2, X_3)^T be a 3-dimensional random vector with the density function

g(X_1, X_2, X_3) = 2f(X_1, X_2, X_3), if X_1 X_2 X_3 ≥ 0; g(X_1, X_2, X_3) = 0, otherwise.   (III.2)

The distribution in Example III.5 with density in (III.2) is bivariate Gaussian for all pairwise marginal distributions, and therefore belongs to the pair-elliptical family. On the other hand, this distribution is marginally Gaussian distributed but not multivariate Gaussian distributed, and accordingly cannot be elliptically distributed.

Example III.5 also shows that the pair-elliptical distribution can be asymmetric. Moreover, the pair-elliptical distribution has a naturally defined scatter matrix S, which is proportional to the covariance matrix Σ when Eξ² exists. This makes the pair-elliptical compatible with many multivariate methods such as principal component analysis and linear discriminant analysis.

The rest of this section focuses on characterizing the relations among the Gaussian, elliptical^2, transelliptical, nonparanormal, pair-elliptical, and pair-normal families. Recall that in this paper we are only interested in continuous distributions with existing densities. It is obvious that the Gaussian family is a strict subfamily of the elliptical, and the elliptical is also a strict subfamily of both the transelliptical and the pair-elliptical. The next proposition shows that the only intersection between the elliptical and the nonparanormal is the Gaussian.

^2 In the rest of this section we only focus on the continuous elliptical distributions with P(ξ = 0) = 0. And we are only interested in those whose covariance matrix is not the identity.

Proposition III.6 ((14)). If a random vector is both nonparanormally and elliptically distributed, it must follow a Gaussian distribution.

In the next proposition, we show that the only intersection between the transelliptical and the pair-elliptical is the elliptical.

Proposition III.7. If a random vector is both transelliptically and pair-elliptically distributed, it must follow an elliptical distribution.

We defer the proof of Proposition III.7 to the appendix.
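To make Example III.5 concrete, the density g in (III.2) can be sampled by rejection, and the "pairwise Gaussian but not jointly Gaussian" behavior checked numerically. The following is a minimal sketch, assuming numpy and scipy are available; all variable names are illustrative:

    import numpy as np
    from scipy import stats

    # Rejection sampler for (III.2): draw Z ~ N_3(0, I_3) and keep draws
    # with Z1*Z2*Z3 >= 0. The acceptance region has probability 1/2, so
    # accepted draws have density 2f on that region, i.e., exactly g.
    rng = np.random.default_rng(0)
    Z = rng.normal(size=(400_000, 3))
    X = Z[Z.prod(axis=1) >= 0]

    # Each pair is standard bivariate Gaussian, so (X1 + X2)/sqrt(2) is
    # standard normal and a KS test should not reject:
    print(stats.kstest((X[:, 0] + X[:, 1]) / np.sqrt(2), "norm").pvalue)

    # But the third joint moment E[X1*X2*X3] = (2/pi)^{3/2} ~ 0.51 > 0,
    # which is impossible for a centered trivariate Gaussian:
    print(X.prod(axis=1).mean())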

In the end, let us consider the relation between the pair-normal and all the other distribution families. By definition, the pair-normal is a strict subfamily of the pair-elliptical. On the other hand, the next proposition shows that any randomly scaled version of the pair-normal is pair-elliptically distributed.

Proposition III.8. Let Y ∼ PN_d(µ, S) follow a pair-normal distribution. Then for any nonnegative random variable ξ with P(ξ = 0) = 0 and independent of Y, X = µ' + ξY follows a pair-elliptical distribution.

Finally, we have the following proposition, which characterizes the pair-normal's connections to the elliptical and nonparanormal distributions.

Proposition III.9. For the pair-normal, elliptical, and nonparanormal distributions, we have
(i) The only intersection between the pair-normal and elliptical is the Gaussian;
(ii) The only intersection between the pair-normal and nonparanormal is the Gaussian.

In conclusion, the Venn diagram in Figure 1 summarizes the relation among the Gaussian, elliptical, nonparanormal, transelliptical, pair-normal, and pair-elliptical. From the figure, we can see that the Gaussian distribution is located in the central area, where the covariance can be well estimated by the sample covariance matrix. The transelliptical covers the left-hand side of the diagram, where we advocate using rank-based estimators to estimate the covariance matrix. The pair-elliptical covers a new regime on the right-hand side of the diagram, where we will introduce the quantile-based estimators for estimating the covariance matrix.

Fig. 1. The Venn diagram illustrating the relations of the Gaussian, elliptical, nonparanormal, transelliptical, pair-normal, and pair-elliptical families. Here "elliptical*" represents the continuous elliptical family with P(ξ = 0) = 0.

C. Goodness of Fit Test of the Pair-Elliptical

This section proposes a goodness of fit test of the pair-elliptical. The pair-elliptical family has its advantages here: Both the transelliptical and elliptical require global geometric constraints over all covariates; in comparison, the pair-elliptical only requires a local pairwise symmetry structure, which can be more easily checked. In this section, we combine the test of elliptical symmetry proposed in (24) with the step-down procedure in (25) for performing the pair-elliptical goodness of fit test.

Specifically, we propose a test of the pair-elliptical:

H_0: The data are pair-elliptically distributed.   (III.3)

The proposed test is in two steps: In the first step, we test the pairwise elliptical symmetry; in the second step, we use the Holm's step-down procedure to control the family-wise error.

In the first step, we apply the statistic proposed in (24) for testing pairwise elliptical symmetry. Given bivariate observations Z_1, ..., Z_n (for a pair of coordinates of X, Z_i := (X_{ij}, X_{ik})^T), let Z̄ and Σ̂ be the sample mean and sample covariance of {Z_i}_{i=1}^n. We standardize the data by letting Y_i := Σ̂^{−1/2}(Z_i − Z̄) and t(Z_i) := √2 Ỹ_i/√(w_i), where Ỹ_i := (Y_{i1} + Y_{i2})/2 and w_i := Σ_{j=1}^2 (Y_{ij} − Ỹ_i)² for i = 1, ..., n. Under H_0, we have t(Z_i) →^D t_1 for i = 1, ..., n, where t_1 is the t distribution with degree of freedom 1. To study the goodness-of-fit of the t-distribution, we define M := [√n], where [·] represents the integer part of a real number, and E := n/M. Let T_ℓ be the ℓ/M × 100% quantile of the t_1 distribution for ℓ = 0, ..., M, where T_0 := −∞ and T_M := +∞. We also denote the observed frequency O_ℓ := |{t(Z_i): T_{ℓ−1} < t(Z_i) ≤ T_ℓ}| for 1 ≤ ℓ ≤ M. (24) consider the following Pearson's chi-squared test statistic:

Z({Z_i}) := Σ_{ℓ=1}^M (O_ℓ − E)² / E.

By its nature, Z({Z_i}) is asymptotically chi-squared distributed with M − 1 degrees of freedom.

In the second step, we screen the data to find whether there is any pair {X_j, X_k} that does not follow an elliptical distribution. Considering the following null hypothesis for any 1 ≤ j, k ≤ d:

H_{jk}: {X_j, X_k} are elliptically distributed,   (III.4)

we use the Holm's step-down procedure (25) to control the family-wise error rate. Denote the p-values of Z({(X_{ij}, X_{ik})}) by π_{jk} and let m_{jk} be the rank statistic of π_{jk} such that

m_{jk} := |{π_{j'k'} | π_{j'k'} < π_{jk}, 1 ≤ j', k' ≤ d}|.

The Holm's adjusted p-values are defined as

π^H_{jk} := max{ 1 − (1 − π_{j'k'})^{t_{j'k'}} | m_{j'k'} ≤ m_{jk}, j', k' },   (III.5)

where t_{jk} = 1 − 2(m_{jk} − 1)/(d(d − 1)). Applying the adjusted p-values, we reject H_{jk} if π^H_{jk} is smaller than the level of significance α. Let ω_0 := {(j, k) | H_{jk} in (III.4) is true, 1 ≤ j ≠ k ≤ d}; (25) shows that we can control the family-wise error rate as

P_{ω_0}(π^H_{jk} ≤ α for some (j, k) ∈ ω_0) ≤ α.

Under the setting of the goodness of fit test and H_0 in (III.3), we have ω_0 = {(j, k) | 1 ≤ j ≠ k ≤ d} and therefore

P_{H_0}(π^H_{jk} ≤ α for some j ≠ k) ≤ α.
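The two-step test is straightforward to implement. Below is a minimal sketch, assuming numpy/scipy; the helper names are ours, and the step-down adjustment uses the standard Šidák–Holm exponent (the number of hypotheses not yet rejected), which matches (III.5) up to the parametrization of t_{jk}:

    import numpy as np
    from scipy import stats

    def pair_symmetry_pvalue(Z):
        # Z: (n, 2) observations for one pair of coordinates.
        n = Z.shape[0]
        S = np.cov(Z, rowvar=False)
        w, V = np.linalg.eigh(S)
        S_inv_half = (V * w ** -0.5) @ V.T          # S^{-1/2}
        Y = (Z - Z.mean(axis=0)) @ S_inv_half
        Ytil = Y.mean(axis=1)                       # (Y_i1 + Y_i2)/2
        wi = ((Y - Ytil[:, None]) ** 2).sum(axis=1)
        t = np.sqrt(2.0) * Ytil / np.sqrt(wi)       # ~ t_1 under H_0
        M = int(np.sqrt(n))
        edges = stats.t.ppf(np.linspace(0.0, 1.0, M + 1), df=1)
        O = np.histogram(t, bins=edges)[0]
        E = n / M
        chi2 = ((O - E) ** 2 / E).sum()
        return stats.chi2.sf(chi2, df=M - 1)        # p-value

    def holm_sidak_adjust(pvals):
        # Step-down adjustment over K hypotheses, monotonicity enforced.
        p = np.asarray(pvals)
        K = len(p)
        adj = np.empty(K)
        running = 0.0
        for rank, idx in enumerate(np.argsort(p)):
            running = max(running, 1.0 - (1.0 - p[idx]) ** (K - rank))
            adj[idx] = running
        return adj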

IV. QUANTILE-BASED SCATTER MATRIX ESTIMATION

This section introduces the quantile-based scatter matrix estimators. To this end, we first briefly review the existing quantile-based estimators, including MAD and Qn proposed in (29) and (30). Secondly, we generalize these two estimators for estimating scatter matrices. Thirdly, we introduce the projection idea for constructing a positive semidefinite scatter matrix estimator.

A. Robust Scale Estimation

This section briefly reviews the robust estimators of the marginal standard deviation. To this end, we first define the population and sample versions of the quantile function. For any random variable X and r ∈ (0, 1), the population r-th quantile is

Q(X; r) := inf{x : P(X ≤ x) ≥ r},

and, for any set of points {x_1, ..., x_n}, we denote the corresponding sample r-th quantile by Q̂({x_i}_{i=1}^n; r).

Built on the quantile function, the median absolute deviation (MAD, (29)) estimates the standard deviation σ(X_j) of the j-th entry of X via the median of the absolute deviations from the median,

σ^MAD(X_j) := c^MAD · Q(|X_j − Q(X_j; 1/2)|; 1/2),

where c^MAD is a scaling constant making σ^MAD consistent for σ under the Gaussian model (c^MAD = 1/Φ^{−1}(3/4) ≈ 1.4826). The Qn estimator (30) is instead built on the first quartile of the pairwise gaps,

σ^Qn(X_j) := c^Qn · Q(|X_j − X̃_j|; 1/4),

where X̃_j is an independent copy of X_j and c^Qn is another constant comparable to c^MAD (c^Qn ≈ 2.2219 under the Gaussian model). Qn is known to be more efficient than MAD.

B. Robust Scatter Matrix Estimators

In this section, we propose our approach for estimating the scatter matrix. This is via generalizing the aforementioned quantile-based scale estimators.

Assume X_1, ..., X_n are n independent observations of a d-dimensional random vector X = (X_1, ..., X_d)^T with X_i = (X_{i1}, ..., X_{id})^T. We first propose to estimate the marginal standard deviations by generalizing the MAD and Qn estimators. Specifically, we define the generalized median absolute deviation (gMAD) as follows: For the j-th entry of X, the population and sample versions of gMAD are:

gMAD: σ^M(X_j; r) := Q(|X_j − Q(X_j; 1/2)|; r),
      σ̂^M(X_j; r) := Q̂({|X_{ij} − Q̂({X_{ij}}_{i=1}^n; 1/2)|}_{i=1}^n; r).   (IV.2)

Here the median is replaced by the r × 100% quantile^3. We then define the generalized Qn estimator (gQNE) using the same idea. The population and sample versions of the gQNE for the j-th entry of X are:

gQNE: σ^Q(X_j; r) := Q(|X_j − X̃_j|; r),
      σ̂^Q(X_j; r) := Q̂({|X_{ij} − X_{i'j}|}_{i<i'}; r),   (IV.3)

where X̃_j is an independent copy of X_j. In particular, gMAD and gQNE reduce to the (unscaled) MAD and Qn when r = 1/2 and r = 1/4, respectively.

^3 Later we will show that using the r-th quantile instead of the median can potentially increase the efficiency of the estimator, at the cost of losing some robustness.

For measuring the dispersion between any two entries X_j and X_k, we exploit the polarization identity Cov(X_j, X_k) = {σ(X_j + X_k)² − σ(X_j − X_k)²}/4 and define

σ^M(X_j, X_k; r) := {(σ^M(X_j + X_k; r))² − (σ^M(X_j − X_k; r))²}/4,
σ^Q(X_j, X_k; r) := {(σ^Q(X_j + X_k; r))² − (σ^Q(X_j − X_k; r))²}/4,

with the sample versions σ̂^M(X_j, X_k; r) and σ̂^Q(X_j, X_k; r) defined accordingly. Based on these, we define the population matrices R^{M;r} and R^{Q;r} by

R^{M;r}_{jj} = (σ^M(X_j; r))², R^{M;r}_{jk} = R^{M;r}_{kj} = σ^M(X_j, X_k; r);
R^{Q;r}_{jj} = (σ^Q(X_j; r))², R^{Q;r}_{jk} = R^{Q;r}_{kj} = σ^Q(X_j, X_k; r).

In a later section we will show that R^{M;r} and R^{Q;r} are indeed scatter matrices under the pair-elliptical family. Let R̂^{M;r} and R̂^{Q;r} be the empirical versions of R^{M;r} and R^{Q;r}, obtained by replacing σ^M(·) and σ^Q(·) with σ̂^M(·) and σ̂^Q(·). R̂^{M;r} and R̂^{Q;r} are the proposed robust scatter matrix estimators.

There are two remarks. First, we do not discuss how to select r in this section; this will be studied in more detail in Section VI. Secondly, we note that R̂^{M;r} and R̂^{Q;r} are both symmetric matrices by definition. However, they are not necessarily positive semidefinite. We discuss this issue in the next section.
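For concreteness, a direct numpy implementation of R̂^{M;r} and R̂^{Q;r} is sketched below (names are illustrative; the gQNE version materializes all n(n−1)/2 pairwise gaps, so it is intended for moderate n):

    import numpy as np

    def gmad_scale(x, r):
        # sigma^M(x; r): r-th quantile of |x - median(x)|, cf. (IV.2).
        return np.quantile(np.abs(x - np.quantile(x, 0.5)), r)

    def gqne_scale(x, r):
        # sigma^Q(x; r): r-th quantile of the pairwise gaps, cf. (IV.3).
        iu = np.triu_indices(len(x), k=1)
        return np.quantile(np.abs(x[:, None] - x[None, :])[iu], r)

    def quantile_scatter(X, r, scale=gmad_scale):
        # R-hat^{M;r} (or R-hat^{Q;r} with scale=gqne_scale): diagonals are
        # squared marginal scales; off-diagonals use the polarization identity.
        n, d = X.shape
        R = np.zeros((d, d))
        for j in range(d):
            R[j, j] = scale(X[:, j], r) ** 2
            for k in range(j + 1, d):
                s_plus = scale(X[:, j] + X[:, k], r)
                s_minus = scale(X[:, j] - X[:, k], r)
                R[j, k] = R[k, j] = (s_plus ** 2 - s_minus ** 2) / 4.0
        return R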

C. Projection Method

In this section we introduce the projection idea to overcome the lack of positive semidefiniteness (PSD) in robust covariance matrix estimation. It is known that, when the dimension is close to or higher than the sample size, a robust covariance matrix estimator can be non-PSD (28). To illustrate this, Figure 2 shows the averaged least eigenvalue of the MAD scatter matrix estimator under the standard multivariate Gaussian model, with the sample size n fixed to be 50 and the dimension d increasing from 2 to 200.

Fig. 2. The plot of the averaged least eigenvalues of a MAD scatter matrix (i.e., R̂^{M;1/2}) against the dimension d ranging from 2 to 200. Here the n = 50 observations come from the standard Gaussian distribution with dimension d, and the simulations are conducted with 100 repetitions.

The lack of PSD can cause problems for many high dimensional multivariate methods. To solve it, we propose a general projection method. In detail, for an arbitrary non-PSD matrix estimator R̂, we consider the projection of R̂ onto the positive semidefinite matrix cone:

R̃ = arg min_{M ⪰ 0} ||M − R̂||,   (IV.4)

where M ⪰ 0 represents that M is PSD and || · || is a certain matrix norm of interest. For any given norm || · ||, a computationally efficient algorithm to solve (IV.4) is given in Supplementary Material Section K.

Due to reasons that will become clearer later, we are interested in the projection with regard to the matrix elementwise supremum norm || · ||_max in (IV.4). Of note, R̃ and R̂ have the same breakdown point because R̃ is independent of the data conditioning on R̂. Moreover, we have the following property of R̃.

Lemma IV.1. Let R̃ be the solution to (IV.4) with a certain matrix norm || · || of interest. We have, for any t ≥ 0 and any R ∈ R^{d×d} with R ⪰ 0,

P(||R̃ − R|| ≥ t) ≤ P(||R̂ − R|| ≥ t/2).

Of note, (36) proposes an alternative approach to solve the non-PSD problem. Their method exploits the SVD decomposition of any given non-PSD matrix. However, Maronna's method is not a robust procedure and is sensitive to outliers. More specifically, Figure 3 shows the averaged distance between the population scatter matrix and three different scatter matrix estimators: the possibly non-PSD MAD estimator (denoted by "MAD"), the PSD estimator calculated using Maronna's SVD decomposition idea (denoted by "SVD"), and the PSD estimator calculated by our projection idea with regard to the || · ||_max norm (denoted by "Projection"). Figures 3 (A) and (B) illustrate the results for standard Gaussian distributed data (i.e., the data follow a N_d(0, I_d) distribution) with 2% and 5% of the points randomly chosen and replaced by +N(3, 3) or −N(3, 3). They show that the PSD estimator obtained by projection is as insensitive as the MAD estimator (and their estimation accuracies are very close). On the other hand, Maronna's method is very sensitive to such data contamination.

Fig. 3. The plot of the averaged estimation errors of the MAD ("MAD") and PSD scatter matrix estimators using the projection and SVD decomposition ideas (denoted by "Projection" and "SVD"), under (A) 2% contamination and (B) 5% contamination. The distances are calculated based on the || · ||_max norm and are plotted against the dimension d ranging from 2 to 100. Here the sample size is 50 and observations come from standard Gaussian distributed data with dimension d, and 2% and 5% of the data points are randomly chosen and replaced by +N(3, 3) or −N(3, 3). The results are obtained based on 200 repetitions.

V. THEORETICAL RESULTS

This section provides the theoretical results for the proposed quantile-based gQNE and gMAD scatter matrix estimators. The section is divided into two parts: In the first part, under the pair-elliptical family, we characterize the relations among the population gQNE and gMAD statistics and Pearson's covariance matrix; in the second part, we provide the theoretical analysis of the gQNE and gMAD estimators.

A. Quantile-Based Estimators under the Pair-Elliptical

In this section we show that the population gMAD and gQNE statistics, R^{M;r} and R^{Q;r}, are scatter matrices of X when X is pair-elliptically distributed.

We first focus on gMAD. The next theorem characterizes a sufficient condition under which R^{M;r} is proportional to the covariance matrix. It also quantifies the scale constant c^{M;r} that connects R^{M;r} to the covariance matrix.

Theorem V.1. Suppose that X = (X_1, ..., X_d)^T is a d-dimensional random vector with covariance matrix Σ ∈ R^{d×d}. Then there exists some constant c^{M;r} such that

R^{M;r} = c^{M;r} Σ,

if for any j ≠ k ∈ {1, ..., d},

√(c^{M;r}) = Q(|X_j − Q(X_j; 1/2)|/σ(X_j); r)
           = Q(|X_j + X_k − Q(X_j + X_k; 1/2)|/σ(X_j + X_k); r)
           = Q(|X_j − X_k − Q(X_j − X_k; 1/2)|/σ(X_j − X_k); r),   (V.1)

and the above quantiles are all unique.
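As a quick illustration of Theorem V.1: under a Gaussian model the normalized deviations in (V.1) are standard normal, so each quantile equals Φ^{−1}((1 + r)/2) and c^{M;r} = (Φ^{−1}((1 + r)/2))². A Monte Carlo check, assuming numpy/scipy:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    r = 0.75
    x = rng.normal(size=2_000_000)
    mc = np.quantile(np.abs(x - np.quantile(x, 0.5)), r)
    # Both values should be close to Phi^{-1}(0.875) ~ 1.1503:
    print(mc, stats.norm.ppf((1 + r) / 2))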

We then study gQNE. The next theorem gives a sufficient condition under which R^{Q;r} is proportional to the covariance matrix and again quantifies the scale constant c^{Q;r}.

Theorem V.2. Suppose that X = (X_1, ..., X_d)^T is a d-dimensional random vector with covariance matrix Σ ∈ R^{d×d}. Let X̃ be an independent copy of X and Z = (Z_1, ..., Z_d)^T := X − X̃. Then there exists some constant c^{Q;r} such that

R^{Q;r} = c^{Q;r} Σ,

if for any j ≠ k ∈ {1, ..., d},

√(c^{Q;r}/2) = Q(|Z_j|/σ(Z_j); r) = Q(|Z_j + Z_k|/σ(Z_j + Z_k); r) = Q(|Z_j − Z_k|/σ(Z_j − Z_k); r),   (V.2)

and the above quantiles are all unique.

For any random variable X ∈ R, Y is said to be the normalized version of X if Y = (X − Q(X; 1/2))/σ(X). Accordingly, (V.1) holds if the normalized versions of X_j, X_j + X_k, and X_j − X_k are all identically distributed, and (V.2) holds if the normalized versions of Z_j, Z_j + Z_k, and Z_j − Z_k are all identically distributed. The next theorem shows that (V.1) and (V.2) hold under the pair-elliptical family.

Theorem V.3. For any pair-elliptically distributed random vector X ∼ PE_d(µ, S, ξ), both R^{M;r} and R^{Q;r} are proportional to S. In particular, when Eξ² < ∞, both R^{M;r} and R^{Q;r} are proportional to the covariance matrix Cov(X), and

c^{M;r} = (Q(X_0; (1 + r)/2))² and c^{Q;r} = 2(Q(Z_0; (1 + r)/2))²,   (V.3)

where X_0 and Z_0 are the normalized versions of X_1 and Z_1.

Remark V.4. Theorem V.3 shows that, under the pair-elliptical family, R^{M;r} and R^{Q;r} are both proportional to Cov(X) when the covariance exists. Of note, by Theorems V.1 and V.2, R^{M;r} or R^{Q;r} is proportional to Cov(X) as long as (V.1) or (V.2) holds, and therefore they can be applied to study a potentially much larger family than the pair-elliptical.

B. Theoretical Properties of gMAD and gQNE

This section studies the estimation accuracy of the proposed scatter matrix estimators R̂^{M;r} and R̂^{Q;r}. We show that the proposed methods are capable of handling heavy-tailed distributions, and shed light on robust alternatives to many multivariate methods in high dimensions.

Before proceeding to the main results, we first introduce some extra notation. For any random vector X = (X_1, ..., X_d)^T and any j ≠ k ∈ {1, ..., d}, we denote by F_{1;j}, F̄_{1;j}, F_{2;j,k}, F̄_{2;j,k}, F_{3;j,k}, and F̄_{3;j,k} the distribution functions of X_j, |X_j − Q(X_j; 1/2)|, X_j + X_k, |X_j + X_k − Q(X_j + X_k; 1/2)|, X_j − X_k, and |X_j − X_k − Q(X_j − X_k; 1/2)|, respectively.

We suppose that, for some constants κ_1 and η_1 that might scale with n, the following assumption holds:

(A1). min_{j, |y − Q(F_{1;j}; 1/2)| < κ_1} (d/dy) F_{1;j}(y) ≥ η_1,
      min_{j, |y − Q(F̄_{1;j}; r)| < κ_1} (d/dy) F̄_{1;j}(y) ≥ η_1,
      min_{j ≠ k, |y − Q(F_{2;j,k}; 1/2)| < κ_1} (d/dy) F_{2;j,k}(y) ≥ η_1,
      min_{j ≠ k, |y − Q(F̄_{2;j,k}; r)| < κ_1} (d/dy) F̄_{2;j,k}(y) ≥ η_1,
      min_{j ≠ k, |y − Q(F_{3;j,k}; 1/2)| < κ_1} (d/dy) F_{3;j,k}(y) ≥ η_1,
      min_{j ≠ k, |y − Q(F̄_{3;j,k}; r)| < κ_1} (d/dy) F̄_{3;j,k}(y) ≥ η_1,

where for a random variable X with distribution function F, we denote Q(F; r) := Q(X; r). Assumption (A1) requires that the density functions do not degenerate around the median or the r-th quantiles. It is easy to check that Assumption (A1) is satisfied with η_1^{−1} = O(√(||Σ||_max)) for the Gaussian distribution. Based on Assumption (A1), we have the following theorem, characterizing the estimation accuracy of the gMAD estimator.

Theorem V.5 (gMAD concentration). Suppose that Assumption (A1) holds and κ_1 is lower bounded by a positive absolute constant. Then we have, for n large enough, with probability no smaller than 1 − 24α²,

||R̂^{M;r} − R^{M;r}||_max ≤ max{ (6/η_1²)(√((log d + log(1/α))/n) + 1/n)², (4√(||R^{M;r}||_max)/η_1)(√((log d + log(1/α))/n) + 1/n) }.

In particular, when X is pair-elliptically distributed with the covariance matrix Σ existing, we have, with probability no smaller than 1 − 24α²,

||R̂^{M;r} − c^{M;r}Σ||_max ≤ max{ (6/η_1²)(√((log d + log(1/α))/n) + 1/n)², (4√(||c^{M;r}Σ||_max)/η_1)(√((log d + log(1/α))/n) + 1/n) }.

Theorem V.5 shows that, when κ_1, η_1, ||Σ||_max, and c^{M;r} are upper and lower bounded by positive absolute constants, the convergence rate of R̂^{M;r} with regard to the || · ||_max norm is O_P(√(log d/n)). This is comparable to the existing results under subgaussian settings (see, for example, Theorem 1 in (4) and the discussion therein).

We then proceed to quantify the estimation accuracy of the gQNE estimator R̂^{Q;r}. Let X̃ = (X̃_1, ..., X̃_d)^T be an independent copy of X. For any j ≠ k ∈ {1, ..., d}, let G_{1;j}, G_{2;j,k}, and G_{3;j,k} be the distribution functions of |X_j − X̃_j|, |X_j + X_k − (X̃_j + X̃_k)|, and |X_j − X_k − (X̃_j − X̃_k)|, respectively.

Suppose that for some constants κ_2 and η_2 that might scale with n, the following assumption holds:

(A2). min_{j, |y − Q(G_{1;j}; r)| < κ_2} (d/dy) G_{1;j}(y) ≥ η_2,
      min_{j ≠ k, |y − Q(G_{2;j,k}; r)| < κ_2} (d/dy) G_{2;j,k}(y) ≥ η_2,
      min_{j ≠ k, |y − Q(G_{3;j,k}; r)| < κ_2} (d/dy) G_{3;j,k}(y) ≥ η_2.

Provided that Assumption (A2) holds, we have the following theorem. It gives the rate of convergence of R̂^{Q;r} with regard to the element-wise supremum norm.

Theorem V.6 (gQNE concentration). Suppose that Assumption (A2) holds and κ_2 is lower bounded by a positive absolute constant. Then we have, for n large enough, with probability no smaller than 1 − 8α,

||R̂^{Q;r} − R^{Q;r}||_max ≤ max{ (2/η_2²)(√((2 log d + log(1/α))/n) + 1/n)², (2√(||R^{Q;r}||_max)/η_2)(√((2 log d + log(1/α))/n) + 1/n) }.

In particular, when X is pair-elliptically distributed with the covariance matrix Σ existing, we have, with probability no smaller than 1 − 8α,

||R̂^{Q;r} − c^{Q;r}Σ||_max ≤ max{ (2/η_2²)(√((2 log d + log(1/α))/n) + 1/n)², (2√(||c^{Q;r}Σ||_max)/η_2)(√((2 log d + log(1/α))/n) + 1/n) }.

Similar to Theorem V.5, when κ_2, η_2, ||Σ||_max, and c^{Q;r} are upper and lower bounded by positive absolute constants, the convergence rate of R̂^{Q;r} is O_P(√(log d/n)). Theorems V.5 and V.6 imply that, under the pair-elliptical family, the quantile-based estimators R̂^{M;r} and R̂^{Q;r} can be good alternatives to the sample covariance matrix.

Remark V.7. Consider the Gaussian distribution with the diagonal values of Σ lower bounded by an absolute constant. Then, for any fixed r ∈ (0, 1) and lower bounded κ_i, i = 1, 2, Assumptions (A1) and (A2) are satisfied with η_1^{−1}, η_2^{−1} = O(√(||Σ||_max)). This implies that

||R̂^{M;r} − c^{M;r}Σ||_max = O_P(||Σ||_max √(log d/n)) and ||R̂^{Q;r} − c^{Q;r}Σ||_max = O_P(||Σ||_max √(log d/n)).

Let R̃^{M;r} and R̃^{Q;r} be the solutions to (IV.4). According to Lemma IV.1, we can also establish concentration results for R̃^{M;r} and R̃^{Q;r}.

Corollary V.8. Under Assumptions (A1) and (A2), we have, with probability no smaller than 1 − 24α²,

||R̃^{M;r} − R^{M;r}||_max ≤ 2 max{ (3/η_1²)(√((log d + log(1/α))/n) + 1/n)², (2√(||R^{M;r}||_max)/η_1)(√((log d + log(1/α))/n) + 1/n) };

and, with probability no smaller than 1 − 8α,

||R̃^{Q;r} − R^{Q;r}||_max ≤ 2 max{ (1/η_2²)(√((2 log d + log(1/α))/n) + 1/n)², (√(||R^{Q;r}||_max)/η_2)(√((2 log d + log(1/α))/n) + 1/n) }.
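The max-norm projection (IV.4), used again by Corollary V.8, can be approximated by bisection on the radius t combined with alternating projections between the PSD cone and the entrywise box of radius t around R̂. This is only a heuristic sketch assuming numpy; the paper's exact algorithm is the one in Supplementary Material Section K:

    import numpy as np

    def psd_project_maxnorm(R_hat, tol=1e-6, inner_iters=200):
        # Approximately solve: minimize ||M - R_hat||_max s.t. M PSD.
        def proj_psd(M):
            w, V = np.linalg.eigh((M + M.T) / 2)
            return (V * np.maximum(w, 0)) @ V.T
        def proj_box(M, t):
            return np.clip(M, R_hat - t, R_hat + t)
        # M = R_hat + |lambda_min| I is always feasible, giving an upper
        # bound on the optimal radius:
        hi = max(0.0, -np.linalg.eigvalsh((R_hat + R_hat.T) / 2).min())
        lo, M_best = 0.0, proj_psd(R_hat)
        while hi - lo > tol:
            t = (lo + hi) / 2
            M = R_hat.copy()
            for _ in range(inner_iters):   # alternating projections
                M = proj_box(proj_psd(M), t)
            feasible = (np.abs(M - R_hat).max() <= t + tol and
                        np.linalg.eigvalsh(M).min() >= -tol)
            if feasible:
                hi, M_best = t, M
            else:
                lo = t
        return M_best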

VI. SELECTION OF THE PARAMETER r

Theorems V.5 and V.6 show that the estimation accuracy of the gMAD and gQNE estimators depends on the selection of the parameter r. In particular, the estimation error in estimating R^{M;r} and R^{Q;r} is related to η_1, η_2 and c^{M;r}, c^{Q;r}. On the other hand, r determines the breakdown points of R̂^{M;r} and R̂^{Q;r}. Accordingly, the parameter r reflects the tradeoff between efficiency and robustness.

This section focuses on selecting the parameter r. The idea is to explore the parameter r that makes the corresponding estimator attain the highest statistical efficiency, given that the breakdown point is kept within a predetermined critical range. Using Theorems V.5 and V.6, ||R̂^{M;r}/c^{M;r} − Σ||_max and ||R̂^{Q;r}/c^{Q;r} − Σ||_max are small when η_1√(c^{M;r}) and η_2√(c^{Q;r}) are large. Therefore, we aim at finding a parameter r such that the first derivatives of {F̄_{1;j}, F̄_{2;j,k}, F̄_{3;j,k}} or {G_{1;j}, G_{2;j,k}, G_{3;j,k}} in a small interval around r, times √(c^{M;r}) or √(c^{Q;r}), are the highest.

To this end, we separately estimate the derivatives and the scale parameters √(c^{M;r}) and √(c^{Q;r}). First, we estimate the derivatives of {F̄_{1;j}, F̄_{2;j,k}, F̄_{3;j,k}} or {G_{1;j}, G_{2;j,k}, G_{3;j,k}} using the kernel density estimator (51). For example, for calculating the derivative of F̄_{1;j}, we propose to use the data points

|X_{1j} − Q̂({X_{ij}}_{i=1}^n; 1/2)|, ..., |X_{nj} − Q̂({X_{ij}}_{i=1}^n; 1/2)|.

After obtaining the density estimators {f̂_{1;j}, f̂_{2;j,k}, f̂_{3;j,k}} or {ĝ_{1;j}, ĝ_{2;j,k}, ĝ_{3;j,k}}, we calculate the estimators ĉ^{M;r} and ĉ^{Q;r} of c^{M;r} and c^{Q;r} by comparing the scale of the standard deviation and its robust alternative (either gMAD or gQNE) for any chosen entry. We denote the empirical cumulative distribution functions of {F̄_{1;j}, F̄_{2;j,k}, F̄_{3;j,k}} and {G_{1;j}, G_{2;j,k}, G_{3;j,k}} by {F̂_{1;j}, F̂_{2;j,k}, F̂_{3;j,k}} and {Ĝ_{1;j}, Ĝ_{2;j,k}, Ĝ_{3;j,k}}, and their inverse functions by {F̂^{−1}_{1;j}, F̂^{−1}_{2;j,k}, F̂^{−1}_{3;j,k}} and {Ĝ^{−1}_{1;j}, Ĝ^{−1}_{2;j,k}, Ĝ^{−1}_{3;j,k}}. In the end, we define the statistics

q^{M;r} := √(ĉ^{M;r}) · min_{j≠k} { f̂_{1;j}(F̂^{−1}_{1;j}(r)), f̂_{2;j,k}(F̂^{−1}_{2;j,k}(r)), f̂_{3;j,k}(F̂^{−1}_{3;j,k}(r)) },
q^{Q;r} := √(ĉ^{Q;r}) · min_{j≠k} { ĝ_{1;j}(Ĝ^{−1}_{1;j}(r)), ĝ_{2;j,k}(Ĝ^{−1}_{2;j,k}(r)), ĝ_{3;j,k}(Ĝ^{−1}_{3;j,k}(r)) }.

The estimators r̂^M and r̂^Q are then obtained as:

r̂^M := Truncate(arg max_{r∈[0,1]} q^{M;r}; δ_1, δ_2), r̂^Q := Truncate(arg max_{r∈[0,1]} q^{Q;r}; δ_1, δ_2),   (VI.1)

where δ_1, δ_2 are two pre-defined constants and, for any v ∈ [0, 1], we write

Truncate(v; δ_1, δ_2) := δ_1, if v ≤ δ_1; v, if δ_1 < v < δ_2; δ_2, if v ≥ δ_2.

Here we control the range of r within [δ_1, δ_2]. This is for making the procedure robust and stable. Based on the empirical results, we recommend setting δ_1 = 0.25 and δ_2 = 0.75. In practice, we can also compare different values of q^{M;r} and q^{Q;r}, and then choose the one that achieves the best balance between estimation accuracy and robustness.

To illustrate the power of the proposed selection procedure, let us consider the following cases: Each time we randomly draw n = 50 i.i.d. standard Gaussian or multivariate t (with 5 degrees of freedom) distributed data points with dimension d ranging from 2 to 100. We explore the gMAD and gQNE estimators with different quantile parameters r and scale all the scatter matrices such that their population versions equal the covariance matrix. We repeat the experiments 200 times. Figure 4 illustrates the averaged estimation errors with regard to the || · ||_max norm. It shows that the procedure using the selected parameter r has on average the smallest estimation error.

Fig. 4. The plot of the averaged estimation accuracy for the gMAD and gQNE estimators (with r = 0.25 to 0.95 and a selected r using the procedure described in Section VI): (A) Gaussian and gMAD; (B) multivariate t and gMAD; (C) Gaussian and gQNE; (D) multivariate t and gQNE. The distances are calculated based on the || · ||_max norm and are plotted against the dimension d ranging from 2 to 100. Here the n = 50 observations come from standard Gaussian or multivariate t distributed data with dimension d. The results are obtained based on 200 repetitions.

VII. APPLICATIONS

We now turn to various consequences of Theorems V.5 and V.6 in conducting multiple multivariate statistical methods in high dimensions. We mainly focus on the methods based on the gQNE estimator, while noting that the analysis for the methods based on the gMAD estimator is similar.

A. Sparse Covariance Matrix Estimation

We begin with estimating the sparse covariance matrix in high dimensions. Suppose we have

X_1, X_2, ..., X_n i.i.d. ∼ X ∈ R^d with covariance matrix Σ.

Our target is to estimate the covariance matrix Σ itself. Pearson's sample covariance matrix performs poorly in high dimensions (52; 53). One common remedy is to assume that the covariance matrix is sparse. This motivates different regularization procedures (2; 54; 55; 3; 56; 57).

We focus on the method in (56) to illustrate the power of quantile-based statistics. This method directly produces a positive definite estimator and does not require any extra structure for the covariance matrix except for sparsity. The model we focus on is:

M^{COV−Q}(Σ; s) := {X ∈ M^Q(Σ): λ_d(Σ) > 0 and card({(j, k): Σ_{jk} ≠ 0, j ≠ k}) ≤ s}.

Motivated by the above model, we solve the following optimization problem for obtaining the estimator Σ̃:

Σ̃ = arg min_{M=M^T} (1/2)||Σ̂ − M||_F² + λ Σ_{j≠k} |M_{j,k}|, s.t. λ_d(M) ≥ ε,   (VII.1)

where Σ̂ can be any positive semidefinite matrix based on the quantile-based scatter matrix estimator for approximating the covariance matrix, λ is a tuning parameter inducing sparsity, and ε is a small enough constant.
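Problem (VII.1) decomposes cleanly under an ADMM-style splitting: the penalized Frobenius term has an entrywise soft-thresholding prox, and the eigenvalue constraint is enforced by spectral clipping. A minimal sketch assuming numpy (this mirrors, but is not, the exact algorithm of (56)):

    import numpy as np

    def sparse_pd_cov(Sigma_hat, lam, eps=1e-5, rho=1.0, iters=300):
        # ADMM for: min_M 0.5*||Sigma_hat - M||_F^2 + lam*sum_{j!=k}|M_jk|
        #           s.t. lambda_min(M) >= eps.
        d = Sigma_hat.shape[0]
        off = ~np.eye(d, dtype=bool)
        Z = Sigma_hat.copy()
        U = np.zeros_like(Z)
        for _ in range(iters):
            # M-update: soft-threshold off-diagonals of the unpenalized
            # minimizer of the smooth terms.
            A = (Sigma_hat + rho * (Z - U)) / (1 + rho)
            thr = lam / (1 + rho)
            M = A.copy()
            M[off] = np.sign(A[off]) * np.maximum(np.abs(A[off]) - thr, 0)
            # Z-update: project M + U onto {lambda_min >= eps}.
            w, V = np.linalg.eigh(M + U)
            Z = (V * np.maximum(w, eps)) @ V.T
            U += M - Z
        return Z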

Here we focus on using the gQNE estimator for generating the positive semidefinite matrix Σ̂. For estimating the covariance instead of the scatter matrix, we need an extra efficient estimator of the scale parameter c^{Q;r}. Here c^{Q;r} is described in Theorem V.2. When the exact distribution of the pair-elliptical is known, the value c^{Q;r} can be theoretically calculated using (V.3). On the other hand, when the exact distribution is unknown, we note that estimating c^{Q;r} is equivalent to estimating the marginal standard deviation of at least one entry X_j of X. The proposed procedure then is in four steps:

1. Calculate the positive semidefinite matrix R̃^{Q;r}:

R̃^{Q;r} := arg min_{R⪰0} ||R − R̂^{Q;r}||_max.   (VII.2)

Equation (VII.2) can be solved by a matrix maximum norm projection algorithm. See Supplementary Material Section K for details.

2. Choose the j-th entry with the lowest empirical fourth centered moment and use the sample standard deviation σ̂_j to estimate σ(X_j).

3. Estimate c^{Q;r} by R̂^{Q;r}_{jj}/σ̂_j², and estimate Σ by

Σ̂^{Q;r} = (σ̂_j²/R̂^{Q;r}_{jj}) · R̃^{Q;r}.   (VII.3)

4. Produce the final sparse matrix estimator Σ̃^{Q;r} by plugging Σ̂^{Q;r} into (VII.1):

Σ̃^{Q;r} = arg min_{M=M^T} (1/2)||Σ̂^{Q;r} − M||_F² + λ Σ_{j≠k} |M_{j,k}|, s.t. λ_d(M) ≥ ε,

where we prefix ε to be 10^{−5} as recommended in (56).

We then provide the statistical properties of the quantile-based estimator Σ̃^{Q;r} as follows.

Corollary VII.1. Suppose that X_1, ..., X_n are n independent observations of X ∈ M^{COV−Q}(Σ; s) and the assumptions in Theorem V.6 hold. We denote

ζ := max{ (4/η_2²)(√((4 log d + log(1/α))/n) + 1/n)², (4√(||c^{Q;r}Σ||_max)/η_2)(√((4 log d + log(1/α))/n) + 1/n) },

and

ζ_j := max{ (2/η_2²)(√(log(1/α)/n) + 1/n)², (2√(c^{Q;r}Σ_{jj})/η_2)(√(log(1/α)/n) + 1/n) }.

(i) We have, with probability no smaller than 1 − 8α,

||R̃^{Q;r} − R^{Q;r}||_max ≤ ζ.   (VII.4)

(ii) If EX_j^4 ≤ K for some absolute constant K, then there exists an absolute constant c_1, only depending on K, such that, when n is large enough, with probability no smaller than 1 − n^{−2ξ} − 12α,

||Σ̂^{Q;r} − Σ||_max ≤ ( c_1 n^{−1/2+ξ}/(R^{Q;r}_{jj} − ζ_j) + ζ_j/((R^{Q;r}_{jj} − ζ_j) c^{Q;r}) ) · (||R^{Q;r}||_max + ζ) + ζ/c^{Q;r}.   (VII.5)

(iii) If λ takes the same value as the right-hand side of (VII.5), then with probability no smaller than 1 − n^{−2ξ} − 12α, we have

||Σ̃^{Q;r} − Σ||_F ≤ 5λ(s + d)^{1/2}.

In the following, for notational simplicity, we denote

ψ(r, j, ξ, α) := ( c_1 n^{−1/2+ξ}/(R^{Q;r}_{jj} − ζ_j) + ζ_j/((R^{Q;r}_{jj} − ζ_j) c^{Q;r}) ) · (||R^{Q;r}||_max + ζ) + ζ/c^{Q;r}.   (VII.6)

Corollary VII.1 shows that, when η_1, c^{Q;r}, ||Σ||_max, and min_j Σ_{jj} are upper and lower bounded by positive absolute constants, via picking λ > ψ(r, j, ξ, α) = O(√(log d/n)), Σ̃^{Q;r} approximates Σ at the rate

||Σ̃^{Q;r} − Σ||_F = O_P(√((s + d) log d/n)),

which is the minimax optimal rate shown in (58) and the parametric rate obtained in (56).

Remark VII.2. Following the steps introduced in this section, we can similarly calculate R̃^{M;r}, Σ̂^{M;r}, and Σ̃^{M;r} for the gMAD estimator R̂^{M;r}.

Remark VII.3. For handling heavy tailed data, the quantile-based and rank-based methods are intrinsically different. The rank-based estimator first calculates the correlation matrix, and then estimates the d marginal variances via employing Catoni's M-estimator. However, for the quantile-based estimator, as (VII.3) shows, we only need to estimate one marginal variance. Moreover, the tuning in Catoni's M-estimator is unnecessary.

B. Inverse Covariance Matrix Estimation

This section studies estimating the inverse covariance matrix. Suppose X_1, X_2, ..., X_n are i.i.d. copies of X ∈ R^d with covariance matrix Σ. We are interested in estimating the precision matrix Θ := Σ^{−1}. The precision matrix is closely related to graphical models and has been widely studied in the literature: (59) propose a neighborhood pursuit method; (60), (61), (62), and (63) apply penalized likelihood methods to obtain the estimators; (64) and (40) propose the graphical Dantzig selector and the CLIME estimator; (10) and (9; 14) generalize the Gaussian graphical model to the nonparanormal distribution, and (14) further generalize it to the transelliptical distribution.

In this section, we plug the covariance estimator Σ̂^{Q;r} in (VII.3) into the formulation of CLIME proposed by (40) and obtain the precision matrix estimator:

Θ̂^{Q;r} = arg min_{Θ∈R^{d×d}} Σ_{j,k} |Θ_{jk}|, subject to ||Σ̂^{Q;r}Θ − I_d||_max ≤ λ.   (VII.7)

The CLIME estimator in (VII.7) can be reformulated as linear programming (40). Let A be a symmetric matrix. For q ∈ [0, 1), we define ||A||_{q,∞} := max_i Σ_j |A_{ij}|^q. Considering the set

S_d(q, s, M) := {Θ: ||Θ||_{1,∞} ≤ M and ||Θ||_{q,∞} ≤ s},

the following result gives the estimation accuracy of the precision matrix.

Corollary VII.4. Assume that Θ = Σ^{−1} ∈ S_d(q, s, M) for some q ∈ [0, 1) and the tuning parameter in (VII.7) satisfies

λ ≥ ||Θ||_{1,∞} ψ(r, j, ξ, α),

where ψ(r, j, ξ, α) is defined in (VII.6). Then there exist constants C_1, C_2 such that

||Θ̂^{Q;r} − Θ||_2 ≤ C_1 s λ^{1−q}, ||Θ̂^{Q;r} − Θ||_max ≤ ||Θ||_{1,∞} λ, and (1/d)||Θ̂^{Q;r} − Θ||_F² ≤ C_2 s λ^{2−q}.

If η_1, c^{Q;r}, ||Σ||_max, and min_j Σ_{jj} are all upper and lower bounded by absolute constants, we choose λ = O(√(log d/n)), and Corollary VII.4 gives us the rates

||Θ̂^{Q;r} − Θ||_2 = O_P(s (log d/n)^{(1−q)/2}), ||Θ̂^{Q;r} − Θ||_max = O_P(√(log d/n)), and (1/d)||Θ̂^{Q;r} − Θ||_F² = O_P(s (log d/n)^{(2−q)/2}).

According to the minimax estimation rate of precision matrix estimators established in (65), the estimation rates in terms of all the above three matrix norms are optimal.
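For reference, the linear-programming reformulation of (VII.7) can be solved column by column: writing θ = u − v with u, v ≥ 0 turns each column into a standard LP. A sketch assuming scipy.optimize.linprog; the final symmetrization by averaging is a simplification of CLIME's smaller-magnitude rule:

    import numpy as np
    from scipy.optimize import linprog

    def clime(Sigma_hat, lam):
        # For each basis vector e_i, solve
        #   min ||theta||_1  s.t.  ||Sigma_hat @ theta - e_i||_inf <= lam.
        d = Sigma_hat.shape[0]
        c = np.ones(2 * d)                      # objective: sum(u) + sum(v)
        S = np.hstack([Sigma_hat, -Sigma_hat])  # Sigma_hat @ (u - v)
        Theta = np.zeros((d, d))
        for i in range(d):
            e = np.zeros(d); e[i] = 1.0
            A_ub = np.vstack([S, -S])           # |S(u-v) - e| <= lam
            b_ub = np.concatenate([lam + e, lam - e])
            res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
            Theta[:, i] = res.x[:d] - res.x[d:]
        return (Theta + Theta.T) / 2            # simple symmetrization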

C. Sparse Principal Component Analysis

In this section we consider conducting principal component analysis (PCA) and sparse PCA. Recall that in (sparse) principal component analysis, we have

X_1, X_2, ..., X_n i.i.d. ∼ X ∈ R^d with covariance matrix Σ,

and our target is to estimate the eigenspace spanned by the m leading eigenvectors u_1, ..., u_m of Σ. The conventional PCA uses the leading eigenvectors of the sample covariance matrix for estimation. In high dimensions, where d can be much larger than n, a sparsity constraint on u_1, ..., u_m is sometimes recommended (66), motivating a series of methods referred to as sparse PCA.

In this section, let U_m := (u_1, ..., u_m) ∈ R^{d×m} represent the combination of eigenvectors of interest. We aim to estimate the projection matrix Π_m := U_m U_m^T. The model we focus on is:

M^{PCA−Q}(Σ; s, m) := {X ∈ M^Q(Σ): Σ_{j=1}^d I(Σ_{k=1}^m u_{kj}² ≠ 0) ≤ s, λ_m(Σ) − λ_{m+1}(Σ) > 0},

where u_{kj} is the j-th entry of u_k and λ_m(Σ) is the m-th largest eigenvalue of Σ. Motivated by the above model, via exploiting the gQNE estimator, we propose the quantile-based (sparse) PCA estimator (Q-PCA) as the optimum of the following Fantope projection problem (67):

Π̂^{Q;r}_m = arg max_{Π∈R^{d×d}} Tr(Π^T R̂^{Q;r}) − λ||Π||_{1,1}, subject to 0 ⪯ Π ⪯ I and Tr(Π) = m,   (VII.8)

where ||Π||_{1,1} = Σ_{1≤j,k≤d} |Π_{jk}| and A ⪯ B means that B − A is a positive semidefinite matrix. Intrinsically Π̂^{Q;r}_m is an estimator of the eigenspace spanned by the m leading eigenvectors of R^{Q;r}. Because the scatter matrix shares the same eigenspace with the covariance matrix, under the model M^{PCA−Q}(Σ; s, m), Π̂^{Q;r}_m is also an estimator of Π_m. We then have the following corollary, stating that the Q-PCA estimator achieves the parametric rate of convergence in estimating Π_m under the model M^{PCA−Q}(Σ; s, m).

Corollary VII.5. Suppose that X_1, ..., X_n are n independent observations of X ∈ M^{PCA−Q}(Σ; s, m) and the assumptions in Theorem V.6 hold. If the tuning parameter in (VII.8) satisfies λ ≥ ψ(r, j, ξ, α) (recalling that ψ(r, j, ξ, α) is defined in (VII.6)), we have

||Π̂^{Q;r}_m − Π_m||_F ≤ 4sλ/(λ_m(Σ) − λ_{m+1}(Σ)).

Corollary VII.5 shows that, under appropriate conditions, the convergence rate of the Q-PCA estimator is O_P(s√(log d/n)), which is the parametric rate in (67).

D. Discriminant Analysis

In this section, we consider linear discriminant analysis for conducting high dimensional classification (68; 69; 70; 71; 72; 12). Let the data points (X_1, Y_1), ..., (X_n, Y_n) be independently drawn from a joint distribution of (X, Y), where X ∈ R^d and Y ∈ {1, 2} is the binary label. We denote I_1 = {i: Y_i = 1}, I_2 = {i: Y_i = 2}, and n_1 = |I_1|, n_2 = |I_2|. Define π = P(Y = 1), µ_1 = E(X | Y = 1), µ_2 = E(X | Y = 2), and Σ = Cov(X | Y = 1) = Cov(X | Y = 2). If the classifier is defined as h(x) = I(f(x) < c) + 1 for some function f and constant c, we measure the quality of classification by employing the Rayleigh quotient (17):

Rq(f) = Var{E[f(X) | Y]} / Var{f(X) − E[f(X) | Y]}.

For linear functions f(x) = β^T x + c, the Rayleigh quotient has the formulation

Rq(β) = π(1 − π) [β^T(µ_1 − µ_2)]² / (β^T Σ β).

The Rayleigh quotient is maximized when β = β* = Σ^{−1}(µ_1 − µ_2). When X | (Y = 1) and X | (Y = 2) are multivariate Gaussian, this matches Fisher's linear discriminant rule h_F(x) = I(x^T β* < c*) + 1, where c* = (µ_1 + µ_2)^T β*/2. In order to estimate β* and c*, we apply the estimator proposed in (71):

β̂^{Q;r} = arg min_{β∈R^d} ||β||_1, subject to ||Σ̂^{Q;r}β − (µ̂_1 − µ̂_2)||_max ≤ λ,   (VII.9)

where µ̂_1, µ̂_2 are some estimators of µ_1, µ_2.

In the following, we suggest two kinds of estimators of the mean vectors. For simplicity, we only describe the estimator for µ_1; similar procedures can also be applied to estimate µ_2. Suppose that X_1, ..., X_{n_1} correspond to Y = 1. The first estimator of µ_1 is the sample median µ̂_M = (µ̂_{M1}, ..., µ̂_{Md})^T, where

µ̂_{Mj} = median({X_{1j}, ..., X_{n_1 j}}).

Due to Hoeffding's inequality, we can derive that there exists some constant c_m such that P(|µ̂_{Mj} − µ_{1j}| > c_m √(log(1/δ)/n)) ≤ δ for any j = 1, ..., d.

(19) proposes an alternative M-estimator for µ_1. Considering a strictly increasing function h such that −log(1 − y + y²/2) ≤ h(y) ≤ log(1 + y + y²/2) and some δ ∈ (0, 1) satisfying n_1 > 2 log(1/δ) and v ≥ max{σ_1², ..., σ_d²}, we define

α_δ² = 2 log(1/δ) / [ n_1 ( v + 2v log(1/δ)/(n_1 − 2 log(1/δ)) ) ].

The estimator µ̂_C = (µ̂_{C1}, ..., µ̂_{Cd})^T is obtained by solving the equations

Σ_{i=1}^{n_1} h(α_δ(X_{ij} − µ̂_{Cj})) = 0, j = 1, ..., d.

It is shown in (19) that, for each j = 1, ..., d,

P( |µ̂_{Cj} − µ_{1j}| ≥ √( 2v log(1/δ)/(n_1 − 2 log(1/δ)) ) ) ≤ δ.
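Catoni's estimating equation is monotone in µ, so it can be solved by bisection. A sketch assuming numpy, using the narrowest influence function compatible with the stated bracketing of h:

    import numpy as np

    def catoni_mean(x, v, delta):
        # Solve sum_i h(alpha_delta * (x_i - mu)) = 0 with
        # h(y) = sgn(y) * log(1 + |y| + y^2/2), which attains the bounds
        # -log(1 - y + y^2/2) <= h(y) <= log(1 + y + y^2/2).
        n = len(x)
        L = np.log(1 / delta)
        alpha = np.sqrt(2 * L / (n * (v + 2 * v * L / (n - 2 * L))))
        h = lambda y: np.sign(y) * np.log1p(np.abs(y) + 0.5 * y ** 2)
        f = lambda mu: h(alpha * (x - mu)).sum()   # decreasing in mu
        lo, hi = x.min(), x.max()                  # f(lo) >= 0 >= f(hi)
        for _ in range(100):
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
        return (lo + hi) / 2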

By the union bound, for both estimators we have

||µ̂_M − µ_1||_∞ ∨ ||µ̂_C − µ_1||_∞ = O_P(√(log d/n)).   (VII.10)

Therefore, we have the following result on how well the Rayleigh quotient of the estimated classifier approximates the optimal one.

Corollary VII.6. We assume that n_1 ≍ n_2 and ∆_d = (µ_1 − µ_2)^T β* ≥ M for some constant M > 0. If we choose the tuning parameter in (VII.9) such that λ ≥ ||µ̂_1 − µ̂_2 − µ_1 + µ_2||_∞ + ψ(r, j, ξ, α)||β*||_1, we have

Rq(β̂^{Q;r})/Rq(β*) ≥ 1 − 2M^{−1}||β*||_1² ψ(r, j, ξ, α) − 20M^{−1}λ||β*||_1.

According to the rate in (VII.10), if η_1, c^{Q;r}, ||Σ||_max, and min_j Σ_{jj} are universally bounded from above and below by positive absolute constants and ||β*||_1 is also upper bounded by a constant independent of d and n, we can choose the tuning parameter λ = O(√(log d/n)) and obtain through Corollary VII.6 the rate

Rq(β̂^{Q;r})/Rq(β*) ≥ 1 − O(√(log d/n)).

The above result matches the parametric rate in (71).

TABLE I
COMPARISON OF THE GMAD, GQNE, PEARSON'S, KENDALL'S TAU, AND SPEARMAN'S RHO COVARIANCE ESTIMATORS COMBINED WITH THE CLIME AND SPARSE PCA (SPCA) ALGORITHMS. THE ESTIMATION ERRORS OF CLIME ESTIMATORS ARE IN TERMS OF FROBENIUS NORMS. THE ESTIMATION ERRORS OF SPCA ESTIMATORS ARE IN TERMS OF ||Π̂_1 − Π_1||_F. THE ERRORS ARE AVERAGED OVER 100 REPETITIONS WITH STANDARD DEVIATIONS IN THE PARENTHESES.

                          Scheme 1                        Scheme 2
          d          100       200       500        100       200       500
  CLIME   gMAD      5.47      6.29      8.82       7.73      9.06     11.09
                   (0.22)    (0.28)    (0.60)     (0.64)    (0.80)    (0.92)
          gQNE      4.75      5.42      7.53       7.58      8.76     10.95
                   (0.20)    (0.19)    (0.22)     (0.68)    (0.79)    (1.03)
          Pearson   4.75      5.38      7.39       7.77      8.95     11.37
                   (0.22)    (0.22)    (0.26)     (0.82)    (1.05)    (1.45)
          Kendall   5.38      6.35      8.84       8.47     10.01     13.34
                   (0.23)    (0.25)    (0.29)     (0.55)    (0.62)    (0.66)
          Spearman  4.86      5.49      7.60       7.99      9.17     11.33
                   (0.20)    (0.20)    (0.30)     (0.64)    (0.78)    (1.04)
  SPCA    gMAD      0.35      0.41      0.48       0.48      0.53      0.60
                   (0.23)    (0.27)    (0.26)     (0.25)    (0.25)    (0.27)
          gQNE      0.19      0.25      0.34       0.37      0.43      0.47
                   (0.16)    (0.24)    (0.26)     (0.26)    (0.26)    (0.29)
          Pearson   0.17      0.20      0.31       0.78      0.80      0.91
                   (0.14)    (0.20)    (0.28)     (0.27)    (0.23)    (0.16)
          Kendall   0.20      0.23      0.32       0.45      0.53      0.58
                   (0.15)    (0.20)    (0.28)     (0.30)    (0.28)    (0.30)
          Spearman  0.21      0.25      0.34       0.46      0.54      0.57
                   (0.16)    (0.23)    (0.28)     (0.32)    (0.28)    (0.29)

VIII. EXPERIMENTS

This section compares the empirical performance of the quantile-based estimators to those based on the Pearson, Kendall's tau, and Spearman's rho covariance estimators on both synthetic and real data. For the synthetic data analysis, inverse covariance estimation and sparse PCA reflect the quality of matrix estimation well, so these two methods are the main focus for the synthetic data. For the real data, on the other hand, we aim to apply our methods to the classification of genomic data.

A. Synthetic Data

This section focuses on conducting inverse covariance matrix estimation and sparse PCA. Let {σ̂_1, ..., σ̂_d} be the sample standard deviations of X. Let Ŝ^P, Ŝ^K, and Ŝ^S be the Pearson's, Kendall's tau, and Spearman's rho sample correlation matrices. We compare Σ̂^{Q;r} and Σ̂^{M;r} introduced in Section VII-A^4 to three covariance matrix estimators:

1) Pearson's sample covariance matrix: Σ̂^P_{ij} = σ̂_i σ̂_j Ŝ^P_{ij}, for i, j = 1, ..., d;
2) Kendall's tau covariance matrix: Σ̂^K_{ij} = σ̂_i σ̂_j Ŝ^K_{ij}, for i, j = 1, ..., d;
3) Spearman's rho covariance matrix: Σ̂^S_{ij} = σ̂_i σ̂_j Ŝ^S_{ij}, for i, j = 1, ..., d.

^4 Here the parameter r is set as in Section VI with δ_1 and δ_2 set to 0.25 and 0.75.

Moreover, given a covariance matrix Σ, we consider the following schemes for data generation:

• Scheme 1: X_1, ..., X_n are i.i.d. copies of X ∼ N_d(0, Σ);
• Scheme 2: X_1, ..., X_n are i.i.d. copies of X following the multivariate t-distribution with 5 degrees of freedom and Cov(X) = Σ.

A data-generation sketch is given after this subsection. In the following, we plug the five covariance matrix estimators into the inverse covariance estimation procedure discussed in Section VII-B and the sparse PCA procedure in Section VII-C.

1) Inverse Covariance Matrix Estimation: We consider the numerical performance using the setting in (40). Let Ω = (ω_{ij})_{1≤i,j≤d} := Σ^{−1} represent the inverse covariance matrix. We consider Ω = B + δI_d. Each off-diagonal entry of B is independent and takes the value 0.5 with probability 0.1 and the value 0 with probability 0.9. The value δ is chosen such that the condition number of Ω is d. The diagonal of Σ is renormalized to ones.

We generate the data under the three schemes with dimensions d = 90, 120, 200, sample size n = 100, and 100 repetitions. We measure the estimation error by the Frobenius norm ||Σ̂ − Σ||_F. The numerical results are presented in Table I.

2) Sparse PCA: We use the setting in (13) to investigate the numerical performance of the sparse principal component analysis. We consider the spike covariance Σ = λ_1 v_1 v_1^T + λ_2 v_2 v_2^T + I_d, where λ_1 = 5 and λ_2 = 2, v_{1j} = 1/√10 for 1 ≤ j ≤ 10 and v_{1j} = 0 for j > 10, and v_{2j} = 1/√10 for 11 ≤ j ≤ 20 and v_{2j} = 0 otherwise. The sample size is n = 50 for d = 90, 120, 200, with 100 repetitions. We measure the difference between the true projection matrix Π_1 = v_1 v_1^T and its estimator Π̂_1 obtained through (VII.8) by the Frobenius norm ||Π̂_1 − Π_1||_F. The results are given in Table I.
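The schemes above (and the spike covariance of the sparse PCA experiment) can be generated as follows, assuming numpy; note that a multivariate t with Cov(X) = Σ and ν degrees of freedom must be built from the scale matrix Σ(ν − 2)/ν:

    import numpy as np

    def spike_covariance(d):
        # Sigma = 5*v1 v1^T + 2*v2 v2^T + I_d with supports {1..10}, {11..20}.
        v1 = np.zeros(d); v1[:10] = 1 / np.sqrt(10)
        v2 = np.zeros(d); v2[10:20] = 1 / np.sqrt(10)
        return 5 * np.outer(v1, v1) + 2 * np.outer(v2, v2) + np.eye(d)

    def draw(n, Sigma, scheme, df=5, seed=0):
        rng = np.random.default_rng(seed)
        d = Sigma.shape[0]
        if scheme == 1:                       # Gaussian
            return rng.multivariate_normal(np.zeros(d), Sigma, size=n)
        # Scheme 2: multivariate t; Cov(t) = scale * df/(df - 2), so use
        # scale = Sigma * (df - 2)/df to get Cov(X) = Sigma.
        Z = rng.multivariate_normal(np.zeros(d), Sigma * (df - 2) / df, size=n)
        W = rng.chisquare(df, size=n)
        return Z * np.sqrt(df / W)[:, None]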

3) Summary of Synthetic Data Results: From the numerical results summarized in Table I, the gMAD and gQNE covariance estimators perform better than the other estimators for both CLIME and sparse PCA under Scheme 2 and Scheme 3. This implies that gMAD and gQNE have better performance in studying heavy-tailed distributions and contaminated data. This is due to the advantage that only one variance is required to be estimated in (VII.3), while the other covariance estimators demand variance estimators for $d$ covariates. Moreover, gQNE outperforms the gMAD estimator, which matches the assertion that gQNE is more efficient than gMAD. Under Scheme 1, since the synthetic data follow a Gaussian distribution, the sample covariance estimator is efficient and outperforms the gMAD and gQNE covariance estimators. However, their performance is still comparable even under this scheme.

B. Genomic Data Analysis

In this section, we apply the method introduced in Section VII-D to the large-scale microarray dataset GPL96 (73). The dataset contains 20,263 probes with 8,124 samples belonging to 309 tissue types. The probes correspond to 12,679 genes, and we average the probes from the same genes. The tissue types include prostate tumor (148 samples), B cell lymphoma (213 samples), and many others. The target of our real data analysis is to compare the performance of classifying the prostate tumor versus the B cell lymphoma by using the gMAD, gQNE, and other estimators. We only focus on the top 1,000 genes with the largest p-values in the marginal two-sample t-test.

First, we consider the goodness of fit test of the pair-elliptical. Our goal is to select genes such that their corresponding data from both tissue types are approximately pair-elliptical, separately. Then we apply the linear discriminative classifier in (VII.9) to the selected genes for classification.

Specifically, let $\{X_i\}_{i=1}^n$ be the dataset corresponding to the prostate tumor. We employ the goodness of fit test proposed in Section III-C by calculating the statistic $Z(\{(X_{ij}, X_{ik})\})$ for each pair $1 \le j \ne k \le d$. The Holm's adjusted p-values $\{\pi^H_{jk}\}_{j,k=1}^d$ are evaluated according to (III.5); a sketch of this standard adjustment is given at the end of this subsection. We delete any gene $j$ if there exists another gene $k$ such that the adjusted p-value $\pi^H_{jk} < \alpha = 0.05$. The remaining genes are the selected genes for the prostate tumor. The same procedure is also applied to the dataset of the B cell lymphoma. The final screened genes are the intersection of the selected genes for the two categories. In the end, the number of selected genes is 225.

Figure 5 illustrates the results of the above procedure corresponding to the prostate tumor. In detail, Figure 5(A) reports the histogram of the test statistics $Z(\{(X_{ij}, X_{ik})\})$ for all pairs of genes corresponding to the prostate tumor. The curve in the figure is the density of the $\chi^2$ distribution with $[\sqrt{n}] - 1 = 11$ degrees of freedom. Figure 5(A) shows that the empirical distribution of the statistics is close to the $\chi^2$ distribution. Figure 5(B) illustrates the histogram of the Holm's adjusted p-values for all pairs of genes. Figure 5(B) shows that we do not reject the goodness of fit test of the pair-elliptical for most pairs of genes in the dataset. We select the pair of genes with the largest p-value and denote the pair as $(j^*, k^*)$. Figure 5(C) reports the estimated density function of the t-statistics from the selected pair, $\{t((X_{ij^*}, X_{ik^*}))\}_{i=1}^n$, versus the density of the t-distribution with degree of freedom 1. The density function is estimated by a kernel density estimator with bandwidth $h = 0.4$. We see that these two distributions are close to each other.

Secondly, we plug the samples of these selected genes into the linear discriminative classifier in (VII.9). To compare the classification errors of the different covariance estimators, each time we randomly select 74 samples from the prostate tumor and 74 from the B cell lymphoma as the training dataset. For the training dataset, we divide each category into two parts randomly: one is used to derive the classifiers from the different covariance estimators by (VII.9), and the other is used to tune the parameters of the first part by minimizing its classification error. The remaining samples are used as the testing dataset to calculate the final classification error. The above steps are repeated 100 times. We summarize the classification errors in Table II. The errors reported in Table II demonstrate that the gMAD and gQNE estimators significantly outperform the other estimators on the GPL96 dataset. This indicates the power of quantile-based statistics in high dimensions.

TABLE II
MEANS (STANDARD DEVIATIONS IN THE PARENTHESES) OF CLASSIFICATION ERRORS ON GPL96 FOR PROSTATE TUMOR AND B CELL LYMPHOMA BY APPLYING THE GMAD, GQNE, PEARSON'S, KENDALL'S TAU AND SPEARMAN'S RHO COVARIANCE ESTIMATORS TO THE LINEAR DISCRIMINATIVE CLASSIFIER WITH 100 REPLICATIONS.

  gMAD      gQNE      Pearson   Kendall   Spearman
  0.055     0.059     0.134     0.099     0.098
  (0.003)   (0.004)   (0.003)   (0.007)   (0.005)
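The Holm step-down adjustment used in the screening above is the standard one (25); a compact sketch (ours, with hypothetical function name) is:

    import numpy as np

    def holm_adjust(pvals):
        # Holm's step-down adjusted p-values (25): for p_(1) <= ... <= p_(m),
        # p^H_(i) = max_{k <= i} min{1, (m - k + 1) p_(k)}.
        p = np.asarray(pvals, dtype=float)
        m = p.size
        order = np.argsort(p)
        adj = np.minimum(1.0, (m - np.arange(m)) * p[order])
        adj = np.maximum.accumulate(adj)     # enforce step-down monotonicity
        out = np.empty(m)
        out[order] = adj
        return out

A gene $j$ is then dropped whenever some pair $(j, k)$ has adjusted p-value below $\alpha = 0.05$, as described above.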
IX. DISCUSSION

This paper studies the estimation of large scatter matrices for high dimensional data with possibly very heavy tails. We propose a new distribution family, the pair-elliptical, for modeling such data. The pair-elliptical is more flexible than the elliptical. We also characterize the relation between the pair-elliptical and several popular distribution families in high dimensions. Built on the pair-elliptical family, we advocate using the quantile-based approaches for estimating large scatter and covariance matrices. These procedures are both statistically efficient and robust compared to their Gaussian alternatives.

In the future, it is of interest to investigate the performance of the quantile-based methods when the data are not independent. Studies of moment-based and rank-based approaches under high dimensional dependent data include (74) and (75). However, their proof techniques cannot be directly applied to analyze the quantile-based estimators.

All the results in this paper are confined to parameter estimation and focus on nonasymptotic analysis. In the future, we are also interested in studying the asymptotic properties of the quantile-based estimators under the regime that both $d$ and $n$ go to infinity.

[Fig. 5 here: (A) Histogram of $Z(\{(X_{ij}, X_{ik})\})$; (B) Histogram of $\{\pi^H_{jk}\}_{j,k=1}^d$; (C) Density of $\{t((X_{ij}, X_{ik}))\}_{i=1}^n$ v.s. $t_1$.]

Fig. 5. (A) The histogram of $Z(\{(X_{ij}, X_{ik})\})$ for the pairs of genes for the prostate tumor; the curve is the density function of $\chi^2_{11}$. (B) The histogram of the Holm's adjusted p-values for the prostate tumor; the red dashed line is the significance level $\alpha = 5\%$. (C) The estimated density of $\{t((X_{ij^*}, X_{ik^*}))\}_{i=1}^n$ (black solid line) versus the density of $t_1$ (red dashed line) for the pair of genes with the largest p-value of the statistic $Z(\{(X_{ij}, X_{ik})\})$.

APPENDIX

Lemma A.1. For any continuous random variable $X \in \mathbb{R}$ with distribution function $F$ and any $n$ independent observations $X_1, \ldots, X_n$ of $X$, we have for any $t > 0$,
$$\mathbb{P}\big(|\hat Q(\{X_i\}; r) - Q(X; r)| \ge t\big) \le \exp\big(-2n[F\{F^{-1}(r) + t\} - r - n^{-1}]^2\big) + \exp\big(-2n[r - F\{F^{-1}(r) - t\}]^2\big),$$
whenever $F\{F^{-1}(r) + t\} > r + n^{-1}$.

Proof. Let $F_n$ denote the empirical distribution of $X_1, \ldots, X_n$ and $F_n^{-1}(r) := \hat Q(\{X_i\}; r)$. By the definition of $\hat Q(\cdot;\cdot)$ in (IV.1), we have for any $r \in [0, 1]$,
$$r \le F_n(F_n^{-1}(r)) \le r + \frac{1}{n}.$$
This implies that
$$\mathbb{P}\big\{\hat Q(\{X_i\}; r) - Q(X; r) \ge t\big\} = \mathbb{P}\big\{F_n^{-1}(r) - F^{-1}(r) \ge t\big\} \le \mathbb{P}\Big[r + \frac{1}{n} \ge F_n\{t + F^{-1}(r)\}\Big] = \mathbb{P}\Big[\frac{1}{n}\sum_{i=1}^n I\{X_i \le F^{-1}(r) + t\} \le r + \frac{1}{n}\Big].$$
We have
$$\mathbb{P}\Big[\frac{1}{n}\sum_{i=1}^n I\{X_i \le F^{-1}(r) + t\} \le r + \frac{1}{n}\Big] = \mathbb{P}\Big[\frac{1}{n}\sum_{i=1}^n I\{X_i \le F^{-1}(r) + t\} - F\{F^{-1}(r) + t\} \le r + \frac{1}{n} - F\{F^{-1}(r) + t\}\Big].$$
Because $\mathbb{E}I\{X_i \le F^{-1}(r) + t\} = \mathbb{P}(X \le F^{-1}(r) + t) = F\{F^{-1}(r) + t\}$ and $I\{X_i \le F^{-1}(r) + t\} \in [0, 1]$ is a bounded random variable, by Hoeffding's inequality we have
$$\mathbb{P}\big\{\hat Q(\{X_i\}; r) - Q(X; r) > t\big\} \le \exp\big(-2n[F\{F^{-1}(r) + t\} - r - n^{-1}]^2\big), \tag{A.1}$$
as long as $t$ is large enough such that $F\{F^{-1}(r) + t\} > r + n^{-1}$.

On the other hand, we have
$$\mathbb{P}\big\{\hat Q(\{X_i\}; r) - Q(X; r) \le -t\big\} = \mathbb{P}\big\{F_n^{-1}(r) - F^{-1}(r) \le -t\big\} \le \mathbb{P}\big[r \le F_n\{F^{-1}(r) - t\}\big] = \mathbb{P}\Big[\frac{1}{n}\sum_{i=1}^n I\{X_i \le F^{-1}(r) - t\} \ge r\Big].$$
Then again, exploiting Hoeffding's inequality as above, we have
$$\mathbb{P}\big\{\hat Q(\{X_i\}; r) - Q(X; r) \le -t\big\} \le \exp\big(-2n[r - F\{F^{-1}(r) - t\}]^2\big). \tag{A.2}$$
Combining (A.1) and (A.2), we have the desired result.
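As a quick sanity check of Lemma A.1 (our illustration only; the sample size, level $r$, deviation $t$, and repetition count are arbitrary choices), one can compare the Monte Carlo deviation probability of the empirical quantile of a standard Gaussian with the sum of the two Hoeffding terms:

    import numpy as np
    from statistics import NormalDist

    nd = NormalDist()
    n, r, t, reps = 500, 0.75, 0.15, 5000
    rng = np.random.default_rng(0)

    def q_hat(x, r):
        # Q_hat({x_i}; r): the ceil(n*r)-th order statistic.
        return np.sort(x)[int(np.ceil(x.size * r)) - 1]

    q = nd.inv_cdf(r)                        # population quantile F^{-1}(r)
    lhs = np.mean([abs(q_hat(rng.standard_normal(n), r) - q) >= t
                   for _ in range(reps)])
    rhs = (np.exp(-2 * n * (nd.cdf(q + t) - r - 1 / n) ** 2)
           + np.exp(-2 * n * (r - nd.cdf(q - t)) ** 2))
    print(f"Monte Carlo deviation probability {lhs:.4f} <= bound {rhs:.4f}")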

Lemma A.2 (gMAD concentration inequality). Let $X \in \mathbb{R}$ be a one dimensional continuous random variable. We denote by $F_1$ and $F_2$ the distribution functions of $X$ and $|X - Q(X; 1/2)|$. Then we have for any $t > 0$,
$$\mathbb{P}\big(|\hat\sigma^M(X; r) - \sigma^M(X; r)| > t\big) \le 2\exp\Big(-2n\big[F_1\{F_1^{-1}(1/2) + t/2\} - \tfrac12 - \tfrac1n\big]^2\Big) + 2\exp\Big(-2n\big[\tfrac12 - F_1\{F_1^{-1}(1/2) - t/2\}\big]^2\Big) + \exp\Big(-2n\big[F_2\{F_2^{-1}(r) + t/2\} - r - \tfrac1n\big]^2\Big) + \exp\Big(-2n\big[r - F_2\{F_2^{-1}(r) - t/2\}\big]^2\Big),$$
whenever $F_1\{F_1^{-1}(1/2) + t/2\} - 1/2 > 1/n$ and $F_2\{F_2^{-1}(r) + t/2\} - r > 1/n$.

Proof. By definition, we have for any $t > 0$,
$$\mathbb{P}\big(\hat\sigma^M(X; r) - \sigma^M(X; r) > t\big) = \mathbb{P}\Big(\hat Q\big(\big\{\big|X_i - \hat Q(\{X_i\}; \tfrac12)\big|\big\}; r\big) - Q\big(\big|X - Q(X; \tfrac12)\big|; r\big) > t\Big).$$
We denote $\nu := Q(X; 1/2)$ and $\hat\nu := \hat Q(\{X_i\}; 1/2)$. We then have
$$\mathbb{P}\big(\hat\sigma^M(X; r) - \sigma^M(X; r) > t\big) = \mathbb{P}\Big(\hat Q\big(\{|X_i - \hat\nu|\}; r\big) - Q\big(|X - \nu|; r\big) > t\Big) \le \underbrace{\mathbb{P}\Big(\hat Q\big(\{|X_i - \nu|\}; r\big) - Q\big(|X - \nu|; r\big) > \tfrac{t}{2}\Big)}_{A_1} + \underbrace{\mathbb{P}\big(|\hat\nu - \nu| > t/2\big)}_{B}.$$
On the other hand, we have
$$\mathbb{P}\big(\hat\sigma^M(X; r) - \sigma^M(X; r) < -t\big) = \mathbb{P}\Big(\hat Q\big(\{|X_i - \hat\nu|\}; r\big) - Q\big(|X - \nu|; r\big) < -t\Big) \le \underbrace{\mathbb{P}\Big(\hat Q\big(\{|X_i - \nu|\}; r\big) - Q\big(|X - \nu|; r\big) < -\tfrac{t}{2}\Big)}_{A_2} + \underbrace{\mathbb{P}\big(|\hat\nu - \nu| > t/2\big)}_{B}.$$
Using Lemma A.1, we have
$$B \le \exp\Big(-2n\big[F_1\{F_1^{-1}(1/2) + t/2\} - \tfrac12 - \tfrac1n\big]^2\Big) + \exp\Big(-2n\big[\tfrac12 - F_1\{F_1^{-1}(1/2) - t/2\}\big]^2\Big).$$
Moreover, we have
$$A_1 + A_2 = \mathbb{P}\Big(\big|\hat Q(\{|X_i - \nu|\}; r) - Q(|X - \nu|; r)\big| > \tfrac{t}{2}\Big) \le \exp\Big(-2n\big[F_2\{F_2^{-1}(r) + t/2\} - r - \tfrac1n\big]^2\Big) + \exp\Big(-2n\big[r - F_2\{F_2^{-1}(r) - t/2\}\big]^2\Big),$$
where we remind that $F_1$ and $F_2$ represent the distribution functions of $X$ and $|X - Q(X; 1/2)|$. Finally, using the fact that
$$\mathbb{P}\big(|\hat\sigma^M(X; r) - \sigma^M(X; r)| > t\big) \le A_1 + A_2 + 2B,$$
we have the desired result.

Lemma A.3 (gQNE concentration). Let $X \in \mathbb{R}$ be a one dimensional continuous random variable and $\tilde X$ be an independent copy of $X$. Let $G(\cdot)$ be the distribution function of $|X - \tilde X|$. We then have
$$\mathbb{P}\big(|\hat\sigma^Q(X; r) - \sigma^Q(X; r)| > t\big) \le \exp\big(-n[G\{G^{-1}(r) + t\} - r - n^{-1}]^2\big) + \exp\big(-n[r - G\{G^{-1}(r) - t\}]^2\big),$$
whenever $G\{G^{-1}(r) + t\} > r + n^{-1}$.

Proof. We denote $Z := |X - \tilde X|$. By definition, we have
$$\mathbb{P}\big(|\hat\sigma^Q(X; r) - \sigma^Q(X; r)| > t\big) = \mathbb{P}\big(|\hat Q(\{|X_i - X_j|\}_{i<j}; r) - Q(Z; r)| > t\big).$$
Similarly as in the proof of Lemma A.1, we have
$$\mathbb{P}\big(\hat Q(\{|X_i - X_j|\}_{i<j}; r) - Q(Z; r) > t\big) \le \mathbb{P}\Big[\frac{2}{n(n-1)}\sum_{i<j} I\{|X_i - X_j| \le G^{-1}(r) + t\} \le r + \frac1n\Big].$$
By Hoeffding's inequality for U-statistics (c.f., Equation (5.7) in (76)), we have
$$\mathbb{P}\big(\hat Q(\{|X_i - X_j|\}_{i<j}; r) - Q(Z; r) > t\big) \le \exp\big(-n[G\{G^{-1}(r) + t\} - r - n^{-1}]^2\big), \tag{A.3}$$
where we remind that $G$ represents the distribution function of $Z$. Similarly, we have
$$\mathbb{P}\big(\hat Q(\{|X_i - X_j|\}_{i<j}; r) - Q(Z; r) < -t\big) \le \exp\big(-n[r - G\{G^{-1}(r) - t\}]^2\big). \tag{A.4}$$
Combining (A.3) and (A.4), we have the desired result.

Proof of Proposition III.7. Consider a pair-elliptically distributed random vector $X = (X_1, \ldots, X_d)^T \sim PE_d(\mu, S, \xi_1)$. If $X$ is also transelliptically distributed, by definition there exists a set of univariate strictly increasing functions $f = \{f_j\}_{j=1}^d$ such that $f(X) := (f_1(X_1), \ldots, f_d(X_d))^T \sim EC_d(0, \Sigma^0, \xi_2)$ for some generating variable $\xi_2$. Because $X$ has the same Kendall's tau correlation matrix as $f(X)$, without loss of generality we can assume $S = \Sigma^0$. Accordingly, the margins of $X$ follow the same distribution. Combined with the fact that the margins of $f(X)$ are identically distributed, we have that the transformation functions satisfy $f_1 = f_2 = \cdots = f_d = f_0$. The desired result follows by considering the following two cases. If $|\Sigma^0_{jk}| = 1$ for all $j, k \in \{1, \ldots, d\}$, then we have $X_1 = (\pm)X_2 = \cdots = (\pm)X_d$ almost surely. This, combined with the fact that $X$ is pair-elliptically distributed, implies that $X$ is elliptically distributed. Otherwise, if there exists $j \ne k$ such that $|\Sigma^0_{jk}| \in (0, 1)$, then without loss of generality we assume $j = 1$, $k = 2$ and let $\rho := \Sigma^0_{12} \ne 0$. Because $(X_1, X_2)^T$ is elliptically distributed, we then have
$$X_2 \mid X_1 = x_1 \sim EC_1(\rho x_1, 1 - \rho^2, \xi_1'),$$
for some generating variable $\xi_1'$, which implies that
$$\mathrm{median}(X_2 \mid X_1 = x_1) = \rho x_1 \quad\text{and}\quad \mathrm{median}\big(f_0(X_2) \mid f_0(X_1) = f_0(x_1)\big) = f_0(\rho x_1). \tag{A.5}$$
On the other hand, because $(f_0(X_1), f_0(X_2))^T$ is elliptically distributed, we also have
$$f_0(X_2) \mid f_0(X_1) = f_0(x_1) \sim EC_1(\rho f_0(x_1), 1 - \rho^2, \xi_2'),$$
for some generating variable $\xi_2'$, which implies that
$$\mathrm{median}\big(f_0(X_2) \mid f_0(X_1) = f_0(x_1)\big) = \rho f_0(x_1). \tag{A.6}$$
Combining Equations (A.5) and (A.6), we have $f_0(\rho x_1) = \rho f_0(x_1)$ for all $x_1 \in \mathbb{R}$, implying that $f_0(x) = ax$ for some $a \in \mathbb{R}$. This further implies that $X \sim EC_d(0, a^2\Sigma^0, \xi_2)$ is elliptically distributed.

Proof of Proposition III.8. By definition, any pair in $Y$ satisfies
$$(Y_j, Y_k)^T \stackrel{D}{=} \mu_{\{j,k\}} + \xi_G AU \sim N_2(\mu_{\{j,k\}}, S_{\{j,k\},\{j,k\}}),$$
where $q = \mathrm{rank}(S_{\{j,k\},\{j,k\}})$, $A \in \mathbb{R}^{2\times q}$ with $AA^T = S_{\{j,k\},\{j,k\}}$, $U \in \mathbb{R}^q$ is uniformly distributed on $\mathbb{S}^{q-1}$, and $\xi_G^2$ follows a chi-square distribution with degree of freedom $q$. This further implies that any pair in $X$ satisfies
$$(X_j, X_k)^T = \xi(Y_j, Y_k)^T \stackrel{D}{=} \xi\cdot\xi_G\cdot AU,$$
and accordingly is elliptically distributed. This verifies that $X$ is pair-elliptically distributed.

Proof of Proposition III.9. Because the margin of a pair-normal is normally distributed and the only elliptical distribution that has normal margins is the Gaussian, the first assertion is true. We then prove the second assertion. Suppose that $Y = (Y_1, \ldots, Y_d)^T$ is pair-normally as well as nonparanormally distributed. Then for any $j \in \{1, \ldots, d\}$, $Y_j$ is normally distributed. Moreover, because $Y$ is nonparanormally distributed, we have $f_j(Y_j) \sim N(0, 1)$. This implies that $f_j$ is a linear transformation. Therefore, $Y$ is Gaussian distributed because $f_1, \ldots, f_d$ are linear.

B. Proof of Lemma IV.1

Proof. Since $\tilde R$ is the minimizer of (IV.4), we have
$$\|\tilde R - \hat R\| \le \|R - \hat R\|,$$
which implies
$$\mathbb{P}(\|\tilde R - R\| \ge t) \le \mathbb{P}(\|\tilde R - \hat R\| + \|\hat R - R\| \ge t) \le \mathbb{P}(\|\hat R - R\| \ge t/2).$$

C. Proof of Theorem II.1

Proof of Theorem II.1. Let $X_1, \ldots, X_n$ be $n$ independent observations of $X \in \mathbb{R}^d$ with $X_i = (X_{i1}, \ldots, X_{id})^T$. By Chebyshev's inequality, we have
$$\mathbb{P}(|X_{ij}| \ge t) \le (t/\|X_{ij}\|_{L_p})^{-p}.$$
Let's cut $X_{ij}$ into two parts:
$$Y_{ij} = X_{ij}I(|X_{ij}| \ge \epsilon) \quad\text{and}\quad Y^*_{ij} = X_{ij}I(|X_{ij}| \le \epsilon).$$
We have that $|Y^*_{ij} - \mathbb{E}Y^*_{ij}|$ is upper bounded by $2\epsilon$ and its variance satisfies
$$\mathrm{Var}(Y^*_{ij}) \le \mathbb{E}(Y^*_{ij})^2 \le \|X_{ij}\|^2_{L_2} \le \|X_{ij}\|^2_{L_p} = K^2.$$
Accordingly, by Bernstein's inequality, we have
$$\mathbb{P}\Big(\frac1n\sum_i (Y^*_{ij} - \mathbb{E}Y^*_{ij}) > t/2\Big) \le \exp\Big(-\frac{nt^2/8}{K^2 + \epsilon t/3}\Big).$$
For $Y_{ij}$, we have
$$\mathbb{P}(Y_{ij} \ne 0) = \mathbb{P}(|X_{ij}| \ge \epsilon) \le (\epsilon/K)^{-p},$$
and
$$|\mathbb{E}Y_{ij}| = \big|\mathbb{E}X_{ij}I(|X_{ij}| \ge \epsilon)\big| \le \frac{\mathbb{E}|X_{ij}|^p}{\epsilon^{p-1}} = \frac{K^p}{\epsilon^{p-1}}.$$
Accordingly, for any $\epsilon$ such that
$$K^p/\epsilon^{p-1} \le t/2, \tag{A.7}$$
we have for any $j \in \{1, \ldots, d\}$,
$$\mathbb{P}\Big(\Big|\frac1n\sum_i X_{ij} - \mu_j\Big| > t\Big) \le \mathbb{P}\Big(\Big|\frac1n\sum_i (Y^*_{ij} - \mathbb{E}Y^*_{ij})\Big| > t/2\Big) + \mathbb{P}\Big(\Big|\frac1n\sum_i (Y_{ij} - \mathbb{E}Y_{ij})\Big| > t/2\Big) \le 2\exp\Big(-\frac{nt^2/8}{K^2 + \epsilon t/3}\Big) + n\mathbb{P}(Y_{ij} \ne 0) \le 2\exp\Big(-\frac{nt^2/8}{K^2 + \epsilon t/3}\Big) + n(\epsilon/K)^{-p}.$$
This yields that
$$\mathbb{P}(\|\bar\mu - \mu\|_\infty > t) \le d\cdot\mathbb{P}\Big(\Big|\frac1n\sum_i X_{ij} - \mu_j\Big| > t\Big) \le d\cdot\Big[2\exp\Big(-\frac{nt^2/8}{K^2 + \epsilon t/3}\Big) + n(\epsilon/K)^{-p}\Big].$$
Taking $t = 12K\sqrt{\log d/n}$ and $\epsilon = K\sqrt{n/\log d}$, as long as $\log d/n = o(1)$ we have
$$2d\exp\Big(-\frac{nt^2/8}{K^2 + \epsilon t/3}\Big) \le 2d^{-2.5}.$$
It is also straightforward to verify that for such chosen $t$ and $\epsilon$, we have $K^p/\epsilon^{p-1} \le t/2$ as long as $p \ge 2$. For the second term, we have
$$\log(n\epsilon^{-p}K^p) = \log n - p\log\epsilon + p\log K = \log n - p\big(\log K + \tfrac12\log n - \tfrac12\log\log d\big) + p\log K = (-p/2 + 1)\log n + (p/2)\log\log d.$$
So for large enough $p$ and $K$, we need to have
$$\log d + (-p/2 + 1)\log n + (p/2)\log\log d \to -\infty.$$

Thus, letting $p = 2 + 2\gamma + \delta$, for $d = O(n^\gamma)$, with probability no smaller than $1 - 2d^{-2.5} - (\log d)^{p/2}n^{-\delta/2}$, we have
$$\|\bar\mu - \mu\|_\infty \le 12K\sqrt{\frac{\log d}{n}}.$$
This completes the proof.

Proof of Theorem II.2. Consider the distribution $X = (X_1, \ldots, X_d)$ with $X_1, \ldots, X_d$ independent and identically distributed. For reasons that will be clear later, we can set $C = 1$ and consider
$$\mathbb{P}(X_1 = \sqrt{n\log d}) = \tfrac12(n\log d)^{-p/2},\quad \mathbb{P}(X_1 = -\sqrt{n\log d}) = \tfrac12(n\log d)^{-p/2},\quad\text{and}\quad \mathbb{P}(X_1 = 0) = 1 - (n\log d)^{-p/2}. \tag{A.8}$$
We then have $\mu = 0$ and $\|X_1\|_{L_q} \to I(q = p) + \infty\cdot I(q > p)$. Moreover, letting
$$\alpha := \mathbb{P}\big(|\bar\mu_1| \ge \sqrt{\log d/n}\big) \ge 1 - \mathbb{P}(X_1 = \cdots = X_n = 0) \ge n(n\log d)^{-p/2}(1 + o(1)), \tag{A.9}$$
we have
$$\mathbb{P}\big(\|\bar\mu - \mu\|_\infty \ge \sqrt{\log d/n}\big) = d\alpha - \binom{d}{2}\alpha^2 + \binom{d}{3}\alpha^3 - \cdots = 1 - (1 - \alpha)^d.$$
When $p = 2 + 2\gamma$ and $d = n^{\gamma + \delta_0}$ with $\delta_0 > 0$ and some $0 < \delta_1 < \delta_0$, we have
$$(1 - \alpha)^d \le \big(1 - n^{-\gamma}(\log d)^{-p/2}\big)^{n^{\gamma+\delta_0}} \le \Big(1 - \frac{1}{n^{\gamma+\delta_1}}\Big)^{n^{\gamma+\delta_0}} \le (1/e)^{n^{\delta_0 - \delta_1}} \to 0.$$
Accordingly, with probability tending to 1, we have
$$\|\bar\mu - \mu\|_\infty \ge \sqrt{\log d/n}.$$
When $C \ne 1$, we can replace all terms $\sqrt{n\log d}$ in Equation (A.8) with $C\sqrt{n\log d}$ and all the proofs follow.
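For illustration, the lower-bound construction (A.8) is easy to simulate. The sketch below is ours; the constants $n$, $\gamma$, $\delta_0$, and the seed are arbitrary illustrative choices, and we take $C = 1$:

    import numpy as np

    # Three-point prior of (A.8): X takes values +-sqrt(n log d) with probability
    # (1/2)(n log d)^{-p/2} each and 0 otherwise, independently across coordinates.
    rng = np.random.default_rng(1)
    n, gamma, delta0 = 200, 0.5, 1.0
    p = 2 + 2 * gamma
    d = int(n ** (gamma + delta0))
    a = np.sqrt(n * np.log(d))
    prob = (n * np.log(d)) ** (-p / 2)
    signs = rng.choice([-1.0, 0.0, 1.0], size=(n, d),
                       p=[prob / 2, 1 - prob, prob / 2])
    mu_bar = (a * signs).mean(axis=0)          # sample mean; the true mean is 0
    threshold = np.sqrt(np.log(d) / n)
    print(np.abs(mu_bar).max() >= threshold * (1 - 1e-12))   # typically True

With these choices the expected number of nonzero entries is about $n^{\delta_0}/(\log d)^{\gamma+1} \approx 9$, so the event $\|\bar\mu - \mu\|_\infty \ge \sqrt{\log d/n}$ occurs in almost every run, matching the theorem's conclusion.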

D. Proofs of Theorems V.1 and V.2

Proof of Theorem V.1. For $j = 1, \ldots, d$, we denote $\mu_j := Q(X_j; 1/2)$. By definition, we have for $j = 1, \ldots, d$,
$$\sigma^M(X_j; r) = Q\big(|X_j - Q(X_j; 1/2)|; r\big) = Q\big(|X_j - \mu_j|; r\big).$$
In other words, we have
$$r = \mathbb{P}\big\{|X_j - \mu_j| \le \sigma^M(X_j; r)\big\} = \mathbb{P}\Big\{\frac{|X_j - \mu_j|}{\sigma(X_j)} \le \frac{\sigma^M(X_j; r)}{\sigma(X_j)}\Big\},$$
and accordingly we have
$$\frac{\sigma^M(X_j; r)}{\sigma(X_j)} = Q\Big(\frac{|X_j - \mu_j|}{\sigma(X_j)}; r\Big) \;\Rightarrow\; \sigma^M(X_j; r) = \sigma(X_j)\cdot Q\Big(\frac{|X_j - \mu_j|}{\sigma(X_j)}; r\Big).$$
Using a similar technique as above, we can further derive that
$$\sigma^M(X_j + X_k; r) = \sigma(X_j + X_k)\cdot Q\Big(\frac{|X_j + X_k - Q(X_j + X_k; 1/2)|}{\sigma(X_j + X_k)}; r\Big),\qquad \sigma^M(X_j - X_k; r) = \sigma(X_j - X_k)\cdot Q\Big(\frac{|X_j - X_k - Q(X_j - X_k; 1/2)|}{\sigma(X_j - X_k)}; r\Big).$$
Therefore, $R^{M;r} = c^{M;r}\Sigma$ if (V.1) holds, and by definition
$$c^{M;r} = \Big\{\frac{\sigma^M(X_j; r)}{\sigma(X_j)}\Big\}^2 = \Big\{Q\Big(\frac{|X_j - \mu_j|}{\sigma(X_j)}; r\Big)\Big\}^2.$$
This completes the proof.

Proof of Theorem V.2. The proof of Theorem V.2 is similar to that of Theorem V.1 and is accordingly omitted here.

E. Proof of Theorem V.3

Proof. We first prove the case for the gMAD estimator $R^{M;r}$. By the definition of the pair-elliptical and the discussions in Remark III.2, any pair in $PE_d(\mu, S, \xi)$ can be written as $EC_2(\mu_{\{j,k\}}, S_{\{j,k\},\{j,k\}}, \phi)$, where $\phi$ is a properly defined characteristic function only depending on $\xi$. On one hand, by Theorem 2.16 in (41), we have for any $j \in \{1, \ldots, d\}$,
$$\frac{X_j - \mu_j}{\sqrt{S_{jj}}} \sim EC_1(0, 1, \phi). \tag{A.10}$$
Because $X$ is continuous (by the definition of the pair-elliptical) and accordingly all quantiles of $EC_d(0, 1, \xi)$ are uniquely defined, using the same proof techniques as exploited in the proof of Theorem V.1, (A.10) shows that
$$R^{M;r}_{jj} = \Big\{Q\big(|EC_1(0, 1, \phi)|; r\big)\Big\}^2\cdot S_{jj}$$
for any $j \in \{1, \ldots, d\}$. On the other hand, for those $j \ne k$ such that $S_{jj} + S_{kk} + 2S_{jk} \ne 0$ and $S_{jj} + S_{kk} - 2S_{jk} \ne 0$, we have
$$X_j + X_k \sim EC_1(\mu_j + \mu_k, S_{jj} + S_{kk} + 2S_{jk}, \phi)\quad\text{and}\quad X_j - X_k \sim EC_1(\mu_j - \mu_k, S_{jj} + S_{kk} - 2S_{jk}, \phi),$$
and accordingly $X_j + X_k$ and $X_j - X_k$ are both elliptically distributed with the same characteristic function $\phi$. Therefore, we have
$$\frac{X_j + X_k - Q(X_j + X_k; 1/2)}{\sqrt{S_{jj} + S_{kk} + 2S_{jk}}} \stackrel{D}{=} \frac{X_j - X_k - Q(X_j - X_k; 1/2)}{\sqrt{S_{jj} + S_{kk} - 2S_{jk}}} \sim EC_1(0, 1, \phi),$$
because all the quantiles are unique. Accordingly, we have
$$\frac{X_j - \mu_j}{\sqrt{S_{jj}}} \stackrel{D}{=} \frac{X_j + X_k - Q(X_j + X_k; 1/2)}{\sqrt{S_{jj} + S_{kk} + 2S_{jk}}} \stackrel{D}{=} \frac{X_j - X_k - Q(X_j - X_k; 1/2)}{\sqrt{S_{jj} + S_{kk} - 2S_{jk}}}.$$
Thus, in parallel to the proof of Theorem V.1, we have
$$R^{M;r}_{jk} = \Big\{Q\big(|EC_1(0, 1, \phi)|; r\big)\Big\}^2\cdot S_{jk}.$$
For those $j \ne k \in \{1, \ldots, d\}$ such that $S_{jj} + S_{kk} - 2S_{jk} = 0$, we have $\mathrm{Cov}(X_j - X_k) = 0$, implying that $X_j = X_k$ a.s. Accordingly, we have
$$\sigma^M(X_j, X_k; r) = \Big\{\frac{\sigma^M(X_j + X_k; r)}{2}\Big\}^2 = \frac14\big(\sigma^M(X_j + X_k; r)\big)^2.$$
Accordingly, $R^{M;r}_{jk} = R^{M;r}_{jj} = \{Q(|EC_1(0, 1, \phi)|; r)\}^2\cdot S_{jj} = \{Q(|EC_1(0, 1, \phi)|; r)\}^2\cdot S_{jk}$. Similarly, for those $j \ne k \in \{1, \ldots, d\}$ such that $S_{jj} + S_{kk} + 2S_{jk} = 0$, we have $R^{M;r}_{jk} = -R^{M;r}_{jj} = -\{Q(|EC_1(0, 1, \phi)|; r)\}^2\cdot S_{jj} = \{Q(|EC_1(0, 1, \phi)|; r)\}^2\cdot S_{jk}$. This completes the proof of the first part.

Secondly, we switch to the gQNE estimator. We note that, by Remark III.2,
$$\mathbb{E}\exp\big(it^T(X_{\{j,k\}} - \tilde X_{\{j,k\}})\big) = \phi^2(t^T S_{\{j,k\},\{j,k\}}t),$$
and accordingly $X - \tilde X \sim PE_d(0, 2S, \phi^2)$ and is continuously distributed. Then, by following the same proof as in the first part, we have the desired result.

In the end, let us derive the scale constant. Using the same argument as above, we have
$$R^{M;r} = \big\{Q(|X_0|; r)\big\}^2\cdot\Sigma.$$
Because $X_0$ is continuous and symmetric, letting $q_r := Q(|X_0|; r)$, we have
$$\mathbb{P}(|X_0| \le q_r) = 2\mathbb{P}(X_0 \le q_r) - 1 = r \;\Rightarrow\; \mathbb{P}(X_0 \le q_r) = \frac{1 + r}{2} \;\Rightarrow\; q_r = Q\Big(X_0; \frac{1 + r}{2}\Big).$$
Similarly, we have
$$R^{Q;r} = \Big\{Q\Big(Z_0; \frac{1 + r}{2}\Big)\Big\}^2\cdot\Sigma.$$
This finalizes the proof.
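Under the Gaussian model the scale constant just derived is explicit: for $X_0 \sim N(0, 1)$, $q_r = Q(|X_0|; r) = \Phi^{-1}((1+r)/2)$, recovering the classical MAD correction $1/\Phi^{-1}(3/4) \approx 1.4826$ at $r = 1/2$. A short check (ours, illustration only):

    from statistics import NormalDist

    nd = NormalDist()
    for r in (0.5, 0.75, 0.9):
        q_r = nd.inv_cdf((1 + r) / 2)        # Q(|X0|; r) for X0 ~ N(0, 1)
        print(f"r = {r}: q_r = {q_r:.4f}, scale constant q_r**2 = {q_r ** 2:.4f}")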

F. Proof of Theorem V.5

Proof. Using Lemma A.2 and Assumption (A1), we have for any $j \in \{1, \ldots, d\}$,
$$\mathbb{P}\big(|\hat\sigma^M(X_j; r) - \sigma^M(X_j; r)| > t\big) \le 3\exp\big(-2n(\eta_1 t/2 - 1/n)^2\big) + 3\exp\big(-2n(\eta_1 t/2)^2\big),$$
whenever $t/2 \le \kappa_1$ and $\eta_1 t/2 > 1/n$. Accordingly, we have
$$\mathbb{P}\big(|\hat R^{M;r}_{jj} - R^{M;r}_{jj}| > t\big) = \mathbb{P}\big(\big|(\hat\sigma^M(X_j; r) + \sigma^M(X_j; r))(\hat\sigma^M(X_j; r) - \sigma^M(X_j; r))\big| > t\big) = \mathbb{P}\big(\big|(\hat\sigma^M(X_j; r) - \sigma^M(X_j; r))^2 + 2\sigma^M(X_j; r)(\hat\sigma^M(X_j; r) - \sigma^M(X_j; r))\big| > t\big) \le \mathbb{P}\Big(|\hat\sigma^M(X_j; r) - \sigma^M(X_j; r)| > \sqrt{t/2}\Big) + \mathbb{P}\Big(|\hat\sigma^M(X_j; r) - \sigma^M(X_j; r)| > \frac{t}{4\sigma^M(X_j; r)}\Big) \le 6\exp\big(-2n(\eta_1\sqrt{t/2}/2 - 1/n)^2\big) + 6\exp\big(-2n(\eta_1 t/(4\sigma^M(X_j; r)))^2\big). \tag{A.11}$$
Using Lemma A.2 and Assumption (A1), we have for any $j \ne k$,
$$\mathbb{P}\big(|\hat\sigma^M(X_j + X_k; r) - \sigma^M(X_j + X_k; r)| > t\big) \le 3\exp\big(-2n(\eta_1 t/2 - 1/n)^2\big) + 3\exp\big(-2n(\eta_1 t/2)^2\big),$$
$$\mathbb{P}\big(|\hat\sigma^M(X_j - X_k; r) - \sigma^M(X_j - X_k; r)| > t\big) \le 3\exp\big(-2n(\eta_1 t/2 - 1/n)^2\big) + 3\exp\big(-2n(\eta_1 t/2)^2\big).$$
Accordingly, letting $\theta_{\max} := 2\sqrt{\|R^{M;r}\|_{\max}}$, we have
$$\mathbb{P}\big(|\hat R^{M;r}_{jk} - c^{M;r}\Sigma_{jk}| > t\big) \le \mathbb{P}\big(\big|(\hat\sigma^M(X_j + X_k; r))^2 - (\sigma^M(X_j + X_k; r))^2\big| > 2t\big) + \mathbb{P}\big(\big|(\hat\sigma^M(X_j - X_k; r))^2 - (\sigma^M(X_j - X_k; r))^2\big| > 2t\big) \le \mathbb{P}\big(|\hat\sigma^M(X_j + X_k; r) - \sigma^M(X_j + X_k; r)| > \sqrt{t}\big) + \mathbb{P}\Big(|\hat\sigma^M(X_j + X_k; r) - \sigma^M(X_j + X_k; r)| > \frac{t}{2\sigma^M(X_j + X_k; r)}\Big) + \mathbb{P}\big(|\hat\sigma^M(X_j - X_k; r) - \sigma^M(X_j - X_k; r)| > \sqrt{t}\big) + \mathbb{P}\Big(|\hat\sigma^M(X_j - X_k; r) - \sigma^M(X_j - X_k; r)| > \frac{t}{2\sigma^M(X_j - X_k; r)}\Big) \le 6\exp\big(-2n(\eta_1\sqrt{t}/2 - 1/n)^2\big) + 6\exp(-n\eta_1^2 t/2) + 6\exp\big(-2n(\eta_1 t/(2\theta_{\max}) - 1/n)^2\big) + 6\exp\big(-n\eta_1^2 t^2/(2\theta_{\max}^2)\big) \le 24\max\Big\{\exp\big(-2n(\eta_1\sqrt{t}/2 - 1/n)^2\big),\; \exp\big(-2n(\eta_1 t/(2\theta_{\max}) - 1/n)^2\big)\Big\}. \tag{A.12}$$
Combining Equations (A.11) and (A.12), we have, with probability $1 - 24\alpha$,
$$\|\hat R^{M;r} - R^{M;r}\|_{\max} \le \max\Bigg\{\underbrace{\frac{6}{\eta_1^2}\bigg(\sqrt{\frac{\log d + \log(1/\alpha)}{n}} + \frac1n\bigg)^2}_{T_1},\; \underbrace{\frac{4\sqrt{\|R^{M;r}\|_{\max}}}{\eta_1}\bigg(\sqrt{\frac{\log d + \log(1/\alpha)}{n}} + \frac1n\bigg)}_{T_2}\Bigg\},$$
whenever $n$ is large enough such that $T_1 \le 8\kappa_1^2$ and $T_2 \le 2\kappa_1\cdot\min_{j\ne k}\{2\sigma^M(X_j; r), \sigma^M(X_j + X_k; r), \sigma^M(X_j - X_k; r)\}$. Combining the above inequality with Theorem V.3, we complete the whole proof.

G. Proof of Theorem V.6

Proof. Using Lemma A.3 and Assumption (A2), we have for any $j \in \{1, \ldots, d\}$,
$$\mathbb{P}\big(|\hat\sigma^Q(X; r) - \sigma^Q(X; r)| > t\big) \le \exp\big(-n[\eta_2 t - 1/n]^2\big) + \exp\big(-n(\eta_2 t)^2\big),$$
whenever $t \le \kappa_2$ and $\eta_2 t > 1/n$. Using a similar proof technique as in the proof of Theorem V.5, we have
$$\mathbb{P}\big(|\hat R^{Q;r}_{jj} - R^{Q;r}_{jj}| > t\big) \le \mathbb{P}\Big(|\hat\sigma^Q(X_j; r) - \sigma^Q(X_j; r)| > \sqrt{t/2}\Big) + \mathbb{P}\Big(|\hat\sigma^Q(X_j; r) - \sigma^Q(X_j; r)| > \frac{t}{2\sigma^Q(X_j; r)}\Big) \le 2\exp\big(-n[\eta_2\sqrt{t/2} - 1/n]^2\big) \tag{A.13}$$
$$\quad + 2\exp\big(-n(\eta_2 t/(2\sigma^Q(X_j; r)) - 1/n)^2\big). \tag{A.14}$$
Similarly, we have
$$\mathbb{P}\big(|\hat R^{Q;r}_{jk} - R^{Q;r}_{jk}| > t\big) \le \mathbb{P}\big(|\hat\sigma^Q(X_j + X_k; r) - \sigma^Q(X_j + X_k; r)| > \sqrt{t}\big) + \mathbb{P}\big(|\hat\sigma^Q(X_j - X_k; r) - \sigma^Q(X_j - X_k; r)| > \sqrt{t}\big) + \mathbb{P}\Big(|\hat\sigma^Q(X_j + X_k; r) - \sigma^Q(X_j + X_k; r)| > \frac{t}{\sigma^Q(X_j + X_k; r)}\Big) + \mathbb{P}\Big(|\hat\sigma^Q(X_j - X_k; r) - \sigma^Q(X_j - X_k; r)| > \frac{t}{\sigma^Q(X_j - X_k; r)}\Big) \le 4\exp\big(-n[\eta_2\sqrt{t} - 1/n]^2\big) + 4\exp\big(-n(\eta_2 t/\zeta_{\max} - 1/n)^2\big), \tag{A.15}$$
where $\zeta_{\max} := 2\sqrt{\|R^{Q;r}\|_{\max}}$. Combining (A.13) and (A.15) leads to that, with probability larger than or equal to $1 - 8\alpha$,
$$\|\hat R^{Q;r} - R^{Q;r}\|_{\max} \le \max\Bigg\{\underbrace{\frac{2}{\eta_2^2}\bigg(\sqrt{\frac{2\log d + \log(1/\alpha)}{n}} + \frac1n\bigg)^2}_{T_3},\; \underbrace{\frac{2\sqrt{\|R^{Q;r}\|_{\max}}}{\eta_2}\bigg(\sqrt{\frac{2\log d + \log(1/\alpha)}{n}} + \frac1n\bigg)}_{T_4}\Bigg\},$$
whenever $T_3 \le 2\kappa_2^2$ and $T_4 \le 2\kappa_2\cdot\min_{j\ne k}\{2\sigma^Q(X_j; r), \sigma^Q(X_j + X_k; r), \sigma^Q(X_j - X_k; r)\}$. Finally, combining the above inequality with Theorem V.3, we complete the whole proof.

In the remaining subsections we provide the proofs of the results presented in Section VII.

H. Proof of Corollary VII.1

Proof. We first prove that (VII.4) holds. Because $R^{Q;r}$ is feasible for Equation (VII.2), we have
$$\|\hat R^{Q;r} - \tilde R^{Q;r}\|_{\max} \le \|\hat R^{Q;r} - R^{Q;r}\|_{\max},$$
implying that
$$\mathbb{P}\big(|\tilde R^{Q;r}_{jk} - R^{Q;r}_{jk}| \ge t\big) \le \mathbb{P}\big(|\tilde R^{Q;r}_{jk} - \hat R^{Q;r}_{jk}| + |\hat R^{Q;r}_{jk} - R^{Q;r}_{jk}| \ge t\big) \le \mathbb{P}\big(\|\tilde R^{Q;r} - \hat R^{Q;r}\|_{\max} + \|\hat R^{Q;r} - R^{Q;r}\|_{\max} \ge t\big) \le \mathbb{P}\big(\|\hat R^{Q;r} - R^{Q;r}\|_{\max} \ge t/2\big).$$
Combined with (A.13) and (A.15), the above inequality implies that
$$\mathbb{P}\big(\|\tilde R^{Q;r} - R^{Q;r}\|_{\max} \ge t\big) \le d^2\Big(4\exp\big(-n[\eta_2\sqrt{t}/4 - 1/n]^2\big) + 4\exp\big(-n(\eta_2 t/(2\zeta_{\max}) - 1/n)^2\big)\Big).$$
Plugging $t = \zeta$ into the above equation, we have the desired result.

Secondly, we prove that (VII.5) holds. Because $\mathbb{E}X_j^4 \le K$, by Chebyshev's inequality, with probability no smaller than $1 - n^{-2\xi}$, we have
$$|\hat\sigma_j^2 - \sigma^2(X_j)| \le c_1 n^{-1/2+\xi}.$$
Moreover, using (A.13), for any given $j \in \{1, \ldots, d\}$, by Markov's inequality, with probability larger than or equal to $1 - 4\alpha$,
$$|\hat R^{Q;r}_{jj} - R^{Q;r}_{jj}| \le \zeta_j.$$
For notational simplicity, we denote $\sigma_j := \sigma(X_j)$, $r_j := R^{Q;r}_{jj}$, and $\hat r_j := \hat R^{Q;r}_{jj}$. Accordingly, we have
$$\|\hat\Sigma^{Q;r} - \Sigma\|_{\max} = \Big\|\frac{\hat\sigma_j^2}{\hat r_j}\tilde R^{Q;r} - \frac{\sigma_j^2}{r_j}R^{Q;r}\Big\|_{\max} \le \Big\|\frac{\hat\sigma_j^2}{\hat r_j}\tilde R^{Q;r} - \frac{\sigma_j^2}{r_j}\tilde R^{Q;r}\Big\|_{\max} + \Big\|\frac{\sigma_j^2}{r_j}\big(\tilde R^{Q;r} - R^{Q;r}\big)\Big\|_{\max} \le \Big|\frac{\hat\sigma_j^2}{\hat r_j} - \frac{\sigma_j^2}{r_j}\Big|\cdot\big(\|\tilde R^{Q;r} - R^{Q;r}\|_{\max} + \|R^{Q;r}\|_{\max}\big) + \frac{1}{c^{Q;r}}\|\tilde R^{Q;r} - R^{Q;r}\|_{\max},$$
while noting that $c^{Q;r} = r_j/\sigma_j^2$. Finally, we have
$$\frac{\hat\sigma_j^2}{\hat r_j} - \frac{\sigma_j^2}{r_j} = \frac{\hat\sigma_j^2 - \sigma_j^2}{\hat r_j} + \sigma_j^2\Big(\frac{1}{\hat r_j} - \frac{1}{r_j}\Big) \le \frac{|\hat\sigma_j^2 - \sigma_j^2|}{r_j - |\hat r_j - r_j|} + \frac{\sigma_j^2}{r_j}\cdot\frac{|\hat r_j - r_j|}{r_j - |\hat r_j - r_j|}.$$
This implies that with probability no smaller than $1 - n^{-2\xi} - 12\alpha$, we have
$$\Big|\frac{\hat\sigma_j^2}{\hat r_j} - \frac{\sigma_j^2}{r_j}\Big| \le \frac{c_1 n^{-1/2+\xi}}{r_j - \zeta_j} + \frac{1}{c^{Q;r}}\cdot\frac{\zeta_j}{r_j - \zeta_j},$$
which completes the proof of the second part. The third part can then be proved by combining Equation (VII.5) and the proof of Theorem 2 in (56).

I. Proof of Corollary VII.4

Proof. According to Theorem 6 in (40), if the tuning parameter $\lambda \ge \|\Theta\|_{1,\infty}\|\hat\Sigma^{Q;r} - \Sigma\|_{\max}$, then there exist two constants $C_1, C_2$ such that
$$\|\hat\Theta^{Q;r} - \Theta\|_2 \le C_1 s\lambda^{1-q},\quad \|\hat\Theta^{Q;r} - \Theta\|_{\max} \le \|\Theta\|_{1,\infty}\lambda,\quad\text{and}\quad \|\hat\Theta^{Q;r} - \Theta\|_F^2 \le C_2 sd\lambda^{2-q}.$$
Combining the above results with Corollary VII.1, we prove the desired rate.

J. Proof of Corollary VII.5

Proof. According to Theorem 3.1 in (67), if $\lambda \ge \|\hat R - R\|_{\max}$ and $\max_{1\le k\le m}\|u_k(\Sigma)\|_0 \le s$, we have
$$\|\hat\Pi^{Q;r}_m - \Pi_m\|_F \le \frac{4s\lambda}{\lambda_1(R) - \lambda_2(R)}, \tag{A.16}$$
where $\Pi^{Q;r}_m$, $R$, and $\hat R$ can be the Q-PCA estimator and the population and sample versions of the quantile-based scatter matrix estimators based on the gQNE. Combining this with the rate of $\|\hat R - R\|_{\max}$ in Theorems V.5 and V.6, the theorem is proved.

K. Proof of Corollary VII.6

Proof. We mainly follow the procedure in the proof of Theorem 2 in (71). Recall that $\delta^T\beta^* = \Delta_d \ge M$, where $\delta = \mu_1 - \mu_2$, and let $\hat\delta = \hat\mu_1 - \hat\mu_2$. Since $\hat\beta^{Q;r}$ is the solution of (VII.9), we have
$$|(\beta^*)^T\hat\Sigma^{Q;r}\hat\beta^{Q;r} - (\beta^*)^T\delta| \le (\lambda + \|\hat\delta - \delta\|_\infty)\|\beta^*\|_1 \le 2\lambda\|\beta^*\|_1,$$
and
$$|(\beta^*)^T\hat\Sigma^{Q;r}\hat\beta^{Q;r} - (\hat\beta^{Q;r})^T\delta| \le (\lambda + \|\hat\delta - \delta\|_\infty)\|\hat\beta^{Q;r}\|_1.$$
Combining the above two equations, we have $|(\hat\beta^{Q;r} - \beta^*)^T\delta| \le 4\lambda\|\beta^*\|_1$, which implies
$$\frac{\big((\hat\beta^{Q;r})^T\delta\big)^2}{(\delta^T\beta^*)^2} \ge 1 - 8M^{-1}\lambda\|\beta^*\|_1. \tag{A.17}$$
On the other hand, we have
$$\|\Sigma\hat\beta^{Q;r} - \delta\|_\infty \le \|(\Sigma - \hat\Sigma^{Q;r})\hat\beta^{Q;r}\|_\infty + 2\lambda \le \|\beta^*\|_1\|\Sigma - \hat\Sigma^{Q;r}\|_{\max} + 2\lambda,$$
which implies that
$$|(\hat\beta^{Q;r})^T\Sigma\hat\beta^{Q;r} - \delta^T\beta^*| \le |(\hat\beta^{Q;r})^T\Sigma\hat\beta^{Q;r} - \delta^T\hat\beta^{Q;r}| + |(\hat\beta^{Q;r} - \beta^*)^T\delta| \le \|\beta^*\|_1^2\|\Sigma - \hat\Sigma^{Q;r}\|_{\max} + 6\lambda\|\beta^*\|_1, \tag{A.18}$$
where the last inequality is due to the definition of $\hat\beta^{Q;r}$. Therefore, for sufficiently large $n$, we have $(\hat\beta^{Q;r})^T\Sigma\hat\beta^{Q;r} \ge M/2$. We can rewrite (A.18) as
$$\frac{\delta^T\beta^*}{(\hat\beta^{Q;r})^T\Sigma\hat\beta^{Q;r}} \ge 1 - 2M^{-1}\|\beta^*\|_1^2\|\Sigma - \hat\Sigma^{Q;r}\|_{\max}. \tag{A.19}$$
Combining (A.17) and (A.19), we have
$$\frac{R_q(\hat\beta^{Q;r})}{R_q(\beta^*)} \ge 1 - 2M^{-1}\|\beta^*\|_1^2\|\Sigma - \hat\Sigma^{Q;r}\|_{\max} - 20M^{-1}\lambda\|\beta^*\|_1.$$
This completes the proof.

This section describes the algorithm to solve (IV.4) for the matrix element-wise supremum norm $\|\cdot\|_{\max}$. In particular, we aim to solve
$$\tilde R = \mathop{\arg\min}_{R\succeq 0}\|R - \hat R\|_{\max} \tag{A.20}$$
by the algorithm proposed in (77). For any matrix $A \in \mathbb{R}^{d\times d}$ we can reformulate $\|A\|_{\max} = \max_{Z\in B_1}\mathrm{Tr}(Z^TA)$, where $B_1 = \{Z \in \mathbb{R}^{d\times d}\mid Z = Z^T, \sum_{i,j=1,\ldots,d}|Z_{ij}| \le 1\}$. We define $B_2 = \{Z \in \mathbb{R}^{d\times d}\mid Z \succeq 0\}$, and (A.20) can be rewritten as the minimax problem
$$\min_{R\in B_2}\max_{Z\in B_1}\mathrm{Tr}\big(Z^T(R - \hat R)\big). \tag{A.21}$$
In order to solve the above minimax problem, we first need to study two subproblems on matrix projection. The first is
$$P_{B_2}(A) = \mathop{\arg\min}_{B\in B_2}\|A - B\|_F^2, \tag{A.22}$$
where $\|\cdot\|_F$ is the Frobenius norm. We have the closed form solution to (A.22): $P_{B_2}(A) = U\Lambda_+U^T$, where $A = U\Lambda U^T$ is the spectral decomposition of $A$ and $\Lambda_+ = \mathrm{diag}(\Lambda_{11}\vee 0, \ldots, \Lambda_{dd}\vee 0)$. The second subproblem we are interested in is $P_{B_1}(A) = \arg\min_{B\in B_1}\|A - B\|_F^2$. This is equivalent to the vectorized problem $P_{B_1}(A) = \arg\min_{\|v\|_1\le 1}\|\mathrm{vec}(A) - v\|_2$. We denote $a = \mathrm{vec}(A)$ and $|a| = \mathrm{sign}(a)\circ a$, where $\circ$ is the Hadamard product. Let $T_{|a|}$ be the permutation transformation matrix of $|a|$ such that $\tilde a := T_{|a|}(|a|)$ is in descending order. We now define $\tilde x$, $\tilde y$ under two cases: $\|a\|_1 \le 1$ and $\|a\|_1 > 1$.

If $\|a\|_1 \le 1$, we let $(\tilde x, \tilde y) := (a, 0)$. If $\|a\|_1 > 1$, we define $\Delta\tilde a = (\tilde a_1 - \tilde a_2, \ldots, \tilde a_{d-1} - \tilde a_d, \tilde a_d)^T$. Since $\Delta\tilde a_i > 0$ for all $i = 1, \ldots, d$ and $\sum_i i\Delta\tilde a_i = \|a\|_1 > 1$, we can choose the smallest integer $K$ such that $\sum_{i=1}^K i\Delta\tilde a_i \ge 1$. Let
$$\tilde y := \frac1K\Big(\sum_{i=1}^K \tilde a_i - 1\Big),\qquad \tilde x := (\tilde a_1 - \tilde y, \ldots, \tilde a_K - \tilde y, 0, \ldots, 0)^T \in \mathbb{R}^d.$$
According to (77), we can write $P_{B_1}(A) = \mathrm{sign}(a)\circ T^{-1}_{|a|}(\tilde x)$.

We set arbitrary $Z^0 \in B_1$ and $R^0 \in B_2$ as the initializations of the algorithm, $\gamma \in (0, 2)$ as the step length of each iteration, $\epsilon > 0$ as the tolerance, and $N$ as the maximum number of iterations. Algorithm 1 provides convergence to the exact solution of (A.20); the guarantee is stated in Theorem A.4 below.

Algorithm 1 Matrix nearness problem in the maximum norm in (A.20)
  $\tilde R \leftarrow$ MatrixMaxProj($\hat R$, $Z^0$, $R^0$, $\gamma$, $\epsilon$, $N$)
  for $t = 0, \ldots, N$ do
    $R_0^t \leftarrow R^t - P_{B_2}(R^t - Z^t)$
    $Z_0^t \leftarrow Z^t - P_{B_1}(R^t + Z^t - \hat R)$
    if $\max\{\|R_0^t\|_{\max}, \|Z_0^t\|_{\max}\} < \epsilon$ then
      break
    else
      $R^{t+1} \leftarrow R^t - \gamma(R_0^t - Z_0^t)/2$
      $Z^{t+1} \leftarrow Z^t - \gamma(R_0^t + Z_0^t)/2$
    end if
  end for
  return $\tilde R = R^t$
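For reference, Algorithm 1 transcribes directly into a few lines of Python. This is our sketch only: we replace the $T_{|a|}$ bookkeeping by the standard sort-and-threshold projection onto the $\ell_1$ ball, which computes the same $P_{B_1}$, and the function names are hypothetical.

    import numpy as np

    def proj_l1_ball(a, radius=1.0):
        # P_B1 on the vectorized matrix: Euclidean projection onto {||v||_1 <= radius},
        # via the usual sort-and-threshold rule (equivalent to the T_{|a|} construction).
        if np.abs(a).sum() <= radius:
            return a.copy()
        u = np.sort(np.abs(a))[::-1]
        css = np.cumsum(u)
        k = np.nonzero(u * np.arange(1, a.size + 1) > css - radius)[0][-1]
        theta = (css[k] - radius) / (k + 1.0)
        return np.sign(a) * np.maximum(np.abs(a) - theta, 0.0)

    def proj_psd(A):
        # P_B2: spectral truncation U diag(Lambda_ii v 0) U^T.
        w, V = np.linalg.eigh((A + A.T) / 2)
        return (V * np.maximum(w, 0.0)) @ V.T

    def matrix_max_proj(R_hat, gamma=1.0, eps=1e-6, N=1000):
        # Projection-contraction iteration of Algorithm 1 for (A.20).
        d = R_hat.shape[0]
        R, Z = np.zeros((d, d)), np.zeros((d, d))    # R^0 in B2, Z^0 in B1
        for _ in range(N):
            R0 = R - proj_psd(R - Z)
            Z0 = Z - proj_l1_ball((R + Z - R_hat).ravel()).reshape(d, d)
            if max(np.abs(R0).max(), np.abs(Z0).max()) < eps:
                break
            R = R - gamma * (R0 - Z0) / 2
            Z = Z - gamma * (R0 + Z0) / 2
        return R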

We have
$$\|R^{t+1} - R^{\mathrm{opt}}\|_F^2 + \|Z^{t+1} - Z^{\mathrm{opt}}\|_F^2 \le \|R^t - R^{\mathrm{opt}}\|_F^2 + \|Z^t - Z^{\mathrm{opt}}\|_F^2 - \frac{\gamma(2-\gamma)}{2}\big(\|R_0^t\|_F^2 + \|Z_0^t\|_F^2\big).$$

REFERENCES

[1] P. J. Bickel and E. Levina, "Covariance regularization by thresholding," The Annals of Statistics, vol. 36, no. 6, pp. 2577–2604, 2008.
[2] ——, "Regularized estimation of large covariance matrices," The Annals of Statistics, vol. 36, no. 1, pp. 199–227, 2008. [Online]. Available: http://dx.doi.org/10.1214/009053607000000758
[3] T. T. Cai, C.-H. Zhang, and H. H. Zhou, "Optimal rates of convergence for covariance matrix estimation," The Annals of Statistics, vol. 38, no. 4, pp. 2118–2144, 2010.
[4] T. T. Cai and H. H. Zhou, "Minimax estimation of large covariance matrices under ℓ1 norm," Statistica Sinica, vol. 22, pp. 1319–1378, 2012.
[5] K. Lounici, "High-dimensional covariance matrix estimation with missing observations," Bernoulli (to appear), 2014.
[6] W.-K. Chen, The Circuits and Filters Handbook. CRC Press, 2002.
[7] B. O. Bradley and M. S. Taqqu, "Financial risk and heavy tails," Handbook of Heavy Tailed Distributions in Finance, pp. 35–103, 2003.
[8] F. Han and H. Liu, "Scale-invariant sparse PCA on high dimensional meta-elliptical data," Journal of the American Statistical Association, vol. 109, no. 505, pp. 275–287, 2014.
[9] H. Liu, F. Han, M. Yuan, J. Lafferty, and L. Wasserman, "High-dimensional semiparametric Gaussian copula graphical models," The Annals of Statistics, vol. 40, no. 4, pp. 2293–2326, 2012.
[10] L. Xue and H. Zou, "Regularized rank-based estimation of high-dimensional nonparanormal graphical models," The Annals of Statistics, vol. 40, no. 5, pp. 2541–2571, 2012.
[11] F. Han and H. Liu, "Semiparametric principal component analysis," in Advances in Neural Information Processing Systems, 2012, pp. 171–179.
[12] F. Han, T. Zhao, and H. Liu, "CODA: high dimensional copula discriminant analysis," The Journal of Machine Learning Research, vol. 14, no. 1, pp. 629–671, 2013.
[13] F. Han and H. Liu, "Transelliptical component analysis," in Advances in Neural Information Processing Systems 25, 2012, pp. 368–376.
[14] H. Liu, F. Han, and C.-H. Zhang, "Transelliptical graphical models," in Advances in Neural Information Processing Systems, 2012, pp. 809–817.
[15] F. Han and H. Liu, "Optimal rates of convergence for latent generalized correlation matrix estimation in transelliptical distribution," arXiv preprint arXiv:1305.6916, 2013.
[16] Z. Wang, H. Liu, and T. Zhang, "Optimal computational and statistical rates of convergence for sparse nonconvex learning problems," The Annals of Statistics, vol. 42, no. 6, pp. 2164–2201, 2014. [Online]. Available: http://dx.doi.org/10.1214/14-AOS1238
[17] J. Fan, T. Ke, H. Liu, and L. Xia, "QUADRO: A supervised dimension reduction method via Rayleigh quotient optimization," arXiv preprint arXiv:1311.5542, 2013.
[18] J. Fan, F. Han, and H. Liu, "PAGE: robust pattern guided estimation of large covariance matrix," technical report, 2014.
[19] O. Catoni, "Challenging the empirical mean and empirical variance: a deviation study," Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, vol. 48, no. 4, pp. 1148–1185, 2012.
[20] R.-Z. Li, K.-T. Fang, and L.-X. Zhu, "Some Q-Q probability plots to test spherical and elliptical symmetry," Journal of Computational and Graphical Statistics, vol. 6, no. 4, pp. 435–450, 1997.
[21] V. Koltchinskii and L. Sakhanenko, "Testing for ellipsoidal symmetry of a multivariate distribution," in High Dimensional Probability II. Springer, 2000, pp. 493–510.
[22] L. Sakhanenko, "Testing for ellipsoidal symmetry: A comparison study," Computational Statistics and Data Analysis, vol. 53, no. 2, pp. 565–581, 2008.
[23] F. W. Huffer and C. Park, "A test for elliptical symmetry," Journal of Multivariate Analysis, vol. 98, no. 2, pp. 256–281, 2007.
[24] A. Batsidis, N. Martin, L. Pardo, and K. Zografos, "A necessary power divergence-type family of tests for testing elliptical symmetry," Journal of Statistical Computation and Simulation, vol. 84, no. 1, pp. 57–83, 2014.
[25] S. Holm, "A simple sequentially rejective multiple test procedure," Scandinavian Journal of Statistics, vol. 6, no. 2, pp. 65–70, 1979.
[26] D. Donoho and J. Jin, "Higher criticism for detecting sparse heterogeneous mixtures," The Annals of Statistics, vol. 32, no. 3, pp. 962–994, 2004.
[27] P. Hall and J. Jin, "Innovated higher criticism for detecting sparse signals in correlated noise," The Annals of Statistics, vol. 38, no. 3, pp. 1686–1732, 2010.
[28] R. A. Maronna, R. D. Martin, and V. J. Yohai, Robust Statistics: Theory and Methods. J. Wiley, 2006.
[29] F. R. Hampel, "The influence curve and its role in robust estimation," Journal of the American Statistical Association, vol. 69, no. 346, pp. 383–393, 1974.
[30] P. J. Rousseeuw and C. Croux, "Alternatives to the median absolute deviation," Journal of the American Statistical Association, vol. 88, no. 424, pp. 1273–1283, 1993.
[31] C. Croux and A. Ruiz-Gazen, "High breakdown estimators for principal components: the projection-pursuit approach revisited," Journal of Multivariate Analysis, vol. 95, no. 1, pp. 206–226, 2005.
[32] P. J. Huber and E. Ronchetti, Robust Statistics, 2nd Edition. Wiley, 2009.
[33] R. Gnanadesikan and J. R. Kettenring, "Robust estimates, residuals, and outlier detection with multiresponse data," Biometrics, vol. 28, no. 1, pp. 81–124, 1972.
[34] M. G. Genton and Y. Ma, "Robustness properties of dispersion estimators," Statistics and Probability Letters, vol. 44, no. 4, pp. 343–350, 1999.
[35] Y. Ma and M. G. Genton, "Highly robust estimation of dispersion matrices," Journal of Multivariate Analysis, vol. 78, no. 1, pp. 11–36, 2001.
[36] R. A. Maronna and R. H. Zamar, "Robust estimates of location and dispersion for high-dimensional datasets," Technometrics, vol. 44, no. 4, pp. 307–317, 2002.
[37] L. Wang, Y. Wu, and R. Li, "Quantile regression for analyzing heterogeneity in ultra-high dimension," Journal of the American Statistical Association, vol. 107, no. 497, pp. 214–222, 2012.
[38] A. Belloni and V. Chernozhukov, "ℓ1-penalized quantile regression in high-dimensional sparse models," The Annals of Statistics, vol. 39, no. 1, pp. 82–130, 2011.
[39] L. Wang, "L1 penalized LAD estimator for high dimensional linear regression," Journal of Multivariate Analysis, vol. 120, pp. 135–151, 2013.
[40] T. Cai, W. Liu, and X. Luo, "A constrained ℓ1 minimization approach to sparse precision matrix estimation," Journal of the American Statistical Association, vol. 106, no. 494, pp. 594–607, 2011.
[41] K.-T. Fang, S. Kotz, and K. W. Ng, Symmetric Multivariate and Related Distributions. Monographs on Statistics and Applied Probability. London: Chapman and Hall, 1990.
[42] J. Owen and R. Rabinovitch, "On the class of elliptical distributions and their applications to the theory of portfolio choice," The Journal of Finance, vol. 38, no. 3, pp. 745–752, 1983.
[43] J. B. Berk, "Necessary conditions for the CAPM," Journal of Economic Theory, vol. 73, no. 1, pp. 245–257, 1997.
[44] A. J. McNeil, R. Frey, and P. Embrechts, Quantitative Risk Management: Concepts, Techniques, and Tools. Princeton University Press, 2010.
[45] P. Embrechts, A. McNeil, and D. Straumann, "Correlation and dependence in risk management: properties and pitfalls," Risk Management: Value at Risk and Beyond, pp. 176–223, 2002.
[46] D. B. Marden and D. G. Manolakis, "Using elliptically contoured distributions to model hyperspectral imaging data and generate statistically similar synthetic data," in Defense and Security. International Society for Optics and Photonics, 2004, pp. 558–572.
[47] J. Frontera-Pons, M. Mahot, J. P. Ovarlez, F. Pascal, S. K. Pang, and J. Chanussot, "A class of robust estimates for detection in hyperspectral images using elliptical distributions background," in Geoscience and Remote Sensing Symposium (IGARSS), 2012 IEEE International. IEEE, 2012, pp. 4166–4169.
[48] G. Frahm, "Generalized elliptical distributions: theory and applications," Ph.D. dissertation, Universität zu Köln, 2004.
[49] H.-B. Fang, K.-T. Fang, and S. Kotz, "The meta-elliptical distributions with given marginals," Journal of Multivariate Analysis, vol. 82, no. 1, pp. 1–16, 2002.
[50] H. Liu, J. Lafferty, and L. Wasserman, "The nonparanormal: Semiparametric estimation of high dimensional undirected graphs," The Journal of Machine Learning Research, vol. 10, pp. 2295–2328, 2009.
[51] A. B. Tsybakov, Introduction to Nonparametric Estimation. Springer New York, 2009.
[52] Z. Bai and Y. Yin, "Limit of the smallest eigenvalue of a large dimensional sample covariance matrix," The Annals of Probability, vol. 21, no. 3, pp. 1275–1294, 1993.
[53] I. M. Johnstone, "On the distribution of the largest eigenvalue in principal components analysis," The Annals of Statistics, vol. 29, no. 2, pp. 295–327, 2001.
[54] N. El Karoui, "Operator norm consistent estimation of large-dimensional sparse covariance matrices," The Annals of Statistics, vol. 36, no. 6, pp. 2717–2756, 2008. [Online]. Available: http://dx.doi.org/10.1214/07-AOS559
[55] A. J. Rothman, E. Levina, and J. Zhu, "Generalized thresholding of large covariance matrices," Journal of the American Statistical Association, vol. 104, no. 485, pp. 177–186, 2009.
[56] L. Xue, S. Ma, and H. Zou, "Positive-definite ℓ1-penalized estimation of large covariance matrices," Journal of the American Statistical Association, vol. 107, no. 500, pp. 1480–1491, 2012.
[57] H. Liu, L. Wang, and T. Zhao, "Sparse covariance matrix estimation with eigenvalue constraints," Journal of Computational and Graphical Statistics (to appear), 2013.
[58] T. T. Cai and H. H. Zhou, "Optimal rates of convergence for sparse covariance matrix estimation," The Annals of Statistics, vol. 40, no. 5, pp. 2389–2420, 2012.
[59] N. Meinshausen and P. Bühlmann, "High-dimensional graphs and variable selection with the lasso," The Annals of Statistics, vol. 34, no. 3, pp. 1436–1462, 2006. [Online]. Available: http://dx.doi.org/10.1214/009053606000000281
[60] M. Yuan and Y. Lin, "Model selection and estimation in the Gaussian graphical model," Biometrika, vol. 94, no. 1, pp. 19–35, 2007.
[61] J. Friedman, T. Hastie, and R. Tibshirani, "Sparse inverse covariance estimation with the graphical Lasso," Biostatistics, vol. 9, no. 3, pp. 432–441, 2008.
[62] O. Banerjee, L. El Ghaoui, and A. d'Aspremont, "Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data," The Journal of Machine Learning Research, vol. 9, pp. 485–516, 2008.
[63] C. Lam and J. Fan, "Sparsistency and rates of convergence in large covariance matrix estimation," The Annals of Statistics, vol. 37, no. 6B, pp. 4254–4278, 2009. [Online]. Available: http://www.jstor.org/stable/25662231
[64] M. Yuan, "High dimensional inverse covariance matrix estimation via linear programming," The Journal of Machine Learning Research, vol. 11, pp. 2261–2286, 2010.
[65] T. T. Cai, W. Liu, and H. H. Zhou, "Estimating sparse precision matrix: Optimal rates of convergence and adaptive estimation," The Annals of Statistics (to appear), 2014.
[66] I. M. Johnstone and A. Y. Lu, "On consistency and sparsity for principal components analysis in high dimensions," Journal of the American Statistical Association, vol. 104, no. 486, pp. 682–693, 2009.
[67] V. Q. Vu, J. Cho, J. Lei, and K. Rohe, "Fantope projection and selection: A near-optimal convex relaxation of sparse PCA," in Advances in Neural Information Processing Systems, 2013, pp. 2670–2678.
[68] Y. Guo, T. Hastie, and R. Tibshirani, "Regularized discriminant analysis and its application in microarrays," Biostatistics, vol. 1, no. 1, pp. 1–18, 2005.
[69] J. Fan and Y. Fan, "High-dimensional classification using features annealed independence rules," The Annals of Statistics, vol. 36, no. 6, pp. 2605–2637, 2008. [Online]. Available: http://dx.doi.org/10.1214/07-AOS504
[70] J. Shao, Y. Wang, X. Deng, and S. Wang, "Sparse linear discriminant analysis by thresholding for high dimensional data," The Annals of Statistics, vol. 39, no. 2, pp. 1241–1265, 2011. [Online]. Available: http://dx.doi.org/10.1214/10-AOS870
[71] T. Cai and W. Liu, "A direct estimation approach to sparse linear discriminant analysis," Journal of the American Statistical Association, vol. 106, no. 496, pp. 1566–1577, 2011.
[72] J. Fan, Y. Feng, and X. Tong, "A ROAD to classification in high dimensional space," Journal of the Royal Statistical Society: Series B, Statistical Methodology, vol. 74, no. 4, pp. 745–771, 2012.
[73] M. N. McCall, B. M. Bolstad, and R. A. Irizarry, "Frozen robust multiarray analysis (fRMA)," Biostatistics, vol. 11, no. 2, pp. 242–253, 2010.
[74] J. Fan, Y. Liao, and M. Mincheva, "Large covariance estimation by thresholding principal orthogonal complements," Journal of the Royal Statistical Society: Series B, Statistical Methodology, vol. 75, no. 4, pp. 603–680, 2013.
[75] F. Han and H. Liu, "Principal component analysis on non-Gaussian dependent data," in International Conference on Machine Learning, vol. 30, 2013, pp. 240–248.
[76] W. Hoeffding, "Probability inequalities for sums of bounded random variables," Journal of the American Statistical Association, vol. 58, no. 301, pp. 13–30, 1963.
[77] M. H. Xu and H. Shao, "Solving the matrix nearness problem in the maximum norm by applying a projection and contraction method," Advances in Operations Research, vol. 2012, pp. 1–15, 2012.

Junwei Lu received his Ph.D. degree from the Department of Operations Research and Financial Engineering at Princeton University. He is currently an Assistant Professor of Biostatistics in the Harvard T.H. Chan School of Public Health. Dr. Lu's research interests include statistics, machine learning, and their applications to genomics and neuroscience.

Fang Han received his Ph.D. from the Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health. He is currently an Assistant Professor of Statistics at the University of Washington. Dr. Han's main methodological and theoretical interests are in rank-based statistics; nonparametric and semiparametric regressions; time series analysis; and random matrix theory.

Han Liu received his joint Ph.D. degree in Machine Learning and Statistics from the Machine Learning Department at Carnegie Mellon University. He is currently an Associate Professor in the Department of Electrical Engineering & Computer Science and the Department of Statistics at Northwestern University. Before joining Northwestern University, he directed the center of deep reinforcement learning at Tencent AI Lab and was a faculty member at Princeton University and Johns Hopkins University. His research direction is ubiquitous analytics, which deploys computational intelligence (AI) and computational trust (Blockchain) on edges and clouds to achieve analytical advantages.