Significance Testing for Canonical Correlation Analysis in High Dimensions
Ian W. McKeague (Columbia University) and Xin Zhang (Florida State University)

arXiv:2010.08673v1 [stat.ME] 17 Oct 2020

Abstract

We consider the problem of testing for the presence of linear relationships between large sets of random variables based on a post-selection inference approach to canonical correlation analysis. The challenge is to adjust for the selection of subsets of variables having linear combinations with maximal sample correlation. To this end, we construct a stabilized one-step estimator of the Euclidean norm of the canonical correlations maximized over subsets of variables of pre-specified cardinality. This estimator is shown to be consistent for its target parameter and asymptotically normal provided the dimensions of the variables do not grow too quickly with sample size. We also develop a greedy search algorithm to accurately compute the estimator, leading to a computationally tractable omnibus test for the global null hypothesis that there are no linear relationships between any subsets of variables having the pre-specified cardinality. Further, we develop a confidence interval for the target parameter that takes the variable selection into account.

Key words: Efficient one-step estimator; Greedy search algorithm; Large-scale testing; Pillai trace; Post-selection inference

1 Introduction

When exploring the relationships between two sets of variables measured on the same set of observations, canonical correlation analysis (CCA; Hotelling, 1936) sequentially extracts linear combinations with maximal sample correlation.
Specifically, with X ∈ ℝ^p and Y ∈ ℝ^q as two random vectors, the first step of CCA targets the parameter

    ρ = max_{α ∈ ℝ^q, β ∈ ℝ^p} corr(α^T Y, β^T X), subject to var(α^T Y) = 1 = var(β^T X).    (1)

Subsequent steps of CCA repeat this process subject to the constraint that the next linear combinations of X (and Y) are uncorrelated with earlier ones, giving a decreasing sequence of correlation coefficients. We are interested in testing whether the maximal canonical correlation coefficient ρ ≠ 0 versus the null hypothesis ρ = 0 in the high-dimensional setting in which p and q grow with sample size n. This is equivalent to testing whether all of the canonical correlation coefficients vanish, or whether their sum of squares τ² (known as the Pillai (1955) trace) vanishes.

Over the last dozen years, numerous sparse canonical correlation analysis (SCCA) methods (e.g., Witten et al., 2009; Hardoon and Shawe-Taylor, 2011; Gao et al., 2017; Mai and Zhang, 2019; Qadar and Seghouane, 2019; Shu et al., 2020) have been developed as extensions of classical CCA by adapting regularization approaches from regression, e.g., lasso (Tibshirani, 1996), elastic net (Zou and Hastie, 2005) and soft thresholding. SCCA methods have been widely applied to high-dimensional omics data to detect associations between gene expression and DNA copy number/polymorphisms/methylation, with the aim of revealing networks of co-expressed and co-regulated genes (Waaijenborg and Zwinderman, 2007; Waaijenborg et al., 2008; Naylor et al., 2010; Parkhomenko et al., 2009; Wang et al., 2015). A problem with the indiscriminate use of such methods, however, is selection bias, arising when the effects of variable selection on subsequent statistical analyses are ignored, i.e., failure to take into account "double dipping" of the data when assessing evidence of association.
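For concreteness, the sample analogue of (1) can be computed by a standard linear-algebra route (a classical fact, not specific to this paper): the sample canonical correlations are the singular values of the whitened cross-covariance matrix S_X^{-1/2} S_XY S_Y^{-1/2}, and the largest of them is the sample version of ρ. A minimal numpy sketch with illustrative simulated data (the dimensions and signal strengths are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: X in R^p and Y in R^q share one latent factor Z.
n, p, q = 500, 5, 4
Z = rng.standard_normal((n, 1))
X = 0.8 * Z + rng.standard_normal((n, p))
Y = 0.6 * Z + rng.standard_normal((n, q))

# Sample covariance blocks of the stacked vector (X, Y).
S = np.cov(np.hstack([X, Y]).T)
S_X, S_Y, S_XY = S[:p, :p], S[p:, p:], S[:p, p:]

def inv_sqrt(A):
    """Inverse symmetric square root of a positive-definite matrix."""
    w, V = np.linalg.eigh(A)
    return V @ np.diag(w ** -0.5) @ V.T

# Singular values of the whitened cross-covariance matrix are the sample
# canonical correlations; the largest one is the sample version of rho in (1).
C = inv_sqrt(S_X) @ S_XY @ inv_sqrt(S_Y)
rho_hat = np.linalg.svd(C, compute_uv=False)[0]
```

Since ρ maximizes the correlation over all linear combinations, its sample version is always at least the absolute sample correlation between any single pair of coordinates, and never exceeds one.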
Devising valid tests for associations in high-dimensional SCCA, along with confidence interval estimation for the strength of the association, poses a challenging post-selection inference problem. Nevertheless, some progress on this problem has been made. Yang and Pan (2015) proposed the sum of sample canonical correlation coefficients as a test statistic and established a valid calibration under the sparsity assumption that the number of non-zero canonical correlations is finite and fixed, with the dimensions p and q proportional to sample size. Their approach comes at the cost of assuming that X and Y are jointly Gaussian (and thus fully independent under the null); similar results for the maximal sample canonical correlation coefficient are developed in Bao et al. (2019). Zheng et al. (2019) developed a test for the presence of correlations among arbitrary components of a given high-dimensional random vector, for both sparse and dense alternatives, but their approach also requires an independent components structure.

In this paper, we provide valid post-selection inference for a new version of SCCA in high-dimensional settings. We obtain a computationally tractable and asymptotically valid confidence interval for τ_max, where τ_max is the maximum of the Pillai trace over all subvectors of X and Y having prespecified dimensions s_x and s_y, respectively. The method is fully nonparametric in the sense that no distributional assumptions or sparsity assumptions are required. Rather than adopting a penalization approach or making a sparsity assumption on the number of non-zero canonical correlations to regularize the problem, we use the sparsity levels s_x ≪ p and s_y ≪ q for regularization, and also for controlling the computational cost of searching over large collections of subvectors. We introduce a test statistic τ̂_max constructed as a stabilized and efficient one-step estimator of τ_max.
Then, assuming p and q do not grow too quickly with sample size, specifically that √(log(p + q)/n) → 0, we show that a studentized version of τ̂_max (after centering by τ_max) converges weakly to standard normal. This leads to a practical way of calibrating a formal omnibus test for the global null hypothesis (τ_max = 0) that there are no linear relationships between any subsets of variables having the pre-specified cardinality, along with an asymptotically valid Wald-type confidence interval for τ_max. The proposed approach applies to any choice of pre-specified sparsity levels s_x and s_y, which do not need to be the same as the true number of "active" variables in the population CCA, although they should be sufficiently large to capture the key associations. The test procedure and confidence interval for the target parameter τ_max are asymptotically valid for any pre-specified sparsity levels, and work well provided the sample cross-covariance matrices between subvectors of X and Y having dimensions s_x and s_y are sufficiently accurate.

Our approach is related to the type of post-selection inference procedure for marginal screening developed by McKeague and Qian (2015), which applies to the one-dimensional response case (q = 1 in the present notation). Extending this approach to the SCCA setting, in which both p and q can be large, requires a trade-off between computational tractability and statistical power. The calibration used in McKeague and Qian (2015) is a double-bootstrap technique, which is computationally expensive. To obtain a fast calibration method for SCCA, we adapt the sample-splitting stabilization technique of Luedtke and van der Laan (2018) to the SCCA setting, which provides calibration using a standard normal limit. Further, to control the computational complexity of searching through large collections of subvectors of X and Y when computing τ̂_max, we develop a greedy search algorithm related to that of Wiesel et al. (2008).
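To illustrate the target parameter concretely, the following Python sketch computes a sample plug-in version of τ_max by exhaustive search over all subvectors of pre-specified cardinalities (s_x, s_y). The simulated data, names, and cardinalities are our own illustrative choices, and this brute-force combinatorial scan is exactly what the greedy algorithm of Section 3 is designed to replace for large p and q:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, p, q = 300, 6, 5
sx, sy = 2, 2                      # pre-specified sparsity levels (our choice)

# Only the first coordinates of X and Y carry a shared signal.
Z = rng.standard_normal(n)
X = rng.standard_normal((n, p)); X[:, 0] += 0.9 * Z
Y = rng.standard_normal((n, q)); Y[:, 0] += 0.9 * Z

S = np.cov(np.hstack([X, Y]).T)
S_X, S_Y, S_XY = S[:p, :p], S[p:, p:], S[:p, p:]

def pillai(ix, iy):
    """Sample plug-in Pillai trace tr{H(H+E)^{-1}} for subvectors X[ix], Y[iy]."""
    Cxy = S_XY[np.ix_(ix, iy)]
    H = Cxy.T @ np.linalg.solve(S_X[np.ix_(ix, ix)], Cxy)
    return np.trace(H @ np.linalg.inv(S_Y[np.ix_(iy, iy)]))

# Exhaustive scan over all subsets of cardinality (sx, sy); its cost grows
# combinatorially in p and q, motivating the greedy search of Section 3.
tau_max_hat = max(
    np.sqrt(pillai(list(ix), list(iy)))
    for ix in combinations(range(p), sx)
    for iy in combinations(range(q), sy)
)
```

Because the Pillai trace is monotone under enlarging either subvector (the explained covariance of the least-squares projection can only grow), the maximized value over subsets is sandwiched between the value at any fixed subset and the value for the full vectors.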
The rest of the article is organized as follows. Section 2 introduces the population target parameter τ_max and develops its stabilized one-step estimator, taking the non-regularity of τ_max at the global null hypothesis into account; asymptotic results are given in Section 2.4. Section 3 proposes the greedy search algorithm to speed up the computation and provides a rationale based on submodularity. Sections 4 and 5 respectively contain a simulation study and a real data example using data collected under the Cancer Genome Atlas Program (Weinstein et al., 2013). Section 6 concludes the paper with a short discussion. The Appendix contains a derivation of the influence function of the Pillai trace, which plays a key role in its efficient estimation, and the proof of an identity involving increments of the Pillai trace used in the greedy search algorithm. The Supplementary Materials collect all additional technical details, numerical results and R code.

2 Test procedure

2.1 Preliminaries

Let Σ_X > 0 and Σ_Y > 0 denote the (invertible) covariance matrices of X and Y, with cross-covariance matrix Σ_XY and standardized cross-covariance matrix Λ_XY ≡ Σ_X^{-1/2} Σ_XY Σ_Y^{-1/2} ∈ ℝ^{p×q} (also known as the coherence matrix). The sample counterparts are denoted S_X, S_Y, S_XY and C_XY, respectively. The coherence matrix Λ_XY has min(p, q) singular values; when listed in decreasing order they coincide with the canonical correlation coefficients, and ρ defined in (1) is the largest. A closely related parameter in MANOVA is the Pillai trace τ² (Pillai, 1955), defined as the sum of squares of the canonical correlation coefficients, or equivalently

    τ² = ‖Λ_XY‖²_F = tr(Λ_XY^T Λ_XY) = tr{H(H + E)^{-1}},    (2)

where ‖·‖_F is the Frobenius norm, and H = Σ_YX Σ_X^{-1} Σ_XY and E = Σ_Y − H are population versions of covariance matrices in a linear model for predicting Y from X.
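The three expressions in (2) can be checked numerically by replacing the population covariance matrices with sample versions. A short numpy sketch (simulated data and dimensions are ours, and we assume n > p + q so the sample covariances are positive definite):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 400, 4, 3
X = rng.standard_normal((n, p))
Y = 0.5 * X[:, :q] + rng.standard_normal((n, q))   # induce dependence

S = np.cov(np.hstack([X, Y]).T)
S_X, S_Y, S_XY = S[:p, :p], S[p:, p:], S[:p, p:]

def inv_sqrt(A):
    """Inverse symmetric square root of a positive-definite matrix."""
    w, V = np.linalg.eigh(A)
    return V @ np.diag(w ** -0.5) @ V.T

# The three equivalent expressions in (2), with sample plug-ins:
Lam = inv_sqrt(S_X) @ S_XY @ inv_sqrt(S_Y)       # coherence matrix
tau2_frob  = np.linalg.norm(Lam, "fro") ** 2     # ||Lam||_F^2
tau2_trace = np.trace(Lam.T @ Lam)               # tr(Lam^T Lam)
H = S_XY.T @ np.linalg.solve(S_X, S_XY)          # Sigma_YX Sigma_X^{-1} Sigma_XY
tau2_he    = np.trace(H @ np.linalg.inv(S_Y))    # tr{H (H+E)^{-1}}, since H + E = S_Y
```

The singular values of Lam are the sample canonical correlations, so the same quantity also equals the sum of their squares, matching the definition of the Pillai trace.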
Specifically, H is the covariance matrix of the least-squares-predicted outcome in the linear model Y = A + BX + ε, where cov(ε) = E and ε is uncorrelated with X.

We will need some general concepts from semi-parametric efficiency theory. Suppose we observe a general random vector O ∼ P. Let L²₀(P) denote the Hilbert space of P-square-integrable functions with mean zero. Consider a smooth one-dimensional family of probability measures {P_t : t ∈ [0, 1]} with P₀ = P and having score function k ∈ L²₀(P) at t = 0. The tangent space T(P) is the L²₀(P)-closure of the linear span of all such score functions k. For example, if nothing is known about P, then P_t(do) = (1 + t k(o)) P(do) is such a submodel for any bounded function k with mean zero (provided t is sufficiently small), so T(P) is seen to be the whole of L²₀(P) in this case.
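The submodel P_t(do) = (1 + t k(o)) P(do) can be made concrete on a finite sample space: for any bounded k with E_P[k] = 0, the perturbed masses are positive and sum to one for small |t|, and the score of the submodel at t = 0 recovers k. A toy numerical check (our own illustration, not from the paper):

```python
import numpy as np

# A discrete distribution P on {0, 1, 2} and a bounded direction k,
# centered so that E_P[k] = 0 (needed for P_t to integrate to one).
P = np.array([0.2, 0.5, 0.3])
k = np.array([1.0, -0.8, 0.0])
k = k - P @ k

def P_t(t):
    """The one-dimensional submodel P_t(o) = (1 + t*k(o)) * P(o)."""
    return (1.0 + t * k) * P

# Score at t = 0: d/dt log p_t(o) = k(o)/(1 + t*k(o)) evaluated at t = 0,
# approximated here by a central difference in t.
eps = 1e-6
score = (np.log(P_t(eps)) - np.log(P_t(-eps))) / (2 * eps)
```

The same bookkeeping underlies the general case: any bounded mean-zero direction k yields a valid submodel, which is why the tangent space is all of L²₀(P) when nothing is known about P.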