Learning Gaussian Graphical Models with Latent Confounders ∗

Ke Wang,† Alexander Franks, Sang-Yun Oh
Department of Statistics and Applied Probability, University of California, Santa Barbara, Santa Barbara, CA, USA

Abstract

Gaussian graphical models (GGM) are widely used to estimate network structures in applications ranging from biology to finance. In practice, data are often corrupted by latent confounders, which bias inference of the underlying true graphical structure. In this paper, we compare and contrast two strategies for inference in graphical models with latent confounders: Gaussian graphical models with latent variables (LVGGM) and PCA-based removal of confounders (PCA+GGM). While these two approaches have similar goals, they are motivated by different assumptions about confounding. We explore the connection between these two approaches and propose a new method that combines the strengths of both. We prove consistency and a convergence rate for the PCA-based method and use these results to provide guidance about when to use each method. We demonstrate the effectiveness of our methodology using both simulations and two real-world applications.

∗The authors would like to thank Dr. Christos Thrampoulidis and Megan Elcheikhali for the very helpful discussion and comments on this paper. †[email protected]

1 Introduction

In many domains, it is necessary to characterize relationships between features using network models. For example, networks have been used to identify transcriptional patterns and regulatory relationships in genetic networks, and have been applied to characterize functional brain connectivity and cognitive disorders and to provide insights into neurodegenerative diseases like Alzheimer's (Fox and Raichle, 2007; Van Dijk et al., 2012; Barch et al., 2013; Price et al., 2014). One of the most common methods for inferring a network from observations is the Gaussian graphical model (GGM). A GGM is defined with respect to a graph, in which the nodes correspond to jointly Gaussian random variables and the edges correspond to the conditional dependencies among pairs of variables. A key property of GGMs is that the presence or absence of edges can be obtained from the precision matrix of the multivariate Gaussian random variables (Lauritzen, 1996). Similar to LASSO regression (Tibshirani, 1996), we can learn a GGM via sparse precision matrix estimation with l1-regularized maximum likelihood estimation. This family of approaches is called the graphical lasso (Glasso) (Friedman et al., 2008; Yuan and Lin, 2007).

In practice, however, network inference may be complicated due to the presence of latent confounders. For example, when characterizing relationships between the stock prices of publicly traded companies, the existence of overall market and sector factors induces extra correlation between stocks (Choi et al., 2011), which can obscure the underlying network structure between companies. Let Ω be the precision matrix encoding the graph structure of interest, and Σ = Ω^{-1}. When latent confounders are present, the covariance matrix for the observed data, Σobs, can be expressed as

Σobs = Σ + LΣ,   (1)

where the positive semidefinite matrix LΣ reflects the effect of latent confounders. Alternatively, by the Sherman-Morrison identity (Horn and Johnson, 2012), the observed precision matrix can be expressed as

Σobs^{-1} = Ω − LΩ,   (2)

where LΩ again reflects the effect of unobserved confounding. Importantly, if confounding is ignored, the inferred networks may be heavily biased because the observed precision matrix is no longer (necessarily) sparse.

Motivated by this problem, multiple approaches have been proposed to recover the observed-data graph encoded by Ω in the presence of confounding. In this paper, the goal is to generalize two seemingly different notions of confounding into a common framework for addressing the effect of LΣ in order to obtain the graph structure encoded in Ω = Σ^{-1}. We compare and contrast two methods for GGM inference with latent confounders and propose a generalization which combines these methods.

In order to control for the effects of confounding, some key assumptions are required. The first assumption is that Ω is sparse and LΩ, or equivalently LΣ, is low rank. The low-rank assumption is equivalent to assuming that the number of confounding variables is small relative to the number of observed variables. As such, these methods are often referred to as "sparse plus low rank" methods (Chandrasekaran et al., 2011). We focus on two prominent approaches for sparse plus low-rank inference, which require different assumptions for identifiability of Ω.

One common approach, known as latent variable Gaussian graphical models (LVGGM), uses the parameterization in (2) and involves joint inference for Ω and LΩ. In this approach, the focus is on the effect of unobserved variables in the complete graph, which affect the partial correlations of the variables in the observed graph Ω. This perspective can be particularly useful when the unobserved variables would have been included in the graph, had they been observed. In LVGGM, in order for Ω to be identifiable, LΩ must be sufficiently dense (Chandrasekaran et al., 2011).

An alternative approach uses principal component analysis (PCA) as a preprocessing step to remove the effect of confounders. The focus of this approach is on how confounders might affect the marginal correlations between observed variables. This approach uses the parameterization in (1) and involves a first stage in which the effect of LΣ is removed by subtracting the leading eigencomponents of Σobs, and a second stage of standard GGM inference (Parsana et al., 2019). We call this PCA-based approach

PCA+GGM. For this approach to be effective, the norm of LΣ (the noise) must be large relative to the norm of Σ (the signal). PCA+GGM has been shown to be especially useful for estimating gene co-expression networks, where correlated measurement noise and batch effects induce large extraneous marginal correlations between observed variables (Geng et al., 2018; Leek and Storey, 2007; Stegle et al., 2011; Gagnon-Bartsch et al., 2013; Freytag et al., 2015; Jacob et al., 2016). In contrast to Chandrasekaran et al. (2012), the confounders in Parsana et al. (2019) are usually thought of as nuisance variables and would not be included in the complete graph, even if they had been observed.

In practice, it is quite possible that the data are affected by both sources of confounding, e.g. measurement batch effects as well as unobserved nodes in the complete graph of interest. Moreover, the presumed type of confounding will affect which approach is most appropriate. In this paper, our goal is to explore proper ways to address the effect of confounders in order to obtain the graph structure encoded in Ω. We relate and compare two methods, LVGGM and PCA+GGM, then propose a new method, PCA+LVGGM, that combines the two and aims to address confounding from multiple sources. The combined approach is more general, since PCA+LVGGM contains both LVGGM and PCA+GGM as special cases. We demonstrate the efficacy of PCA+LVGGM in reconstructing the underlying network structure from gene co-expression data and stock return data in Section 5. In summary, our contributions are:

• We carefully compare PCA+GGM and LVGGM, and illustrate the connections and differences between these two methods. We provide a non-asymptotic convergence result for the PCA+GGM procedure proposed by Parsana et al. (2019) and show that the convergence rate of PCA+GGM is of the same order as that of LVGGM. We demonstrate both theoretically and empirically that PCA+GGM works particularly well when the norm of the extraneous low-rank component is large compared to that of the signal.

• We propose PCA+LVGGM, which combines elements of PCA+GGM and LVGGM. We show that PCA+LVGGM can outperform PCA+GGM and LVGGM used individually when the data is corrupted by multiple sources of confounders with certain structure. We perform extensive simulations to validate the theory, compare the performance of the three methods, and demonstrate the utility of our approach in two applications.

The remainder of this paper is organized as follows: In Section 2, we introduce the basic setup for GGM, LVGGM and PCA+GGM, followed by a brief literature review, and then describe our method, PCA+LVGGM. In Section 3, we present theoretical results for PCA+GGM and use them to analyze the similarities and differences between LVGGM and PCA+GGM. In Section 4, we demonstrate the utility of the various approaches in simulations. Finally, in Section 5 we apply the methods to two real-world data sets.

Notation: For a vector v = [v1, ..., vp]^T, define ‖v‖2 = √(∑_{i=1}^p vi²), ‖v‖1 = ∑_{i=1}^p |vi| and ‖v‖∞ = max_i |vi|. For a matrix M, let Mij be its (i, j)-th entry. Define the Frobenius norm ‖M‖F = √(∑_i ∑_j Mij²), the element-wise ℓ1-norm ‖M‖1 = ∑_i ∑_j |Mij|, and ‖M‖∞ = max_{(i,j)} |Mij|. We also define the spectral norm ‖M‖2 = sup_{‖v‖2 ≤ 1} ‖Mv‖2 and ‖M‖L1 = max_j ∑_i |Mij|. The nuclear norm ‖M‖∗ is defined as the sum of the singular values of M. When M ∈ R^{p×p} is symmetric, its eigendecomposition is M = ∑_{i=1}^p λi vi vi^T, where λi is the i-th eigenvalue of M and vi is the i-th eigenvector. We assume that λ1 ≥ λ2 ≥ ... ≥ λp. We call λi vi vi^T the i-th eigencomponent of M.

2 Problem Setup and Methods Review

2.1 Gaussian Graphical Models

Consider a p-dimensional random vector X = (X1, ..., Xp)^T with covariance matrix Σ and precision matrix Ω. Let G = (V, E) be the graph associated with X, where V is the set of nodes corresponding to the elements of X, and E is the set of edges connecting nodes. The graph encodes the relations between elements of X. For any pair of unconnected nodes, the corresponding

pair of variables in X is conditionally independent given the remaining variables, i.e., Xi ⊥ Xj | X∖{i,j} for all (i, j) ∉ E. If X is multivariate Gaussian, then Xi and Xj are conditionally independent given the other variables if and only if Ωij = 0, and thus the graph structure can be recovered from the precision matrix of X. Without loss of generality, we assume that X has zero mean throughout Section 2. Assuming that the graph is sparse, given a random sample {X^{(1)}, ..., X^{(n)}} following the distribution of X, the Glasso estimate Ω̂_Glasso (Yuan and Lin, 2007; Friedman et al., 2008) is obtained by solving the following ℓ1-regularized, log-likelihood-based objective:

minimize over Ω ≻ 0:   Tr(ΩΣn) − log det(Ω) + λ‖Ω‖1.   (3)

Here Tr denotes the trace of a matrix and Σn = (1/n) ∑_{k=1}^n X^{(k)} X^{(k)T} is the sample covariance matrix. Many alternative objective functions for sparse precision matrix estimation have been proposed (Cai et al., 2011; Meinshausen et al., 2006; Peng et al., 2009; Khare et al., 2015). The behavior and convergence rates of these approaches are well studied (Bickel et al., 2008b,a; Rothman et al., 2008; Ravikumar et al., 2011; Cai et al., 2016). In the presence of latent confounders, Glasso and other GGM methods would likely recover a denser precision matrix owing to spurious partial correlations introduced between observed variables. In other words, even when the underlying graph is sparse conditioned on the latent variables, the observed graph is dense marginally.
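For concreteness, the following R sketch fits a sparse precision matrix by solving (3); it assumes the CRAN packages huge (to simulate data from a sparse graphical model) and glasso (for the ℓ1-penalized estimator), and the penalty value 0.1 is an arbitrary illustrative choice, not a recommendation from the paper.

```r
# Minimal sketch of Glasso estimation (3), assuming the 'huge' and 'glasso'
# CRAN packages are installed; the penalty value is illustrative only.
library(huge)
library(glasso)

set.seed(1)
sim <- huge.generator(n = 200, d = 50, graph = "scale-free")  # data from a sparse GGM
S_n <- cov(sim$data)                                          # sample covariance Sigma_n

fit <- glasso(S_n, rho = 0.1)        # rho plays the role of lambda in (3)
Omega_hat <- fit$wi                  # estimated precision matrix
edges_hat <- abs(Omega_hat) > 1e-6   # estimated edge set (off-diagonal support)
diag(edges_hat) <- FALSE
```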

2.2 Latent Variable Gaussian Graphical Models

One method for controlling for the effects of confounders is the latent variable Gaussian graphical model (LVGGM) approach first proposed by Chandrasekaran et al. (2012). They assume that the number of latent factors is small compared to the number of observed variables, and that the conditional graph is sparse. Consider a (p + r)-dimensional, mean-zero normal random vector X = (XO, XH)^T, where XO ∈ R^p is observed and XH ∈ R^r is latent. Let X have precision matrix Ω ∈ R^{(p+r)×(p+r)}, with submatrices ΩO ∈ R^{p×p}, ΩH ∈ R^{r×r} and ΩO,H ∈ R^{p×r} specifying the dependencies between observed variables, between latent variables, and between the observed and latent variables, respectively. By the Schur complement, the inverse of the observed covariance matrix satisfies

Ωobs = Σobs^{-1} = ΩO − ΩO,H ΩH^{-1} ΩO,H^T = Ω − LΩ,   (4)

where Ω = ΩO encodes the conditional independence relations of interest and is sparse by assumption, and LΩ = ΩO,H ΩH^{-1} ΩO,H^T reflects the low-rank effect of the latent variables XH. Based on this sparse plus low-rank decomposition, Chandrasekaran et al. (2012) proposed the following optimization problem:

minimize over Ω, LΩ:   −ℓ(Ω − LΩ; Σn) + λ‖Ω‖1 + γ‖LΩ‖∗
subject to:   LΩ ⪰ 0,   Ω − LΩ ≻ 0,   (5)

where Σn is the observed sample covariance matrix and ℓ(Ω; Σ) = log det(Ω) − Tr(ΩΣ) is the Gaussian log-likelihood. The ℓ1-norm encourages sparsity in Ω and the nuclear norm encourages low-rank structure in LΩ.
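The sparse plus low-rank structure in (4) is easy to verify numerically. The R sketch below builds a random joint precision matrix over observed and hidden blocks, inverts it, and checks that the precision of the observed block alone equals the Schur complement ΩO − ΩO,H ΩH^{-1} ΩO,H^T; the dimensions and the way the joint precision is generated are illustrative choices only.

```r
# Numerical check of the Schur-complement identity (4).
set.seed(2)
p <- 8; r <- 2                               # observed and hidden dimensions (illustrative)
A <- matrix(rnorm((p + r)^2), p + r)
Omega_full <- crossprod(A) + diag(p + r)     # a random positive-definite joint precision

Omega_O  <- Omega_full[1:p, 1:p]
Omega_H  <- Omega_full[(p + 1):(p + r), (p + 1):(p + r)]
Omega_OH <- Omega_full[1:p, (p + 1):(p + r)]

Sigma_obs <- solve(Omega_full)[1:p, 1:p]     # marginal covariance of the observed block
lhs <- solve(Sigma_obs)                      # observed-data precision Omega_obs
rhs <- Omega_O - Omega_OH %*% solve(Omega_H) %*% t(Omega_OH)
max(abs(lhs - rhs))                          # ~1e-12, confirming (4)
```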

The sparse plus low-rank decomposition is ill-posed if LΩ is not dense, because it is then indistinguishable from Ω. Identifiability of Ω coincides with the incoherence condition in the matrix completion problem (Candès and Recht, 2009), which requires that |uk^T ei| is small for all k ∈ {1, ..., r} and i ∈ {1, ..., p}, where uk is the k-th eigenvector of LΩ and ei is the i-th canonical basis vector. When the eigenvectors of LΩ are sufficiently dissimilar to the canonical vectors, the error bound¹ on ‖Ω − Ω̂‖∞ is of order O(√(p/n)). More analysis of LVGGM can be found in Agarwal et al. (2012) and Meng et al. (2014).

¹ Actually, in Chandrasekaran et al. (2012), the bound is for max{‖Ω − Ω̂‖∞, ‖LΩ − L̂Ω‖2}. Since Ω and LΩ are estimated jointly, we have to use that bound even if we want to bound ‖Ω − Ω̂‖∞ individually.

Procedure 1 PCA+GGM
Input: Sample covariance matrix Σ̂obs; rank of LΣ, r
Output: Precision matrix estimate Ω̂
1: Estimate L̂Σ from the eigencomponents of Σ̂obs = ∑_{i=1}^p λ̂i θ̂i θ̂i^T:  L̂Σ = ∑_{i=1}^r λ̂i θ̂i θ̂i^T
2: Remove L̂Σ:  Σ̂ = Σ̂obs − L̂Σ
3: Using Σ̂, compute Ω̂ as the solution to (3)

Procedure 2 PCA+LVGGM
Input: Sample covariance matrix Σ̂obs; rank of LΣ, rP; rank of LΩ, rL
Output: Precision matrix estimate Ω̂
1: Estimate L̂Σ from the eigencomponents of Σ̂obs = ∑_{i=1}^p λ̂i θ̂i θ̂i^T:  L̂Σ = ∑_{i=1}^{rP} λ̂i θ̂i θ̂i^T
2: Remove L̂Σ:  Σ̂ = Σ̂obs − L̂Σ
3: Using Σ̂, compute Ω̂ as the solution to (5) with γ chosen such that rank(L̂Ω) = rL
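A minimal R sketch of Procedure 1 is given below; it assumes the glasso package for step 3 and that the rank r and penalty λ have already been chosen (see Section 3.5).

```r
# Minimal sketch of Procedure 1 (PCA+GGM).
library(glasso)

pca_ggm <- function(S_obs, r, lambda) {
  eig <- eigen(S_obs, symmetric = TRUE)
  # Step 1: estimate L_Sigma from the top-r eigencomponents of S_obs.
  L_Sigma_hat <- eig$vectors[, 1:r, drop = FALSE] %*%
                 (eig$values[1:r] * t(eig$vectors[, 1:r, drop = FALSE]))
  # Step 2: remove L_Sigma (the remainder is PSD but rank-deficient).
  S_hat <- S_obs - L_Sigma_hat
  # Step 3: compute the Glasso estimate (3) with S_hat as input.
  glasso(S_hat, rho = lambda)$wi
}
```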

Finally, Ren and Zhou (2012) show that standard GGM approaches can still recover ΩO in the presence of latent confounding as long as the norm of the low-rank component is sufficiently small. In our simulations, we observe that LVGGM significantly outperforms Glasso only when the low-rank component has a large enough norm compared to the signal. If the magnitude of LΩ is large enough, however, a more direct approach that removes high-variance components can be more appropriate. This partly motivates the PCA+GGM method described in the next section.

2.3 PCA+GGM Approach

Unlike LVGGM, which involves a decomposition of the observed-data precision matrix, PCA+GGM involves a decomposition of the observed-data covariance matrix:

Σobs = Ωobs^{-1} = (Ω − LΩ)^{-1} = Ω^{-1} + LΣ.   (6)

Motivated by confounding from measurement error and batch effects, Parsana et al. (2019) proposed the principal components correction (PC-correction) for removing LΣ. Consider observed data Xobs such that

Xobs = X + AZ,   (7)

where X ∼ N(0, Σ) and Z ∼ N(0, Ir). The matrix A ∈ R^{p×r} is non-random, so that LΣ = AA^T. In general, additional structural assumptions are needed to distinguish LΣ from Σ. If the norm of LΣ is large relative to that of Σ, then under mild conditions, LΣ is close to the sum of the first few eigencomponents of Σobs. Parsana et al. (2019) thus proposed removing the first r eigencomponents from Σobs. The PCA+GGM method is described in Procedure 1. We demonstrate the effectiveness of Procedure 1 both theoretically and in simulations in Sections 3 and 4. The number of principal components needs to be determined a priori; we discuss rank selection in Section 3.5.

2.4 PCA+LVGGM Strategies

Although both LVGGM and PCA+GGM are methods for solving the same problem, they have different motivations. As described in the previous sections, identifiability in LVGGM requires that LΩ is sufficiently dense, whereas identifiability in PCA+GGM requires that the norm of LΣ dominates the norm of Σ. In many cases, it is likely that the data is corrupted by multiple sources of confounding. For example, in the biological application discussed in Section 5.1, both batch effects and unmeasured biological variables likely confound estimates of graph structure between observed variables.

As (4) illustrates, the precision matrix Ω′ of the data may have been corrupted by a latent factor LΩ:

Ω′ = Ω − LΩ.   (8)

Now, rewriting (8) in terms of Σ = Ω^{-1} and Σ′ = (Ω′)^{-1}, applying the Sherman-Morrison identity to Ω′ gives

Σ′ = Σ + L′Ω,   (9)

where L′Ω is still a low-rank matrix. If Σ′ is further corrupted by an additive noise factor represented by LΣ, the following equation describes the observed covariance matrix Σobs:

Σobs = Σ′ + LΣ = Σ + L′Ω + LΣ.   (10)

In the above example, if the norm of LΣ is much larger than those of Σ and L′Ω, then removing LΣ using PC-correction is likely to be effective. If the norm of L′Ω is not much larger than that of Σ, then Ω and L′Ω can be well estimated by LVGGM. Since in (10) the overall confounding L′Ω + LΣ is the sum of two low-rank components with different norms, we can consider using both methods: first remove LΣ via eigendecomposition, then apply LVGGM to estimate Ω and LΩ. We call this procedure PCA+LVGGM; it is shown in Procedure 2. We discuss how to determine the ranks rP and rL in Section 3.5.
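The two-stage structure of Procedure 2 can be sketched analogously. Here lvggm_fit() is a hypothetical solver for problem (5) (returning the sparse estimate Ω̂ and low-rank estimate L̂Ω); the paper does not prescribe a particular implementation, so only the pipeline is illustrated.

```r
# Sketch of Procedure 2 (PCA+LVGGM); 'lvggm_fit' is a hypothetical solver of (5).
pca_lvggm <- function(S_obs, r_P, lambda, gamma) {
  eig <- eigen(S_obs, symmetric = TRUE)
  # Stage 1: PC-correction with rank r_P, removing high-variance confounding L_Sigma.
  L_Sigma_hat <- eig$vectors[, 1:r_P, drop = FALSE] %*%
                 (eig$values[1:r_P] * t(eig$vectors[, 1:r_P, drop = FALSE]))
  S_hat <- S_obs - L_Sigma_hat
  # Stage 2: LVGGM on the corrected covariance, estimating Omega and the
  # remaining moderate-variance low-rank confounding L_Omega.
  lvggm_fit(S_hat, lambda = lambda, gamma = gamma)   # hypothetical solver of (5)
}
```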

3 Theoretical Analysis and Model Comparisons

In this section, we investigate the theoretical properties of PCA+GGM. We derive the convergence rate of PCA+GGM and compare it to that of LVGGM. As shown in the theoretical analysis of Parsana et al. (2019), the low-rank confounder can be well estimated by PC-correction when the number of features p → ∞ with the number of observations n fixed. We provide a non-asymptotic analysis depending on both p and n, and our result shows that the graph can be recovered exactly when n → ∞ with p fixed.

3.1 Convergence Analysis on PCA+GGM

Without loss of generality, we consider the case of a rank-one confounder. Assume that we have a random sample of p-dimensional random vectors:

Xobs^{(i)} = X^{(i)} + σ v Z^{(i)},   i = 1, ..., n,   (11)

where Cov(X^{(i)}) = Σ and Z^{(i)} is a univariate standard normal random variable, v ∈ R^p is a non-random vector with unit norm, and σ is a non-negative scalar constant. Without loss of generality, we assume that X^{(i)} is independent of Z^{(i)}. To see how v affects estimation, we assume that v is the k-th eigenvector of Σ. Therefore, the covariance matrix of Xobs^{(i)} is:

Σobs = Σ + σ² vv^T = Σ−k + (λk(Σ) + σ²) vv^T,   (12)

where Σ−k is the matrix Σ without its k-th eigencomponent and λk(Σ) is the k-th eigenvalue of Σ. When σ² > λ1(Σ), λk(Σ) + σ² becomes the first eigenvalue of Σobs, and v is the corresponding first eigenvector. We remove the first principal component from the sample covariance matrix Σ̂obs:

Σ̂ = Σ̂obs − λ̂1 θ̂1 θ̂1^T,   (13)

where λ̂1 is the first eigenvalue of Σ̂obs and θ̂1 is the first eigenvector of Σ̂obs. Then we use Σ̂ to estimate Ω. We first show that, under mild conditions, Σ̂ is close to Σ. Following Bickel et al. (2008b, Section 3.1), we assume that there exists a constant M such that

λ1(Σobs) ≤ M   and   λp(Σobs) ≥ 1/M.   (14)
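Before stating the result, a quick numerical check of the construction in (12)-(13), under illustrative choices of Σ, k and σ² (not values from the paper), confirms that subtracting the top eigencomponent of Σobs leaves exactly Σ−k when σ² > λ1(Σ):

```r
# Population-level illustration of (12)-(13): when sigma^2 > lambda_1(Sigma),
# the top eigencomponent of Sigma_obs is (lambda_k(Sigma) + sigma^2) v v^T,
# so removing it leaves Sigma with its k-th eigencomponent deleted.
set.seed(3)
p <- 20
B <- matrix(rnorm(p^2), p)
Sigma <- crossprod(B) / p + diag(p)          # an arbitrary covariance matrix
eS    <- eigen(Sigma, symmetric = TRUE)

k <- 15
sigma2 <- 10 * eS$values[1]                  # confounder variance exceeding lambda_1(Sigma)
v <- eS$vectors[, k]                         # confounder direction: k-th eigenvector of Sigma
Sigma_obs <- Sigma + sigma2 * tcrossprod(v)

eO <- eigen(Sigma_obs, symmetric = TRUE)
Sigma_corrected <- Sigma_obs - eO$values[1] * tcrossprod(eO$vectors[, 1])

Sigma_minus_k <- Sigma - eS$values[k] * tcrossprod(v)   # Sigma without its k-th eigencomponent
max(abs(Sigma_corrected - Sigma_minus_k))               # ~1e-12
max(abs(Sigma_corrected - Sigma))                       # = lambda_k(Sigma) * max|v v^T|
```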

Theorem 1. Let λi be the i-th eigenvalue of Σobs and ν = λ1 − λ2 be the eigengap of Σobs. Suppose Σobs satisfies condition (14) and Xobs^{(i)} is generated as in (11). Further assume that σ² > λ1(Σ). Suppose n ≥ p and ‖Σ‖2 √((ν+1)/ν²) √(p/n) ≤ 1/128; then

‖Σ̂ − Σ‖∞ ≤ C1 √(log p / n) + C2 √(p/n) + C3 √((ν+1)/ν²) √(p/n) + λk(Σ) ‖θ1 θ1^T‖∞,

with probability greater than 1 − C4 p^{-6} − C5 e^{-p}. The constants C1 and C2 depend on M, C3 depends on λ1(Σ), and C4, C5 > 0.

Proof. By (12) and (13),

Σ̂ − Σ = (Σ̂obs − λ̂1 θ̂1 θ̂1^T) − (Σobs − λ1 θ1 θ1^T + λk(Σ) θ1 θ1^T)
       = (Σ̂obs − Σobs) + (λ1 θ1 θ1^T − λ̂1 θ̂1 θ̂1^T) − λk(Σ) θ1 θ1^T,

where λ1 and θ1 are the first eigenvalue and eigenvector of Σobs and λk(Σ) is the k-th eigenvalue of Σ. At a high level, we bound ‖Σ̂ − Σ‖∞ by bounding the norms of Σ̂obs − Σobs, λ̂1 − λ1 and θ̂1 − θ1. The details of the complete proof are in Appendix A.1.

The bound in Theorem 1 can be further simplified to Cs √(p/n) + λk(Σ) ‖θ1 θ1^T‖∞ for some large constant Cs. We express it in the above form because it provides more insight into how each term affects the result. After obtaining Σ̂, we can use Glasso, CLIME (Cai et al., 2011) or any sparse GGM estimation approach to estimate Ω. We can obtain a good estimate of Ω when ‖Σ̂ − Σ‖∞ is not too large. With the same input Σ̂, the theoretical convergence rate of the estimate obtained from CLIME is of the same order as that of the Glasso estimate. The derivation of the error bound for Glasso requires the irrepresentability condition and restricted eigenvalue conditions (see Ravikumar et al. (2011)). Due to the length of the article, we only show the proof of edge selection consistency for CLIME, meaning that for the theoretical analysis, we apply the CLIME method after obtaining Σ̂. The CLIME estimator Ω̂1 is obtained by solving:

minimize over Ω:   ‖Ω‖1   subject to   ‖Σ̂Ω − I‖∞ ≤ λn.   (15)

Since Ω̂1 might not be symmetric, we need a symmetrization step to obtain Ω̂. Following Cai et al. (2011), we assume that Ω is in the following class:

U(s0, M0) = {Ω = (ωij) : Ω ≻ 0, ‖Ω‖L1 < M0, max_{1≤i≤p} ∑_{j=1}^p I{ωij ≠ 0} ≤ s0},   (16)

where we allow s0 and M0 to grow as p and n increase. With Σ̂ obtained from equation (13) as the input to (15), we have the following result.

Theorem 2. Suppose Ω ∈ U(s0, M0), and λn is chosen as M0 (C1 √(log p / n) + C2 √(p/n) + C3 √((ν+1)/ν²) √(p/n) + λk(Σ) ‖θ1 θ1^T‖∞); then

‖Ω − Ω̂‖∞ ≤ 2 M0 λn,

with probability greater than 1 − C4 p^{-6} − C5 e^{-p}. The Ci are defined as in Theorem 1.

Proof. The main steps follow the proof of Theorem 6 in Cai et al. (2011). The complete proof is in Appendix A.2.

Therefore, if the minimum magnitude of the nonzero entries of Ω is larger than the error bound 2M0λn, then we have exact edge selection with high probability.

3.2 Analysis of Theoretical Results

The error bound in Theorem 2 depends on the largest eigenvalue of Σobs, the eigengap ν = λ1(Σobs) − λ2(Σobs), the eigenvector of the confounder, and n and p. The term √((ν+1)/ν²) shows that if the eigengap ν is larger, the bound will be smaller. Recall that when σ² > λ1(Σ), λk(Σ) + σ² becomes the first eigenvalue of Σobs. Hence if σ² is large compared to λ1(Σ), then the eigengap is large. This observation (i.e., a larger eigengap leads to a better convergence rate) is closely related to the concept of effective dimension. We have the following trace constraint for any positive semidefinite matrix M ∈ R^{p×p}:

Tr(M)/λ1(M) = ∑_{i=1}^p λi(M)/λ1(M) ≤ C,   (17)

where C ≥ 1 can be viewed as the effective dimension of M (Meng et al. (2014); Wainwright (2019)). If the matrix is approximately low-rank, i.e., the first eigenvalue is much larger than the rest, then C is much smaller than the observed dimension p; hence the p appearing in the error bound O(√(p/n)) can be reduced significantly to the effective dimension C in (17).

Now we look at the term λk(Σ) ‖θ1 θ1^T‖∞ in Theorem 1. First, it depends on λk(Σ), meaning that if the eigenvector of the low-rank component is closely aligned with the first few eigenvectors of Σ, then the bound is larger. This result shows that the first few eigencomponents play a more important role in determining the structure of Σ and its inverse. Thus we may obtain a poor estimate if those first few eigencomponents are removed by PC-correction.

Note that ‖θ1 θ1^T‖∞ is upper bounded by 1, since θ1 is an eigenvector of some matrix and thus has unit Euclidean norm; however, ‖θ1 θ1^T‖∞ can be much smaller than 1 when θ1 is dense. One extreme case is when all the elements of θ1 equal 1/√p, in which case ‖θ1 θ1^T‖∞ can be as small as 1/p. This setup corresponds to a scenario in which the confounder has a widespread effect over all p variables in the signal, which is in accordance with one requirement of LVGGM: LVGGM requires the low-rank component to be dense. Therefore, we have a consistent observation for both approaches, i.e., more "widespread" confounding can be addressed more easily.

The previous analysis assumes that the low-rank confounder has rank 1, is independent of X, and that the eigenvector of the covariance of the low-rank confounder is one of the eigenvectors of Σ. More detail about the general settings can be found in Appendix B.
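Both quantities appearing in this discussion are easy to compute directly; the short sketch below contrasts ‖θ1θ1^T‖∞ for a dense ("widespread") versus a sparse confounding direction and evaluates the effective dimension (17) for an approximately rank-one matrix. The specific numbers are illustrative choices, not values from the paper.

```r
# Infinity norm of theta_1 theta_1^T for dense vs. sparse eigenvectors,
# and the effective dimension (17) of an approximately rank-one matrix.
p <- 100

theta_dense  <- rep(1 / sqrt(p), p)            # widespread confounder
theta_sparse <- c(1, rep(0, p - 1))            # confounder concentrated on one variable
max(abs(tcrossprod(theta_dense)))              # 1/p = 0.01
max(abs(tcrossprod(theta_sparse)))             # 1

M <- 10 * tcrossprod(theta_dense) + diag(0.1, p)       # approximately rank one
sum(diag(M)) / eigen(M, symmetric = TRUE)$values[1]    # effective dimension ~2, far below p
```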

3.3 Comparison with LVGGM

Now we compare LVGGM to PCA+GGM in more detail. The convergence rate of PCA+GGM is of order O(√(p/n)), the same as that of LVGGM. We know that √(p/n) is the optimal rate for estimating the eigenvectors without further structural assumptions (Wainwright, 2019, Chapter 8); hence our result achieves the optimal rate.

We also observe that PCA+GGM can be viewed as a supplement to LVGGM: the assumptions of PCA+GGM can be well satisfied when the assumptions of LVGGM cannot be. In (12), now let v be the k-th eigenvector of Ω (thus the (p − k + 1)-th eigenvector of Σ); the Sherman-Morrison identity gives

Σobs^{-1} = Ω − [λk(Ω)² / (λk(Ω) + 1/σ²)] vv^T = Ω − LΩ.   (18)

We can see that as σ² increases, λk(Ω)² / (λk(Ω) + 1/σ²) increases. In later simulations, we observe that LVGGM can hardly produce a good estimate when v is closely aligned with the first few eigenvectors of Ω (and therefore the last few eigenvectors of Σ). This observation is consistent with the conclusion in Agarwal et al. (2012). They impose a spikiness condition, which is a weaker condition than the incoherence condition in Chandrasekaran et al. (2012). The spikiness condition requires that ‖LΩ‖∞ is not too large. Equation (18) shows that LΩ tends to have a larger norm when v is aligned with the first few eigenvectors of Ω and σ² is large, since in this case λk(Ω)² / (λk(Ω) + 1/σ²) is close to λ1(Ω). The large norm of LΩ implies that the spikiness condition is not well satisfied, and thus the error bound of LVGGM is large. Note, however, that the first few eigenvectors of Ω are the last few eigenvectors of Σ. Our analysis shows that the error bound of the PCA+GGM estimate is small when v is aligned with the first few eigenvectors of Ω and σ² is large.
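For completeness, (18) follows from a single application of the Sherman-Morrison formula, using the assumption that v is the k-th eigenvector of Ω (so that Ωv = λk(Ω)v):

\[
\Sigma_{\text{obs}}^{-1} = \left(\Sigma + \sigma^2 v v^\top\right)^{-1}
= \Omega - \frac{\sigma^2\, \Omega v v^\top \Omega}{1 + \sigma^2 v^\top \Omega v}
= \Omega - \frac{\sigma^2 \lambda_k(\Omega)^2}{1 + \sigma^2 \lambda_k(\Omega)}\, v v^\top
= \Omega - \frac{\lambda_k(\Omega)^2}{\lambda_k(\Omega) + 1/\sigma^2}\, v v^\top.
\]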

3.4 PCA+LVGGM

In this section we discuss the PCA+LVGGM method briefly. We use the same formulation as in (8) to (10). We claim that PCA+LVGGM outperforms PCA+GGM or LVGGM used individually when LΣ's norm is large compared to those of L′Ω and Σ, LΣ's eigenvectors are not aligned with the first few eigenvectors of Σ, and the norm of L′Ω is not significantly larger than that of Σ. This is because, based on Theorems 1 and 2, PCA+GGM is effective only when the norm of the low-rank confounding is larger than that of the signal. PC-correction, however, can only effectively remove LΣ and not L′Ω, because the norm of L′Ω is not significantly larger than that of Σ. In contrast, LVGGM can estimate L′Ω well, but not LΣ, because the latter has a larger norm and its eigenvectors might be aligned with the first few eigenvectors of Ω.

3.5 Tuning Parameter Selection

In both LVGGM and PCA+GGM there are crucial tuning parameters to select. For LVGGM, recall that λ controls the sparsity of Ω and γ controls the rank of LΩ. Chandrasekaran et al. (2012) argue that λ should be proportional to √(p/n), the rate in the convergence analysis, and choose γ among a range of values that keeps the graph structure of Ω̂ stable (see Chandrasekaran et al., 2011, for more detail). When using PCA+GGM, we need to determine the rank first (i.e., how many principal components should be removed). Leek and Storey (2007) and Leek et al. (2012) suggest using the sva function from Bioconductor, which is based on parallel analysis (Horn, 1965; Buja and Eyuboglu, 1992; Lim and Jahng, 2019). Parallel analysis compares the eigenvalues of the sample correlation matrix to the eigenvalues of a random correlation matrix for which no factors are assumed. Given the number of principal components to remove, we can use tools such as AIC, BIC or cross-validation to choose the sparsity parameter in Glasso. One may also decide how many principal components to remove by considering the eigenvalues of the observed covariance matrix (see Sections 3.1 and 3.2).
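Parallel analysis itself is simple to sketch: retain the components whose sample-correlation eigenvalues exceed those obtained from data with no factor structure, here generated by permuting each column independently. The sva package automates a variant of this idea; the snippet below is a simplified illustration, not the exact procedure used by sva.

```r
# Simplified parallel analysis: count components whose correlation-matrix
# eigenvalues exceed a permutation-based null threshold.
parallel_analysis <- function(X, n_perm = 20, q = 0.95) {
  obs_eval <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values
  null_eval <- replicate(n_perm, {
    X_perm <- apply(X, 2, sample)             # destroy between-column dependence
    eigen(cor(X_perm), symmetric = TRUE, only.values = TRUE)$values
  })
  threshold <- apply(null_eval, 1, quantile, probs = q)
  sum(obs_eval > threshold)                   # suggested number of components to remove
}
```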

The PCA+LVGGM method has three tuning parameters: the rank of LΣ, γ, and λ. To start, we first look at the eigenvalues or use the sva package to determine the total rank of the low-rank component, LΣ + L′Ω. We then need to partition the total rank between LΣ and L′Ω. If we determine that rank(LΣ + L′Ω) = k, we look for an eigengap in the first k eigenvalues and allocate the largest m < k eigenvalues to PC-removal. We will see in later sections that in many applications, domain knowledge can be used to motivate the number of components for PC-removal. After removing the principal components, we choose γ in LVGGM so that L′Ω is approximately of rank k − m. We observe that when running LVGGM, the rank does not change over a range of λ values when using a fixed γ. Thus, it suffices to fix γ first to control the rank, then determine λ to control the sparsity.

Practically, network estimation is often used to aid exploratory data analysis and hypothesis generation. For these purposes, model selection methods such as AIC, BIC or cross-validation may tend to choose models that are too dense (Danaher et al., 2014). This can also be observed in our simulations. Therefore, we recommend that model selection be based on prior knowledge and practical purposes, such as network interpretability and stability, or identification of important edges with low false discovery rates (Meinshausen and Bühlmann, 2010). Thus, we recommend that the selection of tuning parameters be driven by the application. For example, for biological applications, the model should be biologically plausible, sufficiently complex to include important information and sparse enough to be interpretable. In this context, a robustness analysis can be used to explore how edges change over a range of tuning parameters.

9 4 Simulations

In this section, numerical experiments illustrate the utility of each sparse plus low-rank method. In Section 4.1 we illustrate the behavior of Glasso, LVGGM and PCA+GGM under different assumptions about rank-one confounding. In Section 4.2, we show the efficacy of PCA+LVGGM in a variety of simulation scenarios. In all experiments, we set p = 100 and use the scale-free and random networks generated by the huge.generator function in the R package huge. Due to space limits, we only include results for the scale-free structure.

4.1 The Efficacy of LVGGM and PCA+GGM

We compare the relative performance of PCA+GGM, LVGGM and Glasso in the presence of a rank-one confounder, L. Guided by our analysis in Section 3, we show that the relationship between L and the eigenstructure of Σ determines the performance of these three methods. We first generate the data with

Xobs^{(i)} = X^{(i)} + L^{(i)},   i = 1, ..., n,   L^{(i)} = σ V Z^{(i)},

where X^{(i)} ∈ R^p is normally distributed with mean zero and covariance matrix Σ, and Z^{(i)} ∈ R^r, the low-rank confounder, follows a normal distribution with mean zero and identity covariance matrix. V ∈ R^{p×r} is a non-random semi-orthogonal matrix satisfying V^T V = I, and σ ∈ R represents the magnitude of the confounder. Without loss of generality, we assume that X^{(i)} and Z^{(i)} are independent. We illustrate the performance of the different methods under various choices of V, the eigencomponents of L.
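The generative model above can be written out directly in R; the huge package supplies the scale-free Σ, MASS::mvrnorm draws the signal, and the particular values of σ² and the eigenvector used as V mirror the settings described below.

```r
# Sketch of the Section 4.1 simulation: scale-free Sigma plus a rank-one
# confounder sigma * V * Z along a chosen eigenvector of Sigma.
library(huge)
library(MASS)

set.seed(4)
p <- 100; n <- 200; r <- 1
gen   <- huge.generator(n = n, d = p, graph = "scale-free")
Sigma <- gen$sigma
eS    <- eigen(Sigma, symmetric = TRUE)

sigma2 <- 20                                 # confounder magnitude sigma^2
V      <- eS$vectors[, 95, drop = FALSE]     # 95-th eigenvector of Sigma as V

X     <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)   # signal X^(i)
Z     <- matrix(rnorm(n * r), n, r)                  # confounder Z^(i)
X_obs <- X + sqrt(sigma2) * Z %*% t(V)               # observed data
S_obs <- cov(X_obs)                                  # input to Glasso / LVGGM / PCA+GGM
```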

We first set r = 1, p = 100 and n = 200. The largest eigenvalue of Σ is around 5. We use vi to denote the i-th eigenvector of Σ. When examining the effect of σ, we choose the 95-th eigenvector of Σ as V to ensure that V is not closely aligned with the first few eigenvectors of Σ. We then compare the cases with σ² = 20 and σ² = 3. Next, we examine the effect of the eigenvectors. We fix σ² at 20 and use the i-th eigenvector of Σ as V, where i ∈ {1, 60, 95}. Following previous notation, we use v1, v60 and v95 as V. A rank of 1 is chosen for PC-correction and LVGGM. We generate ROC curves (Hastie et al., 2009, Section 9.2.5) for each method based on 50 simulated samples and use the average to draw the ROC curves (Figure 1). We truncate the ROC curves at FPR = 0.2, since estimates with large FPR are typically less useful in practice. From Figure 1, we observe that when the confounder has a large norm and its eigenvectors are not closely aligned with the first few eigenvectors of Σ, PCA+GGM performs better than LVGGM and Glasso. LVGGM performs best when the confounder has a large norm and its eigenvectors are not aligned with the last few eigenvectors of Σ (equivalently, the first few eigenvectors of Σ^{-1}). When the low-rank component does not have a large norm, Glasso also performs well.

4.2 The Efficacy of PCA+LVGGM

In this section, we use multiple examples to demonstrate the efficacy of PCA+LVGGM. We introduce corruption of the signal with two low-rank confounders. The data is generated as follows:

Xobs^{(i)} = X^{(i)} + V1 D1 Z1^{(i)} + V2 D2 Z2^{(i)},   i = 1, ..., n,

where X^{(i)} ∈ R^p is normally distributed with mean zero and covariance matrix Σ. Z1^{(i)} ∈ R^{d1}, corresponding to the first source of low-rank confounding, has a normal distribution with mean zero and covariance matrix I_{d1}; V1 ∈ R^{p×d1} is a non-random, semi-orthogonal matrix satisfying V1^T V1 = I, and D1 ∈ R^{d1×d1} is a diagonal matrix measuring the magnitude of the first confounder. Similarly, Z2^{(i)} ∈ R^{d2}, corresponding to the second source of low-rank confounding, has a normal distribution with mean zero and covariance matrix I_{d2}; V2 ∈ R^{p×d2} is a semi-orthogonal matrix satisfying V2^T V2 = I, and D2 ∈ R^{d2×d2} is a diagonal matrix measuring the magnitude of the second low-rank confounder.

Figure 1: n = 200. The low-rank component has rank 1; σ² is the magnitude of the low-rank component and vi is the i-th eigenvector of Σ. The first row illustrates the effect of σ²: we use the 95-th eigenvector of Σ as V and set σ² to 20 and 3, from left to right. The second row illustrates the effect of V: we fix σ² at 20 and use the 60-th and the first eigenvector of Σ as V, from left to right. PCA+GGM works best when σ² is large and V is not aligned with the first eigenvector of Σ. LVGGM works best when σ² is large and V is not aligned with the last eigenvector of Σ. When σ² is small, Glasso works as well as the other two.

Without loss of generality, we assume that X^{(i)}, Z1^{(i)} and Z2^{(i)} are pairwise independent. Hence the observed covariance matrix is

Cov(Xobs) = Σobs = Σ + V1 D1² V1^T + V2 D2² V2^T = Σ + L1 + L2.   (19)

Our first simulation setup (case 1) represents an ideal case for PCA+LVGGM, meaning that the PCA+LVGGM method performs much better than PCA+GGM, LVGGM, or Glasso. Let d1 = d2 = 3. We set p = 100 and n = 100. In this first example, the columns of V1 and V2 come from the eigenvectors of Σ. We expect PC-correction to remove L2, so we set all diagonal elements of D2² to 50 and use the last 3 eigenvectors of Σ as V2. This guarantees that PC-correction performs much better than LVGGM and Glasso at removing L2. Then we use LVGGM to estimate L1, so we need a moderately large magnitude. We set all diagonal elements of D1² to 20 and use the first 3 eigenvectors of Σ as V1. This ensures that LVGGM performs better than PC-correction and Glasso at estimating L1.

Using the sva package, we estimate the rank of L1 + L2 to be 6. We then look at the eigenvalues of the observed sample covariance matrix and see that the first 3 eigenvalues are much larger than the 4-th to 6-th eigenvalues (shown in the top row of Figure 2). We therefore allocate 3 to PC-correction and 6 − 3 = 3 to LVGGM. We also try allocating 1 to PC-correction and 5 to LVGGM. We then compare additional approaches, including PC-correction alone removing only 3 principal components or 6 principal components, LVGGM with rank 6 for the low-rank component, and the uncorrected approach, Glasso. We again use 50 datasets and draw the ROC curve for the averages over varying sparsity parameters λ. The ROC curves for the scale-free example are in the bottom row of Figure 2. We also include the AUC (area under the ROC curve) for each method. We compare PCA+LVGGM with rank 3 in PC-correction against the other methods. For each data set, we calculate the ratio

(AUC of PCA+LVGGM)/(AUC of one other method), then compute the sample mean and sample standard deviation of that ratio over the 50 data sets to compare the average performance and the variance of the different methods. The results for the scale-free graph are shown in the first column of Table 1. We can see that PCA+LVGGM with rank 3 for PC-correction and rank 3 for LVGGM performs much better than the other methods for both graph structures, indicating that if the assumptions are satisfied, our method and parameter tuning procedure are useful.

Finally, we try setups that are more similar to real-world data. We still use (19) to generate the data and set p = 100 and n = 100. Differently from the previous settings, we now use randomly generated eigenvectors as the columns of V1 and V2. We look at the distribution of eigenvalues of the gene co-expression and stock return data covariance matrices, and try to make the simulation settings similar to those examples. We run two setups: the first is called the large-magnitude case (case 2), with D1² a diagonal matrix with diagonal elements (7, 6, 6) and D2² a diagonal matrix with diagonal elements (20, 10, 10). The second setup is referred to as the moderately-large-magnitude case (case 3), in which the low-rank component has the same eigenvectors as in the large-magnitude case, but the elements of D1 and D2 are smaller, with diagonal elements of D1² equal to (3, 3, 3) and diagonal elements of D2² equal to (10, 8, 6).

Using the sva package, we estimate the rank of L1 + L2 to be 6 for both case 2 and case 3. We observe that the first 3 eigenvalues are larger than the rest, so we allocate 3 to PC-correction and use 6 − 3 = 3 as the rank of the low-rank component for LVGGM. We also try allocating 1 to PC-correction and 5 to LVGGM, using PC-correction alone removing only 3 principal components or 6 principal components, using LVGGM with rank 6, and using Glasso. Again, we run 50 datasets and include ROC curves and AUC tables.


Figure 2: We use the scale-free structure when generating graphs. The first row shows the eigenvalues of Σobs under the three setups, and the second row shows the corresponding ROC curves for the different methods. PC(k) means that we use k as the rank in PC-correction, and PC(k)+LV means that we use PCA+LVGGM with k as the rank for PC-correction.

From Figure 2 and Table 1, we can see that the other approaches considered hardly outperform PCA+LVGGM. In fact, using PCA+GGM or LVGGM alone can be viewed as a special case of the PCA+LVGGM method.

Method          Case 1         Case 2         Case 3
PCA(3)+LVGGM    1              1              1
Glasso          1.58 (0.073)   1.25 (0.080)   1.07 (0.048)
LVGGM           1.64 (0.088)   1.06 (0.044)   1.01 (0.036)
PCA(Full)       2.46 (0.22)    1.01 (0.12)    1.36 (0.14)
PCA(3)          1.08 (0.017)   1.05 (0.027)   0.99 (0.018)
PCA(1)+LVGGM    1.47 (0.11)    1.04 (0.038)   1.01 (0.035)

Table 1: We use the scale-free structure when generating graphs. We compute the ratio of AUC between PCA+LVGGM with rank 3 in PC-correction and each other method, using PCA+LVGGM as the numerator. The table shows the sample mean and sample standard deviation of that ratio (in parentheses) over 50 data sets. In case 3, the magnitude of the confounding is not as large as in the other cases, so PC-correction with rank 3 has the best performance.

To see this, note that we can recover LVGGM from PCA+LVGGM by allocating a rank of 0 to PC-correction. From the simulation and real data examples, we observe that using PCA+GGM with higher ranks often removes some useful information, resulting in more false negatives. On the other hand, if the data contain effects of multiple confounders that are not well represented by the first few principal components, using PCA+GGM alone might not be enough to remove the additional sources of noise. Likewise, LVGGM may not be enough to remove confounding with a large norm, leading to spurious connections between nodes. We would therefore suggest PCA+LVGGM as a default setting and a starting point for problems with low-rank confounding. Different rank allocations can then be chosen based on the specific problem and goals of interest.

5 Applications

5.1 Gene Co-expression Networks

Our first application is to reanalyze the gene co-expression networks originally analyzed by Parsana et al. (2019). The goal of gene co-expression network analysis is to identify transcriptional patterns indicating functional and regulatory relationships between genes. In biology, it is of great interest to infer the structure of these networks; however, the construction of such networks from data is challenging, since the data is usually corrupted by technical and unwanted biological variability known to confound expression data. The influence of such artifacts can often introduce spurious correlation between genes; if we apply sparse precision matrix inference directly without addressing confounding, we may obtain a graph including many false positive edges. Parsana et al. (2019) use PCA+GGM to estimate this network and show that PC-correction can be an effective way to control the false discovery rate of the network. In practice, however, some effects of confounding may not be represented in the top few principal components. This motivates the more flexible PCA+LVGGM approach: PC-correction effectively removes high-variance confounding, and LVGGM subsequently accounts for any remaining low-rank confounding.

We consider gene expression data from 3 diverse tissues: blood, lung and tibial nerve, with sample sizes between 300 and 400 each. 1000 genes are chosen from each tissue. More detail about the source of the data and the pre-processing steps is given in Appendix C. We observe that all of the sample covariance matrices are approximately low-rank by looking at the eigenvalues of the gene covariance matrices, indicating the potential existence of high-variance confounding. We then use the sva package to estimate the rank for PC-correction and call this the full sva rank correction. Parsana et al. (2019) suggest that the rank estimated by sva might be so large that some useful network signal is removed. To reduce the effect of over-correction, we apply PC-correction with half and one quarter of the sva rank, which we refer to as half sva rank correction and quarter sva rank correction, respectively. For many tissues, the first eigenvalue is much larger than the rest; this motivates us to also try rank-1 PC-correction. We include the results with half sva rank, quarter sva rank and rank-1 PC-corrections in Figure 3. After running the above PC-corrections to remove high-variance confounding, we run LVGGM as an additional step to further estimate and remove the low-rank noise with moderate variance. We use two different values for the γ parameter in LVGGM: a larger γ leads to removing lower-rank confounding and a smaller γ to removing higher-rank confounding. We show the results for both choices of γ. We use different values of λ to control the sparsity of the estimated graph and draw Figure 3 as a precision-recall-style plot. The y-axis represents the precision (True Positives/(True Positives + False Positives)), and the x-axis is the number of true positives. We can see that PCA+LVGGM yields better or comparably good results relative to the other methods, indicating that it can be useful to run LVGGM after PC-correction when estimating gene co-expression networks.


Figure 3: Precision-recall plots for gene expression data. TP represents the number of true positives. PCA+LVGGM(L) means a larger γ in LVGGM and PCA+LVGGM(S) means a smaller γ in LVGGM. PCA+LVGGM performs best or comparably well relative to the other approaches for almost all 3 tissues.

5.2 Stock Return Data

In finance, the Capital Asset Pricing Model (CAPM) states that there is a widespread market factor which dominates the movement of all stock prices. Empirical evidence for the market trend can be found in the first principal component of the stock data, which is dense and has approximately equal loadings across all stocks (Figure 5, left). In fact, the first few eigenvalues of the stock correlation matrix are significantly larger than the rest (Fama and French, 2004), which suggests that only a few latent factors are driving stock correlations. In this section, we posit that the conditional dependence structure after accounting for these latent effects is more likely to reflect direct relationships between companies aside from the market and, perhaps, sector trends. Our interest is in recovering the undirected graphical model (conditional dependence) structure between stock returns after controlling for potential low-rank confounders.

We compare networks inferred by PCA+LVGGM, PCA+GGM, LVGGM and Glasso by analyzing monthly returns of component stocks in the S&P 100 index between 2008 and 2019 (Hayden et al., 2015). The 49 chosen companies are in 6 sectors: technology (10 companies), finance (11), energy (7), health (8), capital goods (7) and non-cyclical stocks (6).

14 Figure 4: Stock connections between 2008 and 2019 learned by different methods. The following sectors are included: Tech, Finance, Energy, Health, Capital goods and Non-cyclical (NC) from left to right.

For PCA+GGM, we remove the first eigencomponent, which corresponds to the overall market trend. For the other latent variable methods we use the sva package to identify a plausible rank. For PCA+LVGGM, we remove the first principal component corresponding to the overall market trend and use LVGGM to estimate the remaining latent confounders and the graph.

Figure 4 shows the networks obtained by each approach. For each method, the sparsity-inducing tuning parameter was chosen to minimize the negative log-likelihood using a 6-fold cross-validation procedure, and the number of low-rank components was chosen manually. We observe that when using LVGGM, allocating rank 1 or 2 to the low-rank component does not make the estimates very different from Glasso, while allocating ranks higher than 6 to the low-rank component leads to higher out-of-sample error, so 5, the rank picked by sva, is among the best choices. For the PCA-based methods, removing more than one principal component leads to higher out-of-sample error. As expected, the Glasso result is denser than the networks learned with sparse plus low-rank methodology, with PCA+LVGGM yielding the sparsest network.

For LVGGM, we note that the method effectively controls for the sector effect but is less effective in controlling for the effect of the overall market trend. Let Σ̂obs be the empirical observed covariance matrix and v̂i be its i-th eigenvector. We have the following observations. First, the first principal component is closely aligned with the overall market trend, because the absolute value of the inner product between the first eigenvector of Σ̂obs and the normalized "all ones" vector is 0.98. Second, the observed empirical covariance matrix has an approximately low-rank structure, because the first eigenvalue of Σ̂obs is 18.25, the second is 3.5, and all other eigenvalues are close to or smaller than 1. Third, LVGGM does not capture the full effect of the market trend. Let L̂′Ω be the estimate of L′Ω in (9). When we apply LVGGM to Σ̂obs, the inner product between the first eigenvector of L̂′Ω and v̂1 is close to 1, but the first eigenvalue of L̂′Ω is only 0.55, much smaller than the first eigenvalue of Σ̂obs.


Figure 5: (a) Rank-one approximation to L̂Σ obtained with PCA. (b) L̂′Ω obtained with LVGGM with rank 5. The rank-one approximation to L̂Σ is close to a constant matrix. In contrast, L̂′Ω reflects sector effects but does not reflect the strong effect due to overall market trends.

We argue that PCA+LVGGM is the most appropriate method for this application because it appropriately controls for both market and sector effects. Let L̂Σ and L̂′Ω be the estimates of the low-rank components defined in (10). For PCA+LVGGM, we remove LΣ by removing the first eigencomponent of Σobs, then run LVGGM to estimate LΩ and Ω. We claim that PCA+LVGGM can fully remove the confounding effect in the market trend direction, as well as the remaining confounding effect in other directions. To see this, first, v̂1 is removed in PC-correction. Second, the inner product between the first eigenvector of L̂′Ω and v̂2, the second eigenvector of Σ̂obs, is 0.99, while the first eigenvalue of L̂′Ω is 0.4 and the second eigenvalue of Σ̂obs is 3.5. This shows that when applying LVGGM, only part of the information in the direction of v̂2 has been removed. We know that the direction of v̂1 reflects the market trend, but v̂2 might include both true graph information and some latent confounding effect; hence using LVGGM may be a good choice for capturing the confounding effect in the direction of v̂2. Overall, PCA+LVGGM might therefore be a better choice than LVGGM or the PCA-based method alone.

Figure 5 shows heat maps of L̂Σ obtained via PCA and of L̂′Ω obtained with LVGGM (rank 5). As expected, the elements of L̂Σ are roughly equal in magnitude, reflecting the market trend and the large first eigenvalue of Σ̂obs. In contrast, L̂′Ω shows a block-diagonal structure and its elements have smaller magnitudes, which suggests that LVGGM does not adequately account for the overall market trend. On the other hand, the block-diagonal structure of L̂′Ω reflects inferred sector effects. PCA+GGM is most effective at reducing confounding from overall market trends, and LVGGM is more effective at accounting for remaining confounding, such as the sector effect. Therefore, PCA+LVGGM, which combines the benefits of PCA and LVGGM, is arguably the most appropriate choice for addressing the latent confounding in this context.

6 Conclusion

We have studied the problem of estimating the graph structure under Gaussian graphical models when the data is corrupted by latent confounders. We compare two popular methods, PCA+GGM and LVGGM. One of our contributions is to show the connections and differences between these two approaches, both theoretically and empirically. Based on that, we propose a new method, PCA+LVGGM.

The effectiveness of this method is supported by theoretical analysis, simulations and applications. Our analysis also provides guidance on when to use which approach to estimate a GGM in the presence of latent confounders. We believe that this guidance can help researchers in many fields, such as finance and biology, who face graph estimation problems with confounders. There are several future directions. First, we can extend the current framework to other distributions, such as transelliptical distributions (Liu et al., 2012) and the Ising model (Ravikumar et al., 2010; Nussbaum and Giesen, 2019). Second, we can consider more structures; for example, we are interested in what happens if the principal components of the observed data are sparse. Other future work includes applying our current methods and analysis in more applications, such as functional magnetic resonance imaging (fMRI) in neuroscience.

References

A. Agarwal, S. Negahban, M. J. Wainwright, et al. Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. The Annals of Statistics, 40(2):1171–1197, 2012.

D. M. Barch, G. C. Burgess, M. P. Harms, S. E. Petersen, B. L. Schlaggar, M. Corbetta, M. F. Glasser, S. Curtiss, S. Dixit, C. Feldt, et al. Function in the human connectome: task-fmri and individual differences in behavior. Neuroimage, 80:169–189, 2013.

P. J. Bickel, E. Levina, et al. Covariance regularization by thresholding. The Annals of Statistics, 36(6):2577–2604, 2008a.

P. J. Bickel, E. Levina, et al. Regularized estimation of large covariance matrices. The Annals of Statistics, 36(1):199–227, 2008b.

A. Buja and N. Eyuboglu. Remarks on parallel analysis. Multivariate Behavioral Research, 27(4):509–540, 1992.

T. Cai, W. Liu, and X. Luo. A constrained l1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594–607, 2011.

T. T. Cai, Z. Ren, H. H. Zhou, et al. Estimating structured high-dimensional covariance and precision matrices: Optimal rates and adaptive estimation. Electronic Journal of Statistics, 10(1):1–59, 2016.

E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.

V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596, 2011.

V. Chandrasekaran, P. A. Parrilo, and A. S. Willsky. Latent variable graphical model selection via convex optimization. The Annals of Statistics, pages 1935–1967, 2012.

M. J. Choi, V. Y. Tan, A. Anandkumar, and A. S. Willsky. Learning latent tree graphical models. Journal of Machine Learning Research, 12(May):1771–1812, 2011.

P. Danaher, P. Wang, and D. M. Witten. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(2):373–397, 2014.

E. F. Fama and K. R. French. The capital asset pricing model: Theory and evidence. Journal of Economic Perspectives, 18(3):25–46, 2004.

M. D. Fox and M. E. Raichle. Spontaneous fluctuations in brain activity observed with functional magnetic resonance imaging. Nature Reviews Neuroscience, 8(9):700–711, 2007.

S. Freytag, J. Gagnon-Bartsch, T. P. Speed, and M. Bahlo. Systematic noise degrades gene co-expression signals but can be corrected. BMC Bioinformatics, 16(1):309, 2015.

J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.

J. A. Gagnon-Bartsch, L. Jacob, and T. P. Speed. Removing unwanted variation from high dimensional data with negative controls. Berkeley: Tech Reports from Dep Stat Univ California, pages 1–112, 2013.

S. Geng, M. Kolar, and O. Koyejo. Joint nonparametric precision matrix estimation with confounding. arXiv preprint arXiv:1810.07147, 2018.

T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media, 2009.

L. X. Hayden, R. Chachra, A. A. Alemi, P. H. Ginsparg, and J. P. Sethna. Canonical sectors and evolution of firms in the us stock markets. arXiv preprint arXiv:1503.06205, 2015.

J. L. Horn. A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185, 1965.

R. A. Horn and C. R. Johnson. Matrix analysis. Cambridge university press, 2012.

L. Jacob, J. A. Gagnon-Bartsch, and T. P. Speed. Correcting gene expression data when neither the unwanted variation nor the factor of interest are observed. Biostatistics, 17(1):16–28, 2016.

K. Khare, S.-Y. Oh, and B. Rajaratnam. A convex pseudolikelihood framework for high dimensional estimation with convergence guarantees. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(4):803–825, 2015.

S. L. Lauritzen. Graphical models, volume 17. Clarendon Press, 1996.

J. T. Leek and J. D. Storey. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS genetics, 3(9), 2007.

J. T. Leek, W. E. Johnson, H. S. Parker, A. E. Jaffe, and J. D. Storey. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics, 28(6): 882–883, 2012.

S. Lim and S. Jahng. Determining the number of factors using parallel analysis and its recent variants. Psychological methods, 24(4):452, 2019.

H. Liu, F. Han, and C.-h. Zhang. Transelliptical graphical models. In Advances in neural information processing systems, pages 800–808, 2012.

N. Meinshausen and P. Bühlmann. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473, 2010.

N. Meinshausen, P. Bühlmann, et al. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.

Z. Meng, B. Eriksson, and A. Hero. Learning latent variable gaussian graphical models. In International Conference on Machine Learning, pages 1269–1277, 2014.

F. Nussbaum and J. Giesen. Ising models with latent conditional gaussian variables. arXiv preprint arXiv:1901.09712, 2019.

P. Parsana, C. Ruberman, A. E. Jaffe, M. C. Schatz, A. Battle, and J. T. Leek. Addressing confounding artifacts in reconstruction of gene co-expression networks. Genome biology, 20(1):94, 2019.

J. Peng, P. Wang, N. Zhou, and J. Zhu. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association, 104(486):735–746, 2009.

T. Price, C.-Y. Wee, W. Gao, and D. Shen. Multiple-network classification of childhood autism using functional connectivity dynamics. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 177–184. Springer, 2014.

P. Ravikumar, M. J. Wainwright, J. D. Lafferty, et al. High-dimensional ising model selection using ℓ1-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319, 2010.

P. Ravikumar, M. J. Wainwright, G. Raskutti, B. Yu, et al. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.

Z. Ren and H. H. Zhou. Discussion: Latent variable graphical model selection via convex optimization. The Annals of Statistics, 40(4):1989–1996, 2012.

A. J. Rothman, P. J. Bickel, E. Levina, J. Zhu, et al. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515, 2008.

O. Stegle, C. Lippert, J. M. Mooij, N. D. Lawrence, and K. Borgwardt. Efficient inference in matrix-variate gaussian models with iid observation noise. In Advances in neural information processing systems, pages 630–638, 2011.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

K. R. Van Dijk, M. R. Sabuncu, and R. L. Buckner. The influence of head motion on intrinsic functional connectivity mri. Neuroimage, 59(1):431–438, 2012.

M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.

M. Yuan and Y. Lin. Model selection and estimation in the gaussian graphical model. Biometrika, 94 (1):19–35, 2007.

Appendix A Proofs

A.1 Proof of Theorem 1

From Section 3.1, we know that

$$\begin{aligned}
\hat{\Sigma} - \Sigma &= (\hat{\Sigma}_{obs} - \hat{\lambda}_1\hat{\theta}_1\hat{\theta}_1^T) - (\Sigma_{obs} - \lambda_1\theta_1\theta_1^T + \lambda_k\theta_1\theta_1^T) \\
&= (\hat{\Sigma}_{obs} - \Sigma_{obs}) + (\lambda_1\theta_1\theta_1^T - \hat{\lambda}_1\hat{\theta}_1\hat{\theta}_1^T) - \lambda_k\theta_1\theta_1^T,
\end{aligned}$$
where $\lambda_k$ is the $k$-th eigenvalue of $\Sigma_{obs}$ and $\theta_k$ is the $k$-th eigenvector of $\Sigma_{obs}$. To bound $\hat{\Sigma} - \Sigma$, we need to bound the norms of $\hat{\Sigma}_{obs} - \Sigma_{obs}$, $\hat{\lambda}_1 - \lambda_1$, and $\hat{\theta}_1 - \theta_1$; the lemmas below provide these bounds.

Lemma 1. Assume that $\Sigma_{obs}$ satisfies the maximum and minimum eigenvalue condition in (14). Then
$$P\big(\|\hat{\Sigma}_{obs} - \Sigma_{obs}\|_\infty \ge t\big) \le p^2 C_2 p^{-3}, \qquad t = 3C_1\sqrt{\frac{\log p}{n}},$$
where $C_1$ and $C_2$ depend on the eigenvalue bound $M$ in (14).

Proof. The proof follows (Bickel et al., 2008b, Lemma A.3).
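For intuition, the following self-contained sketch (our illustration, not part of the original argument) draws Gaussian data from an arbitrary AR(1)-type covariance standing in for $\Sigma_{obs}$ and compares the max-norm deviation that Lemma 1 controls against the rate $\sqrt{\log p / n}$:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 50, 2000

# Arbitrary well-conditioned AR(1)-type covariance standing in for Sigma_obs.
Sigma_obs = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))

X = rng.multivariate_normal(np.zeros(p), Sigma_obs, size=n)
Sigma_obs_hat = X.T @ X / n                           # sample covariance (mean known to be zero)

max_dev = np.max(np.abs(Sigma_obs_hat - Sigma_obs))   # ||Sigma_obs_hat - Sigma_obs||_inf
rate = np.sqrt(np.log(p) / n)                         # rate in Lemma 1, up to constants
print(f"max-norm deviation: {max_dev:.4f}   sqrt(log p / n): {rate:.4f}")
```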

Lemma 2. Under the assumptions of Theorem 1, we have
$$|\lambda_1 - \hat{\lambda}_1| \le 2\lambda_1\Big(\sqrt{\frac{p}{n}} + \delta_1\Big),$$
with probability at least $1 - 2e^{-n\delta_1^2/2}$.

Proof. By Weyl's inequality,
$$\max_{j=1,\dots,p} |\lambda_j(\hat{\Sigma}_{obs}) - \lambda_j(\Sigma_{obs})| \le \|\hat{\Sigma}_{obs} - \Sigma_{obs}\|_2.$$
The bound on $\|\hat{\Sigma}_{obs} - \Sigma_{obs}\|_2$ is then obtained from Wainwright (2019, Theorem 6.5).
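Weyl's inequality itself is easy to sanity-check numerically; the snippet below (a standalone illustration with arbitrary symmetric matrices, not the paper's data) compares the largest eigenvalue displacement with the spectral norm of the perturbation:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 30
A = rng.standard_normal((p, p)); A = (A + A.T) / 2        # symmetric matrix ("Sigma_obs")
E = 0.1 * rng.standard_normal((p, p)); E = (E + E.T) / 2  # symmetric perturbation
B = A + E                                                 # perturbed matrix ("Sigma_obs_hat")

# Weyl: max_j |lambda_j(A) - lambda_j(B)| <= ||A - B||_2 (eigenvalues in sorted order).
eig_gap = np.max(np.abs(np.linalg.eigvalsh(A) - np.linalg.eigvalsh(B)))
spec_norm = np.linalg.norm(A - B, 2)
assert eig_gap <= spec_norm + 1e-10
print(eig_gap, spec_norm)
```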

Lemma 3. Under the assumptions of Theorem 1, suppose that $n \ge p$ and $\|\Sigma\|_2\sqrt{\frac{\nu+1}{\nu^2}}\sqrt{\frac{p}{n}} \le \frac{1}{128}$. Then
$$\|\theta_1 - \hat{\theta}_1\|_2 \le C_3\sqrt{\frac{\nu+1}{\nu^2}}\sqrt{\frac{p}{n}} + \delta_2,$$
with probability at least $1 - C_4 e^{-C_5 n \min\{\sqrt{\nu}\delta_2,\, \nu\delta_2^2\}}$, where $\nu = \lambda_1(\Sigma_{obs}) - \lambda_2(\Sigma_{obs})$.

Proof. Our proof follows the proof of Wainwright (2019, Corollary 8.7). By the Davis–Kahan theorem, the difference between the leading eigenvectors of two symmetric matrices is controlled by the spectral norm of the difference between the matrices, so we start by bounding $\hat{\Sigma}_{obs} - \Sigma_{obs}$. Because $\sigma^2 > \lambda_1$, we have

$$X_{obs}^{(i)} = X^{(i)} + \sigma v Z^{(i)} = X^{(i)} + \sigma\theta_1 Z^{(i)}, \qquad i = 1,\dots,n,$$
where $Z^{(i)}$, independent of $X^{(i)}$, is a standard normal random variable. The true observed covariance matrix is

$$\Sigma_{obs} = \sigma^2\theta_1\theta_1^T + \Sigma,$$
and the sample observed covariance matrix is
$$\hat{\Sigma}_{obs} = \frac{1}{n}\sum_{i=1}^n X_{obs}^{(i)} X_{obs}^{(i)T}.$$
Expanding $\hat{\Sigma}_{obs}$ using its definition, we obtain the perturbation $P = \hat{\Sigma}_{obs} - \Sigma_{obs}$:
$$P = \sigma^2\Big(\frac{1}{n}\sum_{i=1}^n Z^{(i)2} - 1\Big)\theta_1\theta_1^T + \sigma(w\theta_1^T + \theta_1 w^T) + \Big(\frac{1}{n}\sum_{i=1}^n X^{(i)}X^{(i)T} - \Sigma\Big) \tag{20}$$
$$= P_1 + P_2 + P_3, \tag{21}$$

where $w = \frac{1}{n}\sum_{i=1}^n Z^{(i)} X^{(i)}$. Write the eigendecomposition of $\Sigma_{obs}$ as $\Sigma_{obs} = U\Lambda U^T$, where $\Lambda$ is the diagonal matrix of eigenvalues. Let $U = [\theta_1, U_2]$ and define $\tilde{p} = U_2^T P \theta_1$. Then the error in the first eigenvector of $\hat{\Sigma}_{obs}$ can be bounded by Wainwright (2019, Theorem 8.5):

$$\|\hat{\theta}_1 - \theta_1\|_2 \le \frac{2\|\tilde{p}\|_2}{\nu - 2\|P\|_2}. \tag{22}$$

We start by bounding $\|\tilde{p}\|_2$. Equation (20) and $U_2^T\theta_1 = 0$ give
$$\tilde{p} = \sigma U_2^T w + U_2^T\Big(\frac{1}{n}\sum_{i=1}^n X^{(i)}X^{(i)T}\Big)\theta_1 - U_2^T\Sigma\theta_1.$$

Because the columns of $U_2$ are orthonormal, we have the bound
$$\|\tilde{p}\|_2 \le \sigma\|w\|_2 + \Big\|\frac{1}{n}\sum_{i=1}^n X^{(i)}X^{(i)T} - \Sigma\Big\|_2.$$
The triangle inequality and Cauchy–Schwarz then give
$$\|P\|_2 \le \sigma^2\Big|\frac{1}{n}\sum_{i=1}^n Z^{(i)2} - 1\Big| + 2\sigma\|w\|_2 + \Big\|\frac{1}{n}\sum_{i=1}^n X^{(i)}X^{(i)T} - \Sigma\Big\|_2.$$

Now we bound the three terms above. Wainwright (2019, Lemma 8.8) and Bernstein's inequality (Wainwright, 2019, Proposition 2.10) give
$$P\Big(\Big|\frac{1}{n}\sum_{i=1}^n Z^{(i)2} - 1\Big| \ge \delta_{11}\Big) \le 2e^{-c_{11} n \min\{\delta_{11},\, \delta_{11}^2\}},$$
$$P\Big(\|w\|_2 \ge \|\Sigma\|_2\Big(2\sqrt{\frac{p}{n}} + \delta_{12}\Big)\Big) \le 2e^{-c_{12} n \min\{\delta_{12},\, \delta_{12}^2\}},$$
$$P\Big(\Big\|\frac{1}{n}\sum_{i=1}^n X^{(i)}X^{(i)T} - \Sigma\Big\|_2 \ge \|\Sigma\|_2\Big(C_{23}\sqrt{\frac{p}{n}} + \delta_{13}\Big)\Big) \le 2e^{-c_{13} n \min\{\delta_{13},\, \delta_{13}^2\}}.$$
We choose $\delta_{11} = \frac{1}{16}$, $\delta_{12} = \frac{\sqrt{\nu}\,\delta_2}{4}$, and $\delta_{13} = \frac{\delta_2}{16}$. If $\|\Sigma\|_2\sqrt{\frac{\nu+1}{\nu^2}}\sqrt{\frac{p}{n}} \le \frac{1}{128}$ and $\delta_2 \le \frac{\nu}{16\|\Sigma\|_2}$, then
$$\|P\|_2 \le \frac{3\nu}{16} + \|\Sigma\|_2\delta_2 \le \frac{\nu}{4},$$
with probability at least $1 - C_4 e^{-C_5 n \min\{\sqrt{\nu}\delta_2,\, \nu\delta_2^2\}}$. With the same probability, combining the inequalities above with the definition of $\tilde{p}$ gives
$$\|\tilde{p}\|_2 \le \|\Sigma\|_2\Big(4\sqrt{\nu+1}\sqrt{\frac{p}{n}} + \delta_2\Big).$$
Applying (22), which is adapted from Wainwright (2019, Theorem 8.5), completes the proof.
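As a numerical sanity check of (22) (our illustration, with an arbitrary diagonal $\Sigma$ and a single confounding direction, mirroring the model $X_{obs} = X + \sigma\theta_1 Z$ of this section), the sketch below simulates the spiked model and compares $\|\hat{\theta}_1 - \theta_1\|_2$ with $2\|\tilde{p}\|_2/(\nu - 2\|P\|_2)$:

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, sigma = 20, 5000, 3.0

Sigma = np.diag(np.linspace(1.0, 2.0, p))        # arbitrary diagonal covariance of X
theta1 = np.zeros(p); theta1[-1] = 1.0           # confounding direction v = theta1
Sigma_obs = sigma ** 2 * np.outer(theta1, theta1) + Sigma

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Z = rng.standard_normal(n)
X_obs = X + sigma * Z[:, None] * theta1          # X_obs^(i) = X^(i) + sigma * theta1 * Z^(i)

Sigma_obs_hat = X_obs.T @ X_obs / n
P = Sigma_obs_hat - Sigma_obs                    # perturbation matrix

evals, evecs = np.linalg.eigh(Sigma_obs)         # eigenvalues in ascending order
nu = evals[-1] - evals[-2]                       # eigengap nu
U2 = evecs[:, :-1]                               # directions orthogonal to theta1
p_tilde = U2.T @ P @ theta1

theta1_hat = np.linalg.eigh(Sigma_obs_hat)[1][:, -1]
theta1_hat *= np.sign(theta1_hat @ theta1)       # resolve the sign ambiguity

lhs = np.linalg.norm(theta1_hat - theta1)
rhs = 2 * np.linalg.norm(p_tilde) / (nu - 2 * np.linalg.norm(P, 2))
print(f"||theta1_hat - theta1||_2 = {lhs:.4f}   bound (22) = {rhs:.4f}")
```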

Now we are ready to prove Theorem 1. From the beginning of Appendix A.1, we have
$$\|\hat{\Sigma} - \Sigma\|_\infty \le \|\hat{\Sigma}_{obs} - \Sigma_{obs}\|_\infty + \|\lambda_1\theta_1\theta_1^T - \hat{\lambda}_1\hat{\theta}_1\hat{\theta}_1^T\|_\infty + \|\lambda_k\theta_1\theta_1^T\|_\infty.$$
The bound for the first term follows from Lemma 1. For the second term, write $\hat{\lambda}_1 = \lambda_1 + \Delta_\lambda$ and $\hat{\theta}_1 = \theta_1 + \Delta_\theta$; the second term can then be bounded as
$$\|(\lambda_1 + \Delta_\lambda)(\theta_1 + \Delta_\theta)(\theta_1 + \Delta_\theta)^T - \lambda_1\theta_1\theta_1^T\|_\infty \le 2\lambda_1\|\Delta_\theta\|_2 + |\Delta_\lambda| + 2|\Delta_\lambda|\|\Delta_\theta\|_2 + \lambda_1\|\Delta_\theta\|_2^2 + |\Delta_\lambda|\|\Delta_\theta\|_2^2.$$
For the third term, since $\theta_1$ is a unit vector, every entry of $\theta_1\theta_1^T$ is at most one in absolute value, so $\|\lambda_k\theta_1\theta_1^T\|_\infty \le \lambda_k$. Combining these bounds with the previous lemmas, taking $\delta_1$ in Lemma 2 to be $C_1\sqrt{\frac{p}{n}}$ for a large enough constant $C_1$ and $\delta_2$ in Lemma 3 to be $C_2\sqrt{\frac{p}{n}}$ for a large enough constant $C_2$, completes the proof of Theorem 1.
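The entrywise bound used for the second term above can be checked directly; the short script below (an illustration with arbitrary perturbations $\Delta_\lambda$ and $\Delta_\theta$, not tied to the data model) verifies the inequality for a random unit vector $\theta_1$:

```python
import numpy as np

rng = np.random.default_rng(3)
p = 15

theta1 = rng.standard_normal(p); theta1 /= np.linalg.norm(theta1)   # unit eigenvector
lam1, d_lam = 4.0, 0.3                                              # lambda_1 and Delta_lambda
d_theta = 0.05 * rng.standard_normal(p)                             # Delta_theta
lam1_hat, theta1_hat = lam1 + d_lam, theta1 + d_theta

lhs = np.max(np.abs(lam1 * np.outer(theta1, theta1)
                    - lam1_hat * np.outer(theta1_hat, theta1_hat)))
dt = np.linalg.norm(d_theta)
rhs = 2 * lam1 * dt + abs(d_lam) + 2 * abs(d_lam) * dt + lam1 * dt ** 2 + abs(d_lam) * dt ** 2
assert lhs <= rhs + 1e-12
print(f"entrywise error {lhs:.4f} <= bound {rhs:.4f}")
```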

A.2 Proof of Theorem 2

The proof follows that of Cai et al. (2011, Theorem 6). First, since $\Sigma\Omega = I$,
$$\|\hat{\Sigma}\Omega - I\|_\infty = \|(\hat{\Sigma} - \Sigma)\Omega\|_\infty \le \|\hat{\Sigma} - \Sigma\|_\infty\|\Omega\|_{L_1}.$$
Then we have

$$\|\hat{\Sigma}(\Omega - \hat{\Omega}_1)\|_\infty = \|\hat{\Sigma}\Omega - I + I - \hat{\Sigma}\hat{\Omega}_1\|_\infty \le \|\hat{\Sigma}\Omega - I\|_\infty + \|I - \hat{\Sigma}\hat{\Omega}_1\|_\infty \le \|\hat{\Sigma} - \Sigma\|_\infty\|\Omega\|_{L_1} + \lambda_n,$$
where the last step uses the display above together with the constraint $\|I - \hat{\Sigma}\hat{\Omega}_1\|_\infty \le \lambda_n$ satisfied by $\hat{\Omega}_1$. Next, since $\Omega\Sigma = I$,
$$\|\Omega - \hat{\Omega}_1\|_\infty = \|\Omega\Sigma(\Omega - \hat{\Omega}_1)\|_\infty \le \|\Sigma(\Omega - \hat{\Omega}_1)\|_\infty\|\Omega\|_{L_1}.$$
To relate the two displays, we use
$$\|\Sigma(\Omega - \hat{\Omega}_1)\|_\infty \le \|\hat{\Sigma}(\Omega - \hat{\Omega}_1)\|_\infty + \|\Sigma - \hat{\Sigma}\|_\infty\|\Omega - \hat{\Omega}_1\|_{L_1}.$$

Since $\|\Omega\|_{L_1} \le M_0$ by (16), combining the relations above with the result of Theorem 1 and the choice of $\lambda_n$ specified in Theorem 2 yields the stated bound for $\hat{\Omega}_1$. A bound of the same order then holds for $\hat{\Omega}$, the symmetrized version of $\hat{\Omega}_1$.
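The proof above repeatedly uses the elementary inequality $\|AB\|_\infty \le \|A\|_\infty\|B\|_{L_1}$, where $\|\cdot\|_\infty$ is the entrywise maximum and $\|\cdot\|_{L_1}$ the maximum absolute column sum. A minimal self-contained numerical check (ours, with arbitrary matrices):

```python
import numpy as np

rng = np.random.default_rng(4)
p = 25
A = rng.standard_normal((p, p))
B = rng.standard_normal((p, p))

max_norm = lambda M: np.max(np.abs(M))              # ||.||_inf: entrywise maximum
l1_norm = lambda M: np.max(np.abs(M).sum(axis=0))   # ||.||_L1: maximum absolute column sum

# |(AB)_ij| = |sum_k A_ik B_kj| <= max|A| * sum_k |B_kj| <= ||A||_inf ||B||_L1
assert max_norm(A @ B) <= max_norm(A) * l1_norm(B) + 1e-10
print(max_norm(A @ B), max_norm(A) * l1_norm(B))
```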

Appendix B Generalization of Section 3

The analysis in Section 3 assumes that the low-rank confounder is independent of X and that the eigenvector of the covariance of the low-rank confounding term is one of the eigenvectors of Σ, the covariance of X. Both assumptions can be relaxed to more general setups. In equation (11), when X and Z are not independent, the covariance matrix of X_obs becomes

$$\Sigma_{obs} = \Sigma + \sigma\,\mathrm{Cov}(X, Z)v^T + \sigma v\,\mathrm{Cov}(X, Z)^T + \sigma^2 vv^T,$$
where Cov(X, Z) is a p-dimensional column vector. The additional term has rank at most 3, so Σ_obs can still be expressed as the sum of Σ and a low-rank matrix. Therefore, the analysis in Section 3 still applies, but the eigenvectors of the low-rank matrix are no longer necessarily eigenvectors of Σ. When the eigenvector of the low-rank component is not one of the eigenvectors of Σ, we can express it in the basis formed by the eigenvectors of Σ. If the eigenvector of the low-rank component is not closely aligned with the first few eigenvectors of Σ (i.e., its inner products with the first few eigenvectors of Σ are not too large), then removing the top principal components discards little useful information and PCA+GGM can still perform well. The analysis of Section 3.1 also extends to the case where the low-rank component has rank higher than one. From the error bound, a higher rank leads to a larger upper bound, since more useful signal can potentially be removed. When the eigenvectors of the low-rank component are not closely aligned with the first few eigenvectors of Σ, we can still obtain accurate edge selection, and this observation is validated by our simulations.
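A minimal numerical check of the rank claim above (our illustration, with an arbitrary direction v and an arbitrary Cov(X, Z)):

```python
import numpy as np

rng = np.random.default_rng(5)
p, sigma = 40, 2.0
v = rng.standard_normal(p); v /= np.linalg.norm(v)   # confounding direction v
cov_xz = rng.standard_normal(p)                      # Cov(X, Z), a p-dimensional vector

# Extra term added to Sigma when X and Z are correlated.
L = sigma * np.outer(cov_xz, v) + sigma * np.outer(v, cov_xz) + sigma ** 2 * np.outer(v, v)
print(np.linalg.matrix_rank(L))   # at most 3; generically 2, since range(L) lies in span{v, cov_xz}
```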

Appendix C Gene Co-expression Networks Data

We now briefly describe the data and pre-processing procedure for the gene co-expression networks in Section 5.1; more detail can be found in Parsana et al. (2019). We use RNA-Seq data from the Genotype-Tissue Expression (GTEx) project, v6p release.² We consider three diverse tissues with sample sizes between 300 and 400 each: blood, lung, and tibial nerve. We first filter to non-overlapping protein-coding genes and apply a log transformation with base 2 to scale the data, following Parsana et al. (2019, Appendix 2.4). Since the underlying true network structure is generally unknown, we obtain reference information from canonical pathway databases, including KEGG, BioCarta, Reactome, and the Pathway Interaction Database. To make better use of this information, we pick 1,000 high-variance genes that are included in all of these databases; thus p = 1000 in this example.

²https://www.gtexportal.org/home/
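The pre-processing described above can be sketched as follows. This is a hypothetical, simplified pipeline: the file names, the pseudocount in the log transform, and the pathway gene list are placeholders, not part of the GTEx release or of Parsana et al. (2019).

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: a samples-by-genes expression matrix and the set of genes that
# appear in the pathway databases (KEGG, BioCarta, Reactome, PID). The file names and
# the pseudocount below are placeholders.
expr = pd.read_csv("expression.csv", index_col=0)              # rows: samples, columns: genes
pathway_genes = set(pd.read_csv("pathway_genes.csv")["gene"])

expr = np.log2(expr + 1)                                       # log transformation with base 2
expr = expr.loc[:, expr.columns.isin(pathway_genes)]           # keep genes covered by the databases

top_genes = expr.var().nlargest(1000).index                    # 1000 highest-variance genes
expr = expr[top_genes]                                         # p = 1000
print(expr.shape)
```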
