Identification of Latent Variables From Graphical Model Residuals

Boris Hayete *, Fred Gruber *, Anna Decker, Raymond Yan

Abstract

Graph-based causal discovery methods aim to capture conditional independencies consistent with the observed data and to differentiate causal relationships from indirect or induced ones. Successful construction of graphical models of data depends on the assumption of causal sufficiency: that is, that all confounding variables are measured. When this assumption is not met, learned graphical structures may become arbitrarily incorrect, and effects implied by such models may be wrongly attributed, carry the wrong magnitude, or misrepresent the direction of correlation. Wide application of graphical models to increasingly less curated "big data" draws renewed attention to the unobserved confounder problem.

We present a novel method that aims to control for the latent space when estimating a DAG by iteratively deriving proxies for the latent space from the residuals of the inferred model. Under mild assumptions, our method improves structural inference of Gaussian graphical models and enhances identifiability of the causal effect. In addition, when the model is used to predict outcomes, it un-confounds the coefficients on the parents of the outcomes and leads to improved predictive performance when the out-of-sample regime is very different from the training data. We show that any improvement in prediction of an outcome is intrinsically capped and cannot rise beyond a certain limit as compared to the confounded model. We extend our methodology beyond GGMs to ordinal variables and nonlinear cases. Our R package provides both PCA and autoencoder implementations of the methodology, suitable for GGMs with some guarantees and for better performance in general cases but without such guarantees.

*equal contribution
Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction

Construction of graphical models (GMs) pursues two related objectives: accurate inference of conditional independencies (causal discovery), and construction of models for outcomes of interest with the purpose of estimating the average causal effect (ACE) with respect to interventions (Pearl 2000; Hernán and Robins 2006) (causal inference). These two distinct goals mandate that certain identifiability conditions for both types of tasks be met. Of particular interest to us is the condition of causal sufficiency (Spirtes, Glymour, and Scheines 1993), namely, that all of the relevant confounders have been observed. When this condition is not met, accurate GMs are hard to infer and the causal effects are hard to identify.

Intuitively, the presence of unobserved confounders leads to violations of conditional independence among the affected variables downstream from any latent confounder. Score-based methods for GM construction aim to minimize unexplained variance for all variables in the network by accounting for conditional independencies in the data (Pearl 2000; Friedman and Koller 2013). Unobservability of a causally important variable will induce dependencies among its descendants that are conditionally independent of each other given the latent variable. This gives rise to inferred connectivity that is excessive compared to the true network (Elidan et al. 2001). Further, since none of the observed descendants are perfect correlates of the unobserved ancestor, some of the information from the ancestor will "pollute" model residuals and give rise to correlation patterns in the residual space.

Hitherto, the causal inference literature has largely focused on addressing the subset of problems where the ACE can be estimated reliably in the absence of this guarantee, such as when conditional exchangeability holds (Hernán and Robins 2006). In causal discovery, certain constraint-based methods like the Fast Causal Inference algorithm (FCI) (Spirtes, Meek, and Richardson 1995) can deal with general latent structures but do not scale well to large problems. While faster variants of FCI, such as RFCI (Colombo et al. 2012) and FCI+ (Claassen, Mooij, and Heskes 2013), exist, score-based approaches tend to perform better than purely constraint-based methods.
This has been demonstrated in problems without latent confounders (Nandy et al. 2018), and it has been suggested that this is because constraint-based methods are sensitive to early mistakes, which propagate into more errors later on, whereas in score-based approaches mistakes remain localized and do not affect the scores of graphs considered later (Bernstein et al. 2020). It is likely that these limitations also affect FCI-type algorithms. Furthermore, while FCI can deal with general latent structures, in certain scenarios where a few latent confounders drive many observed variables, most observed variables are conditionally dependent, which implies dense maximal ancestral graphs in which very few edges can be oriented (Frot, Nandy, and Maathuis 2017) (see supplementary Figure 11 for an illustration). Finally, another disadvantage of constraint-based methods is that they generate a single model and do not provide a score summarizing how likely the network is given the data, while score-based methods can generate such scores by using Bayesian or ensemble approaches (Jabbari et al. 2017).

Relatively computationally efficient score-based methods have been proposed for inferring latent variables affecting GMs by means of expectation maximization (EM) (as far back as (Friedman 1997, 1998)). However, for a large enough network, local gradients do not provide a reliable guide, nor do they address the cardinality of the latent space. Methods have been proposed for detecting latent variables in directed acyclic graphs (DAGs) with Gaussian graphical models (GGMs) that address both problems by analyzing near-cliques in DAGs (Elidan et al. 2001; Silva et al. 2006). A method related to ours has been proposed for calculating latent variables in a greedy fashion in linear and "invertible continuous" networks, and relating such estimates to observed data to speed up structure search and enable discovery of hidden variables (Elidan, Nachman, and Friedman 2007). However, with a clique-driven approach it is impossible to tell whether any cliques have shared parents and, importantly, whether any signal remains to be modeled, resulting in score-based testing rejecting proposed "ideal parents". Additionally, Wang and Blei introduced the deconfounder approach to detecting and adjusting for latent confounders of causal effects in the presence of multiple causes, but in the context of a fixed DAG and under relatively strict conditions (Wang and Blei 2019a,b).

We show that there exist circumstances when causal sufficiency can be asymptotically achieved and exchangeability ensured even when the causal drivers of an outcome are confounded by a number of latent variables. This can be done when confounding is pleiotropic, that is, when the latent variable affects a "large enough" number of variables, some driving an outcome of interest and others not (we tie this to the expansion property defined in (Anandkumar et al. 2013)). Notably, this objective cannot be achieved when confounding affects only the variables of interest and their causal parents ((D'Amour 2019)). Our approach is provably optimal given a proposed graph. This leads to the iterative EM-like optimization approach over structure and latent space proposed below.

The outline of this paper is as follows. In the first two sections, we discuss the process of diagnosing latent confounding in the special case of GGMs, where closed-form local solutions exist. In the third section, we derive the improvement cap for predictive performance of the graph thus learned. In the fourth section, we discuss implementation, and in the fifth we demonstrate the approach on simulated and real data. Finally, we propose some extensions to the approach for more complex graphical model structures and functional forms.

Background And Notation

Consider a factorized joint probability distribution over a set of observed and latent variables. Table 1 summarizes the notation used to distinguish variables, their relevant partitions, and parameters.

  S      samples                                        S_i, i ∈ {1, ..., s}
  V      observed predictor variables                   V_j, j ∈ {1, ..., v}
  U      unobserved predictor variables                 U_l, l ∈ {1, ..., u}
  O      outcomes (sinks)                               O_k, k ∈ {1, ..., o}
  D      {V, O} - observable data matrix
  D̄      estimate of variables in D from their parents
  D_u    {V, O, U} - implied data matrix
  θ      parameters                                     θ_i, i ∈ {1, ..., t}
  Pa^N   parents of variable N                          Pa_i^N, i ∈ {1, ..., p}
  C^N    children of variable N                         C_i^N, i ∈ {1, ..., c}
  G      graph over D
  G_u    graph over D_u
  R      residuals matrix mirroring D

Table 1: Notation (symbol, meaning, indexing)

Assume that the joint distribution D_u over the full data, or D over the observed data, is factorized as a directed acyclic graph, G_u or G. We will consider individual conditional probabilities describing nodes and their parents, P(V | Pa^V, θ), where θ refers to the parameters linking the parents of V to V, and θ̂ will refer to an estimate of these parameters. We will further assume that G is constructed subject to regularization and using unbiased estimators for P(V | Pa^V, θ̂), and that D_u plus any given constraints are sufficient to infer the true graph up to Markov equivalence. For convenience, we will focus on the actual true graph's parameters, so that, using unbiased estimators, E[θ̂_m | D_u] = θ_m for all m.

Mirroring D (or D_u), we define a matrix R of the same dimensions (s × (v + o)) that captures the residuals of modeling every variable N ∈ {V, O, (U)} via G (or G_u). Note that R_u, a residuals matrix that would contain residuals of U itself, does not make sense and is not defined. In the linear case these would be ordinary regression residuals, but more generally we will consider probability-scale residuals (PSR, (Shepherd, Li, and Liu 2016)). That is, we define R[i, j] = PSR(P(V_j | Pa^{V_j}, θ̂_j) | D[i, j]), the residual of V_j given its graph parents. The use of probability-scale residuals allows us to define R for all ordinal variable types, up to rank-equivalence.
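To make the definition of R concrete for a single Gaussian node, the sketch below computes probability-scale residuals using the continuous-case identity PSR(y) = 2F(y) - 1 from Shepherd, Li, and Liu (2016); the toy variables V4, V5, V6 and their coefficients are made up for illustration.

```r
# Minimal sketch: probability-scale residuals (PSR) for a Gaussian node,
# given its parents in the graph. For a continuous variable with fitted
# conditional CDF F, PSR(y) = 2*F(y) - 1 (Shepherd, Li, and Liu 2016).
set.seed(1)
n  <- 500
pa <- data.frame(V4 = rnorm(n), V5 = rnorm(n))        # hypothetical parents
V6 <- 0.8 * pa$V4 - 0.5 * pa$V5 + rnorm(n, sd = 0.7)  # hypothetical child

fit   <- lm(V6 ~ V4 + V5, data = pa)                  # P(V6 | Pa^V6, theta-hat)
sigma <- summary(fit)$sigma
psr   <- 2 * pnorm(V6, mean = fitted(fit), sd = sigma) - 1

# In the Gaussian case the PSR is rank-equivalent to the ordinary residual:
cor(psr, residuals(fit), method = "spearman")         # ~ 1
```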

Diagnosing Latent Confounding

Here, we consider the special case of GGMs, where our approach for determining the existence of latent confounders can be written down in closed form.

Recall that, for some V_j ∈ {V, U, O}, P_k^{V_j} denotes the kth parent of V_j. For GGMs, we can write down a fragment of any DAG G as a linear equation where a child node is parameterized as a linear function of its parents:

V_j = β_j0 + β_j1 P_1^{V_j} + ··· + β_jp P_p^{V_j} + ξ_j,   ξ_j ∼ N(0, σ_j).

For example, consider O_1 in Figure 1. We can write:

O_1 = β_0 + β_6 V_6 + β_5 V_5 + β_4 V_4 + β_7 V_7 + β_u U + ξ_1,   ξ_1 ∼ N(0, σ_O1).   (1)

Figure 1: Graph G_u. U influences the outcomes O and a number of predictors V, confounding many of the V_j → O_k relationships. Gray nodes are affected by U.

For any variable N that has parents in G_u, we can group the variables in Pa^N into three subsets: X_U ∈ {P^U, C^U}, X_Ū ∉ {P^U, C^U}, and the set U itself, and write down the following general form in matrix notation:

N = β_N0 + B_U X_U + B_Ū X_Ū + β_U U + ξ_N,   ξ_N ∼ N(0, σ_N).   (2)

Explicit dependence of N on U happens when β_U ≠ 0. Note that if we deleted U and its edges from G_u without relearning the graph, Equation 2 from G_u would read:

N = β_N0 + B_U X_U + B_Ū X_Ū + R_N + ξ_N   (3)

where the residual term R_N is simply equal to the direct contribution of U to N.

Now consider G, the graph learned over the variables {V, O}, excluding the latent space U. The network G would have to adjust to the missingness of U (e.g., Figure 2 vs. Figure 1). As a result, R_N will be partially substituted by other variables in {P^U, C^U}. Still, unless U is completely explained by {P^U, C^U} (as described in (D'Amour 2019)), and in the absence of regularization (when a high enough number of covariates may lead to such collinearity), R_N will not fully disappear in G. Hence, even after partially explaining the contribution of U to N by some of the parents of N in G,

R_N = β_0 + β_1 U + ξ_N.   (4)

Figure 2: Graph G. With U latent, the graph adjusts, introducing spurious edges.

  Variables (samples × variables)      Residuals (samples × variables)
  V_11  V_12  ...  V_1v                R_11  R_12  ...  R_1v
   ...   ...  ...   ...                 ...   ...  ...   ...
  V_s1  V_s2  ...  V_sv                R_s1  R_s2  ...  R_sv

Table 2: The training data frame (Variables) implies a matching data frame (Residuals) once the joint distribution of all variables is specified via a graph and its parameterization.

Therefore, the columns in the residuals table corresponding to G (Table 2) that represent the parents and children of U will contain residuals collinear with U:

R_1 = β_10 + β_11 U + ξ_1
...
R_k = β_k0 + β_k1 U + ξ_k.   (5)

Rearranging and combining,

U = β_1* R_1 + ··· + β_k* R_k + ξ = BR + ξ.   (6)

Equation 6 tells us that, for graphical Gaussian models, components of U are obtainable from linear combinations of residuals, or principal components (PCs) of the residual table R. In other words, U is identifiable by principal component analysis (PCA). Whether the residuals needed for this identification exist depends on the expansion property as defined in (Anandkumar et al. 2013).

Theorem 1 (Inferrability). Posit that we know the true DAG G over D, but not U = G_u \ G, the latent space's edges to observed variables. Assume that the latent space is orthogonal, with no variable in D a parent of any variable in U. Define R_C = C^{U,G} − C̄^{U,G}, the residuals of all children of U that also have parents in G. If the relationships described by G are linear, we can infer U from R_C up to sign and scale.

Proof. For the proof, refer to the supplement, theorem 1.

Theorem 1 does require G to be known, which will give rise to an iterative algorithm later on. However, from it we see that focusing on the residuals of a graph permits identification of the latent space in some circumstances. Supplementary theorem 2 relaxes the need for orthogonality when the latent space's structure is irrelevant but controlling for it is important.

Note that this approach is similar, in the linear case, to that in (Elidan, Nachman, and Friedman 2007), except insofar as we show the principal components of the entire residual matrix to be optimal for the discovery of the whole latent space of a given DAG, and thus couple structure inference and latent space discovery via EM (Algorithm 1).
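The sketch below works through Equation 6 on a toy example: each observed node is regressed on its parents in an assumed-known G, the residuals are stacked into R, and the first principal component of R serves as a proxy for U. The graph, coefficients, and choice of a single component are illustrative assumptions, not the released implementation.

```r
# Sketch: recover a proxy for a latent confounder U from residuals of a
# known DAG G over observed variables only (Equation 6). Toy example.
set.seed(2)
n <- 1000
U <- rnorm(n)                                   # latent, never observed
V <- data.frame(
  V1 = rnorm(n),
  V2 =  0.9 * U + rnorm(n, sd = 0.5),           # children of U
  V3 = -0.7 * U + rnorm(n, sd = 0.5),
  V4 =  0.8 * U + rnorm(n, sd = 0.5)
)
V$O1 <- 0.6 * V$V1 + 0.5 * V$V2 + 0.4 * U + rnorm(n, sd = 0.5)

# Parents of each node in the (assumed known) graph G over {V, O}:
parents <- list(V1 = character(0), V2 = character(0), V3 = character(0),
                V4 = character(0), O1 = c("V1", "V2"))

# Residuals matrix R: one column per node, residual of node ~ its parents.
R <- sapply(names(parents), function(node) {
  pa <- parents[[node]]
  if (length(pa) == 0) return(scale(V[[node]], scale = FALSE)[, 1])
  residuals(lm(reformulate(pa, response = node), data = V))
})

# Leading principal component of R is collinear with U, up to sign and scale.
U_hat <- prcomp(R, scale. = TRUE)$x[, 1]
abs(cor(U_hat, U))                              # typically > 0.9
```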
Capping improvement

The most important utility of causal modeling is to identify drivers, as well as drivers of drivers, of outcomes of interest. These drivers of drivers may have practical importance. For example, in the development of a drug, direct predictors of an outcome (e.g. V_6 pointing to O_1 in Figure 1) may not be viable targets, but upstream variables (e.g. V_5) may in fact be promising targets, with the direct parents of the outcome mediating the effect of these potential targets. In the presence of latent confounding, the identifiability of direct causal effects on outcomes is also jeopardized (Hernán and Robins 2006). For example, in Figure 2 the missing U induces apparent correlation among V_3, V_4, V_5, and V_6 even though some of these variables are conditionally independent given U.

The extent of the problem of unmeasured confounding can be quantified in more detail. Suppose we model O_1 without controlling for U (Figure 2):

O_1 = β_0 + β_3 V_3 + β_4 V_4 + β_5 V_5 + β_6 V_6 + ....

Set the coefficient of determination for the model

V_3 = α_0 + α_4 V_4 + α_5 V_5 + α_6 V_6 + ....

equal to ρ_3². Then the estimated variance of β_3 in the presence of collinearity can be related to the variance when collinearity is absent via the following formula (Rawlings, Pantula, and Dickey 1998):

var(β̄_3) = var(β_3) · 1 / (1 − ρ_3²) ∝ 1 / (1 − ρ_3²).   (7)

Formula 7 describes the variance inflation factor (VIF) of β_3. Note that lim_{ρ→1} 1 / (1 − ρ²) = ∞, so even mild collinearity induced by latent variables can severely distort coefficient values and signs, and thus the estimation of ACE. Our approach reduces the VIFs of coefficients related to outcomes, and thus makes all causal statements relating to outcomes, such as the calculation of ACE, more reliable, by controlling for Ū, the estimate of U, in the network:

lim_{(U − Ū) → 0} var(β̄_i) = var(β_i).   (8)

Consider an outcome O_j. While it is difficult to describe the limit of error on the coefficients of the drivers of O_j, it is straightforward to put a ceiling on the improvement in the likelihood obtainable from modeling U and approximating G_u with G_Ū. Suppose we eventually model U as a linear combination of a set of variables X ⊂ V, and denote by X \ W the set difference: members of X not in W. Then, for any outcome O_i predicted by a set of variables W in the graph G and in truth predicted by the set Z + U, we can contrast three expressions (from G and G_Ū, respectively):

O_i = β_i0 + B_W W + ξ_i                        (a)
O_i = β_i0 + B_W W + B_{X\W} (X \ W) + ξ_i      (b)   (9)
O_i^U = β_i0^U + B_Z^U Z + B^U U + ξ_i          (c)

Model (a) is the model that was actually accepted, subject to regularization, in G. Model (b) is the "complete" model of O_i that controls for U non-parsimoniously, by controlling for all variables affected by U and not originally in the model. The third model, (c), is the ideal parsimonious model when U is known. We can compare the quality of these models via the Bayesian score, which can be approximated by the Bayesian Information Criterion (BIC) in large samples (Koller and Friedman 2009). We assume that model (c) would have the lowest BIC (being the best model), and that model (a) scores slightly better than model (b), since we know that the set of variables X \ W did not make it into the first equation subject to regularization by BIC. Assuming n samples,

b_a = BIC(O_i = β_i0 + B_W W + ξ_i)   (10)

b_b = BIC(O_i = β_i0 + B_W W + B_{X\W} (X \ W) + ξ_i) = b_c + |X \ W| log(n)   (11)

b_c = BIC(O_i = β_i0^U + B_Z^U Z + B^U U + ξ_i)   (12)

The "complete" model, model (b), includes all of the true predictors of O_j. Therefore its score will be the same as that of the true model, model (c), plus the BIC penalty, log(n), for each extra term, minus the cost of having U in the true model (that is, the cardinality of the relevant part of the latent space). We know that the extra information carried by this model was not big enough to improve upon model (a), that is, b_a < b_c + k log(n) for some k. Rearranging:

b_c − b_a > −k log(n).   (13)

Any improvement in G_Ū owing to modeling of U cannot, therefore, exceed k log(n) log-likelihood units, where k = |X \ W| − |U|: the information contained in the "complete" model is smaller than its cost.

Although the available improvement in predictive power is thus capped, it is still important to aim for that limit. The reason is that correct inference of causality, especially in the presence of latent variables, is the only way to ensure transportability of models in real-world (heterogeneous-data) applications (see, e.g., (Bareinboim and Pearl 2016)).

This approach is a generalization of the work presented in (Anandkumar et al. 2013), where the authors show that, under some assumptions, the latent space can be learned exactly; it is also related to the deconfounder approach described by (Wang and Blei 2019a). However, our approach does not require that the observables be conditionally independent given the latent space; instead, we generate such independence by using the causal network's residuals, which are conditionally independent of each other given the graph and the latent space. Since the network among the observables is undefined in the beginning, the structure of the observable network must be learned at the same time as the structure of the latent space, which leads us to the iterative/variational Bayes approach presented in Algorithm 1. Lastly, the use of the entirety of the residual space is different from the work described in (Elidan, Nachman, and Friedman 2007), where local residuals are pursued with the goal of accelerating structure learning while simultaneously discovering the latent space.
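A small simulation, with made-up coefficients, illustrates Formula 7: when a latent U makes two predictors of an outcome collinear, the sampling variance of each coefficient is inflated by roughly 1 / (1 − ρ²) relative to the unconfounded case.

```r
# Sketch: variance inflation of a coefficient when a latent U induces
# collinearity among predictors (Formula 7). Coefficients are illustrative.
set.seed(3)
vif_demo <- function(n = 300, reps = 2000, latent = TRUE) {
  betas <- replicate(reps, {
    U  <- rnorm(n)
    V3 <- if (latent) 0.9 * U + rnorm(n, sd = 0.4) else rnorm(n)
    V4 <- if (latent) 0.9 * U + rnorm(n, sd = 0.4) else rnorm(n)
    O1 <- 0.5 * V3 + 0.5 * V4 + rnorm(n)
    coef(lm(O1 ~ V3 + V4))["V3"]
  })
  var(betas)                                   # sampling variance of beta_3
}
v_confounded  <- vif_demo(latent = TRUE)       # V3, V4 collinear through U
v_independent <- vif_demo(latent = FALSE)      # no latent confounder
v_confounded / v_independent                   # empirical variance inflation
# Analytical check: with rho^2 = cor(V3, V4)^2, the VIF is 1 / (1 - rho^2).
```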
Implementation

Algorithm 1 below describes our approach to learning the latent space and can be viewed as a type of expectation-maximization algorithm.

How do we learn Ū = f(R_Ū)? In the linear case, we can use PCA, as described above, and in the non-linear case we can use non-linear PCA, autoencoders, or other methods, as alluded to above. However, the linear case provides a useful constraint on dimensionality, and this constraint can be derived quickly. A useful notion of the ceiling constraint on the linear latent space dimensionality can be found in (Gavish and Donoho 2014). From a practical standpoint, the dimensionality can be set even tighter, and we use a previously described heuristic approach (Buja and Eyuboglu 1992).

Algorithm 1: Learning Ū from structure residuals via EM
  Data: the set of observed variables {V, O}
  Result: graph G_Ū(V, O, Ū)
  Construct G = G(V, O)
  Compute S_0 = BIC(G)
  Estimate Ū = f(R)
  Construct G_Ū = G(V, O, Ū)
  Compute S_Ū = BIC(G_Ū)
  while S_Ū − S_0 > ε do
    Set S_0 = S_Ū
    Calculate R_Ū: set Ū to arbitrary constant values
    for all child nodes C ∈ G_Ū, C ∉ U do
      Set C̄ = C | parents(C), with the parents set to the training data
      Set R_C = PSR(C, C̄)
    end for
    Estimate Ū = f(R_Ū)
    Construct G_Ū = G(V, O, Ū)
    Compute S_Ū = BIC(G_Ū)
  end while
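A condensed sketch of Algorithm 1 for the Gaussian/PCA case is given below, using bnlearn's hill climbing for the structure steps. The single latent dimension, the score threshold eps, and the plain lm()-based residuals are simplifying assumptions; the released package (github 2020) instead uses PSRs, parallel analysis to pick the dimensionality, and bootstrapped networks.

```r
# Sketch of Algorithm 1 for all-continuous data: learn G, extract residuals,
# estimate a 1-dimensional U-hat by PCA, add it to the data, and iterate
# while the network score keeps improving.
library(bnlearn)

# Residuals of each node in `keep`, given only its parents that are in `keep`
# (so U-hat, if a parent, is treated as a constant, as in Algorithm 1).
node_residuals <- function(dag, data, keep = names(data)) {
  sapply(keep, function(nd) {
    pa <- intersect(parents(dag, nd), keep)
    if (length(pa) == 0) return(scale(data[[nd]], scale = FALSE)[, 1])
    residuals(lm(reformulate(pa, response = nd), data = data))
  })
}

learn_with_latent <- function(data, eps = 1, max_iter = 20) {
  obs   <- names(data)
  dag   <- hc(data)                                       # G = G(V, O)
  s_old <- score(dag, data, type = "bic-g")
  Uhat  <- prcomp(node_residuals(dag, data), scale. = TRUE)$x[, 1]  # U-hat = f(R)
  dag_u <- dag
  for (i in seq_len(max_iter)) {
    aug   <- cbind(data, Uhat = Uhat)
    bl    <- data.frame(from = obs, to = "Uhat")          # keep U-hat a source node
    dag_u <- hc(aug, blacklist = bl)                      # G_U-hat = G(V, O, U-hat)
    s_new <- score(dag_u, aug, type = "bic-g")            # bnlearn BIC: higher is better
    if (s_new - s_old <= eps) break                       # no further score gain
    s_old <- s_new
    # E-step analogue: residuals of observed nodes given their observed
    # parents in G_U-hat, then re-estimate U-hat from them.
    Uhat  <- prcomp(node_residuals(dag_u, aug, keep = obs), scale. = TRUE)$x[, 1]
  }
  list(dag = dag_u, score = s_old)
}
```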
Nonlinear Extension: Autoencoder

All results described above refer to Gaussian or at most rank-monotonic relationships, and perhaps extend to linear models with interactions, when interactions can be seen as "synthetic features". Real-world data, however, often does not behave this way, requiring an approach that might generalize beyond monotonic relationships. Therefore, we pursued latent space discovery using autoencoders.

We first assess the cardinality of the latent space (the number of nodes in the coding layer) using the linear approach (PCA), and take this number as a useful "ceiling" for the dimensionality of the non-linear latent space, on the assumption that non-linear features are more compact and require lower dimensionality if discovered. We then constructed an autoencoder with 4 hidden layers, where the maximum number of nodes in the hidden layers was capped at min(100, number of variables), the cap being dictated by practical considerations.

The autoencoder was implemented using Keras with a Tensorflow backend and called within R using the Reticulate package (Ushey 2020). The encoder and decoder were kept symmetric, and in order to improve stability we used tied weights (Berlinkov 2018). In addition, the coding layer had additional properties borrowed from PCA, including a kernel regularizer promoting orthogonality between weights and an activity regularizer promoting uncorrelated encoded features (see (Ranjan 2019) for details on implementation and justification). This last property is of particular interest in our application, since ideally every dimension in our latent space should be associated with a different latent variable. The hidden layers used a sigmoidal activation, except for the output layer, which had a linear activation. All layers had batch normalization (Ioffe and Szegedy 2015) (see supplementary Figure 3 for a diagram of the architecture). For more details, see the implementation in the github repository for this paper ((github 2020)).
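For orientation, a stripped-down version of such an autoencoder can be written with the keras R package roughly as follows. This sketch omits the tied weights and the orthogonality and decorrelation regularizers described above and uses fewer layers; the exact architecture is in the github repository (github 2020), and k_dim would come from the linear (PCA) ceiling.

```r
# Simplified autoencoder for U-hat = f(R) in the nonlinear case, using the
# keras R package (Tensorflow backend via reticulate). Not the released
# architecture: no tied weights, no orthogonality/decorrelation penalties.
library(keras)

fit_encoder <- function(R, k_dim, hidden = min(100, ncol(R)), epochs = 100) {
  X <- scale(as.matrix(R))                     # residuals matrix, standardized

  input   <- layer_input(shape = ncol(X))
  encoded <- input %>%
    layer_dense(units = hidden, activation = "sigmoid") %>%
    layer_batch_normalization() %>%
    layer_dense(units = k_dim, activation = "sigmoid", name = "coding")
  decoded <- encoded %>%
    layer_dense(units = hidden, activation = "sigmoid") %>%
    layer_batch_normalization() %>%
    layer_dense(units = ncol(X), activation = "linear")  # linear output layer

  autoenc <- keras_model(input, decoded)
  autoenc %>% compile(optimizer = "adam", loss = "mse")
  autoenc %>% fit(X, X, epochs = epochs, batch_size = 32, verbose = 0)

  encoder <- keras_model(input, encoded)       # maps residuals to U-hat
  predict(encoder, X)                          # s x k_dim latent proxy
}
```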

Numerical Demonstration

To illustrate the algorithms described in the previous sections, we generated synthetic data from the network shown in Figure 3 (also supplementary Figure 4), in which two variables V1 and V2 drive an outcome Z. Two confounders, U1 and U2, affect both the drivers and the outcome, as well as many additional variables that do not affect the outcome Z. The coefficient values in the network were chosen so that faithfulness (also known as stability (Pearl 2000)) was achieved, allowing the structure and coefficients to be approximately recovered when all variables were observed (see the supplement for additional examples).

The underlying network inference needed for the algorithm was implemented by bootstrapping the data and running the R package bnlearn ((Scutari 2010); see the code in (github 2020) for the specific settings of the run) on each bootstrap. The resulting ensemble of networks can be combined to obtain a consensus network in which only the most confident edges are kept. Similarly, the coefficient estimates can be obtained by averaging them over bootstraps (see the sketch at the end of this section).

For this example, using the complete data (no missing variables), the consensus network created from edges with confidence larger than 40% recovers the true structure (see supplementary Figure 5), and the root mean square error (RMSE) in the coefficient estimates is 0.05 (not shown). This represents a lower bound on the error that we can expect to obtain under perfect reconstruction of the latent space.

When the confounders are unobserved, the reconstruction of the network introduces many false edges and results in a five-times-larger RMSE. Figure 4 (see also supplementary Figure 6) shows the reconstructed network, where the red edges are the true edges between V1, V2, and the outcome Z.

Figure 3: True network. Figure 4: Estimated network when U1 and U2 are unobserved. Figure 5: Estimated network at the last iteration of Algorithm 1.

We ran Algorithm 1 for 20 iterations, using PCA to reconstruct the latent space from the residuals, with the assumption that the latent variables were source nodes. We then tracked the latent variable reconstruction in an out-of-sample test set, as well as the error in the coefficient estimates. Figure 6 (supplementary Figure 7) shows the adjusted R² between each of the true latent variables and the prediction obtained from the estimated latent space across iterations; the lines and error bands are calculated using locally estimated scatterplot smoothing (LOESS). The estimated latent space is predictive of both latent variables, and the iterative procedure improves the R² with respect to U1 from 0.49 at the first iteration to about 0.505 within about 3 iterations.

Figure 7 (also supplementary Figure 8) shows the total error (RMSE) in the coefficients of all variables connected to the outcome Z in the inferred network, as well as the error in the coefficients of the true drivers of Z, V1 and V2. The dashed lines show the error levels when all variables are observed.

Figure 6: R² in the prediction of the latent variable from the selected principal components. Figure 7: Error in coefficients as a function of the iterations.

Figure 5 (and supplementary Figure 7) shows the final inferred network at iteration 20. The number of edges arriving at the outcome was reduced considerably with respect to the network inferred prior to modeling latent variables (Figure 4 and supplementary Figure 9). In addition, the coefficients connecting V1 and V2 to the outcome are now closer to their true values (Figure 7). This represents an improvement in ACE estimation.
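The bootstrap-and-consensus step described at the start of this section maps onto bnlearn's model-averaging utilities. The sketch below uses illustrative settings (hill climbing, 200 replicates, the 40% confidence threshold mentioned above) and fits coefficients once on the full data rather than averaging them over bootstraps.

```r
# Sketch of the ensemble step: bootstrap the data, learn one network per
# replicate, and keep only edges whose bootstrap confidence exceeds the
# threshold. Coefficient averaging over replicates is omitted for brevity.
library(bnlearn)

consensus_network <- function(data, R = 200, threshold = 0.4) {
  strength <- boot.strength(data, R = R, algorithm = "hc")    # edge confidences
  avg_dag  <- averaged.network(strength, threshold = threshold)
  fitted   <- bn.fit(cextend(avg_dag), data)                  # Gaussian parameters
  list(strength = strength, dag = avg_dag, fit = fitted)
}
```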

Experimental Data

In order to assess the efficacy of our approach to latent variable inference on realistic data, we picked a dataset of a type that is typically hard to analyze. The dataset in question was obtained from the Cure Huntington's Disease Initiative (CHDI) foundation (Langfelder et al. 2016). It captures molecular phenotyping and the length of the Huntingtin gene's CAG repeat expansion in CAG knock-in mice, and consists of striatum gene expression and proteomic data as well as other measurements. For this study, we extracted CAG, mouse weight and age, and a number of gene expression and proteomic measurements correlated with CAG, chosen so that each gene expression variable would also have matching proteomic data, for a total of 249 variables over 153 samples.

Excessive CAG repeats cause Huntington's disease via a complex and not entirely understood mechanism, where disease age of onset and severity strongly correlate with repeat length. Furthermore, CAG influences a large number of gene expression and protein level markers, based on Bayesian modeling, mostly indirectly. Therefore CAG is an ideal pleiotropic covariate of the type that our algorithm should be able to uncover if it were missing. However, the high biological noise, low sample size, and the presence of ties and near-ties in CAG values make the problem sufficiently difficult. Since we did not have an outcome with a lot of signal on which to test the effects of latent confounding, we focused on inference of the latent variable as our measure of success in this test (Supplementary Algorithm 1). We developed an autoencoder approach to inference for cases where the latent space may relate to the observables in a very non-linear manner. In this case, we begin to see marginally better performance for the autoencoder, although on simulated data the linear approach works better, as expected (Figure 8). Given the small number of samples, a problem for deep learning methods, we believe that the autoencoder is a promising approach for real-world datasets, especially biological data.

Figure 8: Autoencoder predicts CAG slightly better than linear PCA.

Generalizations of this approach

In this section, we discuss extensions to this approach for GGMs with interactions, nonlinear functional forms, and categorical variables.

Gaussian Graphical Models With Interactions

In the presence of interactions among variables in a GGM, equation 4, expressing the deviation of residuals from Gaussian noise, may acquire higher-order terms due to interactions among the descendants of the latent space U:

R_N = β_0 + β_1 U + β_2 U² + β_3 U³ + ....   (14)

Assuming interactions up to the kth power are present in the system being modeled, residuals for each variable may have up to k terms in the model matrix described by equation 14, and if interactions among variables in the latent space also exist, the cardinality of the principal components of the residuals may far exceed the cardinality of the underlying latent space. Nevertheless, it may be possible to reconstruct a parsimonious basis vector by application of regularization and nonlinear approaches to latent variable modeling, such as nonlinear PCA (e.g. using methods from (Karatzoglou et al. 2004)) or autoencoders (Louizos et al. 2017), as will be discussed below.
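Where such higher-order dependence on U is suspected, kernel PCA from the kernlab package cited above can stand in for linear PCA as f(R); the kernel and its parameters below are illustrative choices, not settings used in the paper.

```r
# Sketch: nonlinear (kernel) PCA on the residuals matrix R as a drop-in
# replacement for prcomp() when residuals depend on U through higher-order
# terms (equation 14). Kernel and parameters are illustrative.
library(kernlab)

nonlinear_latent <- function(R, k_dim = 2, sigma = 0.1) {
  kp <- kpca(as.matrix(scale(R)),
             kernel   = "rbfdot",
             kpar     = list(sigma = sigma),
             features = k_dim)
  rotated(kp)          # s x k_dim matrix of nonlinear latent proxies
}
```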
Generalization to nonlinear functions

We can show that linear PCA will suffice for a set of models broader than GGMs. In particular, we will focus on nonlinear but homeomorphic functions within the Generalized Additive Model (GAM) family. When talking about multiple inputs, we will require that the relationship of any output variable to any of the covariates in equation 4 is homeomorphic (invertible), and that equation 6 can be marginalized with respect to any right-hand-side variable as well as to the original left-hand-side variable. For such a class of transformations, mutual information between variables, such as between a single confounder U and some downstream variable N, is invariant (Kraskov, Stögbauer, and Grassberger 2004). Therefore, the residuals of any variable N will be rank-correlated with rank(U) in a transformation-invariant way. Further, Spearman rank correlation, specifically, is defined as the Pearson correlation of ranks, and Pearson correlation is a special case of mutual information for the bivariate normal distribution. Therefore, when talking about mutual information between ranks of arbitrarily distributed variables, we can use our results for the GGM case above.

Thus, equation 4 will apply here with some modifications:

rank(R_N) = β_0 + β_1 rank(U) + ξ_N.   (15)

Since a method has been published recently describing how to capture rank-equivalent residuals (aka probability-scale residuals, or PSR) for any ordinal variable (Shepherd, Li, and Liu 2016), we can modify equation 6 to reconstruct the latent space up to rank-equivalence when interactions are absent from the network:

rank(U) = (1/β_i) rank(R_i) + (1/β_j) rank(R_j) + ··· + ξ.   (16)

When U consists of multiple variables that are independent of each other, the relationship between N and U can be written down using the mutual information chain rule (MacKay 2003) and simplified by taking advantage of the mutual independence of the latent sources:

I(N; U) = I(N; U_1, U_2, ..., U_u) = Σ_{i=1}^{u} I(N; U_i | U_{i−1}, ..., U_1) = Σ_{i=1}^{u} I(N; U_i).   (17)

If interactions among U are present, it may still be possible to approximate the latent space with a suitably regularized nonlinear basis, but we do not, at present, know of specific conditions under which this may or may not work. Novel methods for encoding basis sets, such as nonlinear PCA (implemented in the accompanying code), autoencoders, and others, may be brought to bear to collapse the linearly independent basis down to non-linearly independent (i.e. in the mutual information sense) components.

While approximate inference of latent variables for GMs built over invertible functions had been noted in (Elidan, Nachman, and Friedman 2007), the above method gives a direct rank-linear approach leveraging the recently proposed PSRs. Similar ideas pertaining to the Ideal Parents approach and involving copula transforms have been outlined in (Tenzer and Elidan 2016).

Generalization to categorical variables

In principle, PSRs can be extended to the case of non-ordinal categorical variables by modeling a binary in/out-of-class label as being correct or false. These models would lack the smooth gradient allowed by ranks and would probably converge far worse and offer more local minima for EM to get stuck in.

Conclusions and Future Directions

In this work we present a method for describing the latent variable space which is locally optimal under linearity, up to rank-linearity. The method does not place a priori constraints on the number of latent variables, and will infer the upper bound on this dimensionality automatically.

In real-data situations, some assumptions of our linear approach may not hold, and PCA and related methods may not suffice. For such cases, we provide a heuristic autoencoder implementation. Autoencoders have shown utility when the structure of the causal model is already known (Louizos et al. 2017). Learning structure over observed data as well as an unconstrained latent space via autoencoders, as in our work, results in a hybrid "deep causal discovery" generalization of that approach.

Many domains can benefit from joint causal discovery and inference of the latent space. For instance, in some fields neither the model structure nor the latent space is typically well understood beyond simple examples. Multi-omic biological datasets are another area of applicability, since it is impossible to collect all biological data modalities in practice. In clinical trials, epidemiological features of multi-omic datasets may be hard to fully account for.

Combining causal discovery and causal inference improves statements about ACE and helps assess data set limitations. Further, we provide a relatively fast implementation suitable for real problems. Therefore, we hope that our work will open new practical applications of both paradigms.
References

Anandkumar, A.; Hsu, D.; Javanmard, A.; and Kakade, S. 2013. Learning Linear Bayesian Networks with Latent Variables. In Dasgupta, S.; and McAllester, D., eds., Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, 249–257. Atlanta, Georgia, USA: PMLR. URL http://proceedings.mlr.press/v28/anandkumar13.html.

Bareinboim, E.; and Pearl, J. 2016. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences 113(27): 7345–7352. doi:10.1073/pnas.1510507113.

Berlinkov, M. 2018. python - Tying Autoencoder Weights in a Dense Keras Layer. URL https://stackoverflow.com/questions/53751024/tying-autoencoder-weights-in-a-dense-keras-layer.

Bernstein, D.; Saeed, B.; Squires, C.; and Uhler, C. 2020. Ordering-Based Causal Structure Learning in the Presence of Latent Variables. In International Conference on Artificial Intelligence and Statistics, 4098–4108. PMLR.

Buja, A.; and Eyuboglu, N. 1992. Remarks on Parallel Analysis. Multivariate Behavioral Research 27(4): 509–540. doi:10.1207/s15327906mbr2704_2.

Claassen, T.; Mooij, J.; and Heskes, T. 2013. Learning Sparse Causal Models Is Not NP-Hard. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, 172–181. URL http://arxiv.org/abs/1309.6824.

Colombo, D.; Maathuis, M. H.; Kalisch, M.; and Richardson, T. S. 2012. Learning High-Dimensional Directed Acyclic Graphs with Latent and Selection Variables. The Annals of Statistics 40(1): 294–321. doi:10.1214/11-AOS940.

D'Amour, A. 2019. On Multi-Cause Causal Inference with Unobserved Confounding: Counterexamples, Impossibility, and Alternatives. arXiv:1902.10286 [cs, stat]. URL http://arxiv.org/abs/1902.10286.

Elidan, G.; Lotner, N.; Friedman, N.; and Koller, D. 2001. Discovering Hidden Variables: A Structure-Based Approach. In Leen, T. K.; Dietterich, T. G.; and Tresp, V., eds., Advances in Neural Information Processing Systems 13, 479–485. MIT Press. URL http://papers.nips.cc/paper/1940-discovering-hidden-variables-a-structure-based-approach.pdf.

Elidan, G.; Nachman, I.; and Friedman, N. 2007. "Ideal Parent" Structure Learning for Continuous Variable Bayesian Networks. Journal of Machine Learning Research 8.

Friedman, N. 1997. Learning belief networks in the presence of missing values and hidden variables. In ICML, volume 97, 125–133.

Friedman, N. 1998. The Bayesian structural EM algorithm. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 129–138.

Friedman, N.; and Koller, D. 2013. Being Bayesian about Network Structure. arXiv:1301.3856 [cs, stat]. URL http://arxiv.org/abs/1301.3856.

Frot, B.; Nandy, P.; and Maathuis, M. H. 2017. Robust Causal Structure Learning with Some Hidden Variables. arXiv:1708.01151 [stat].

Gavish, M.; and Donoho, D. L. 2014. The Optimal Hard Threshold for Singular Values is 4/√3. IEEE Transactions on Information Theory 60(8): 5040–5053. doi:10.1109/TIT.2014.2323359. URL http://ieeexplore.ieee.org/document/6846297/.

github. 2020. Latent Confounder. URL https://github.com/rimorob/netres.

Hernán, M. A.; and Robins, J. M. 2006. Estimating causal effects from epidemiological data. Journal of Epidemiology and Community Health 60(7): 578–586. doi:10.1136/jech.2004.029496.

Ioffe, S.; and Szegedy, C. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.

Jabbari, F.; Ramsey, J.; Spirtes, P.; and Cooper, G. 2017. Discovery of Causal Models That Contain Latent Variables through Bayesian Scoring of Independence Constraints. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 142–157. doi:10.1007/978-3-319-71246-8_9.

Karatzoglou, A.; Smola, A.; Hornik, K.; and Zeileis, A. 2004. kernlab - An S4 Package for Kernel Methods in R. Journal of Statistical Software 11(9). doi:10.18637/jss.v011.i09. URL http://www.jstatsoft.org/v11/i09/.

Koller, D.; and Friedman, N. 2009. Probabilistic Graphical Models: Principles and Techniques. Adaptive Computation and Machine Learning. Cambridge, MA: MIT Press.

Kraskov, A.; Stögbauer, H.; and Grassberger, P. 2004. Estimating mutual information. Physical Review E 69(6). doi:10.1103/PhysRevE.69.066138.

Langfelder, P.; Cantle, J. P.; Chatzopoulou, D.; Wang, N.; Gao, F.; Al-Ramahi, I.; Lu, X.-H.; Ramos, E. M.; El-Zein, K.; Zhao, Y.; Deverasetty, S.; Tebbe, A.; Schaab, C.; Lavery, D. J.; Howland, D.; Kwak, S.; Botas, J.; Aaronson, J. S.; Rosinski, J.; Coppola, G.; Horvath, S.; and Yang, X. W. 2016. Integrated genomics and proteomics to define huntingtin CAG length-dependent networks in HD Mice. Nature Neuroscience 19(4): 623–633. doi:10.1038/nn.4256.

Louizos, C.; Shalit, U.; Mooij, J.; Sontag, D.; Zemel, R.; and Welling, M. 2017. Causal Effect Inference with Deep Latent-Variable Models. arXiv:1705.08821 [cs, stat]. URL http://arxiv.org/abs/1705.08821.

MacKay, D. J. C. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge, UK; New York: Cambridge University Press.

Nandy, P.; Hauser, A.; and Maathuis, M. H. 2018. High-Dimensional Consistency in Score-Based and Hybrid Structure Learning. The Annals of Statistics 46(6A): 3151–3183.

Pearl, J. 2000. Causality: Models, Reasoning, and Inference. Cambridge, UK; New York: Cambridge University Press. URL http://dx.doi.org/10.1017/CBO9780511803161.

Ranjan, C. 2019. Build the right Autoencoder - Tune and Optimize using PCA principles. Part II. URL https://towardsdatascience.com/build-the-right-autoencoder-tune-and-optimize-using-pca-principles-part-ii-24b9cca69bd6.

Rawlings, J. O.; Pantula, S. G.; and Dickey, D. A. 1998. Applied Regression Analysis: A Research Tool. Springer Texts in Statistics. New York: Springer, 2nd edition.

Scutari, M. 2010. Learning Bayesian Networks with the bnlearn R Package. Journal of Statistical Software 35(3). doi:10.18637/jss.v035.i03. URL http://www.jstatsoft.org/v35/i03/.

Shepherd, B. E.; Li, C.; and Liu, Q. 2016. Probability-scale residuals for continuous, discrete, and censored data. The Canadian Journal of Statistics 44(4): 463–479. doi:10.1002/cjs.11302.

Silva, R.; Scheines, R.; Glymour, C.; and Spirtes, P. 2006. Learning the Structure of Linear Latent Variable Models. Journal of Machine Learning Research 7: 191–246.

Spirtes, P.; Glymour, C. N.; and Scheines, R. 1993. Causation, Prediction, and Search. Number 81 in Lecture Notes in Statistics. New York: Springer-Verlag.

Spirtes, P. L.; Meek, C.; and Richardson, T. S. 1995. Causal Inference in the Presence of Latent Variables and Selection Bias. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, 499–506. URL http://arxiv.org/abs/1302.4983.

Tenzer, Y.; and Elidan, G. 2016. Generalized Ideal Parent (GIP): Discovering non-Gaussian Hidden Variables. In Gretton, A.; and Robert, C. C., eds., Proceedings of Machine Learning Research, volume 51, 222–230. PMLR.

Ushey, K. 2020. rstudio/reticulate. URL https://github.com/rstudio/reticulate.

Wang, Y.; and Blei, D. M. 2019a. The Blessings of Multiple Causes. Journal of the American Statistical Association 114(528): 1574–1596.

Wang, Y.; and Blei, D. M. 2019b. Multiple Causes: A Causal Graphical View. arXiv:1905.12793 [cs, stat].