Gaussian Process Structural Equation Models with Latent Variables

Ricardo Silva
Department of Statistical Science, University College London
[email protected]

Robert B. Gramacy
Statistical Laboratory, University of Cambridge
[email protected]

Abstract

In a variety of disciplines such as social sciences, psychology, and medicine, the recorded data are considered to be noisy measurements of latent variables connected by some causal structure. This corresponds to a family of graphical models known as the structural equation model with latent variables. While linear non-Gaussian variants have been well-studied, inference in nonparametric structural equation models is still underdeveloped. We introduce a sparse Gaussian process parameterization that defines a non-linear structure connecting latent variables, unlike common formulations of Gaussian process latent variable models. The sparse parameterization is given a full Bayesian treatment without compromising Markov chain Monte Carlo efficiency. We compare the stability of the sampling procedure and the predictive ability of the model against the current practice.

1 CONTRIBUTION

A cornerstone principle of many disciplines is that observations are noisy measurements of hidden variables of interest. This is particularly prominent in fields such as social sciences, psychology, marketing and medicine. For instance, data can come in the form of social and economical indicators, answers to questionnaires in a medical exam or marketing survey, and instrument readings such as fMRI scans. Such indicators are treated as measures of latent factors, such as the latent ability levels of a subject in a psychological study, or the abstract level of democratization of a country. The literature on structural equation models (SEMs) (Bartholomew et al., 2008; Bollen, 1989) approaches such problems with directed graphical models, where each node in the graph is a noisy function of its parents. The goals of the analysis include typical applications of latent variable models, such as projecting points in a latent space (with confidence regions) for ranking, clustering and visualization; density estimation; imputation; and causal inference (Pearl, 2000; Spirtes et al., 2000).

This paper introduces a nonparametric formulation of SEMs with hidden nodes, where functions connecting latent variables are given a Gaussian process prior. An efficient but flexible sparse formulation is adopted. To the best of our knowledge, our contribution is the first full Gaussian process treatment of SEMs with latent variables.

We assume that the model graphical structure is given. Structural model selection with latent variables is a complex topic which we will not pursue here: a detailed discussion of model selection is left as future work. Asparouhov and Muthén (2009) and Silva et al. (2006) discuss relevant issues. Our goal is to generate posterior distributions over parameters and latent variables using scalable sampling procedures with good mixing properties, while remaining competitive with non-sparse Gaussian process models.

In Section 2, we specify the likelihood function for our structural equation models and its implications. In Section 3, we elaborate on priors, Bayesian learning, and a sparse variation of the basic model which is able to handle larger datasets. Section 4 describes a Markov chain Monte Carlo (MCMC) procedure. Section 5 evaluates the usefulness of the model and the stability of the sampler in a set of real-world SEM applications, with comparisons to modern alternatives. Finally, in Section 6 we discuss related work.

2 THE MODEL: LIKELIHOOD

Let G be a given directed acyclic graph (DAG). For simplicity, in this paper we assume that no observed variable is a parent in G of any latent variable. Many SEM applications are of this type (Bollen, 1989; Silva et al., 2006), and this will simplify our presentation. Likewise, we will treat models for continuous variables only. Although cyclic SEMs are also well-defined for the linear case (Bollen, 1989), non-linear cyclic models are not trivial to define, and as such we exclude them from this paper.


Figure 1: (a) An example adapted from Palomo et al. (2007): latent variable IL corresponds to a scalar labeled as the industrialization level of a country. PDL is the corresponding political democratization level. Variables Y1, Y2, Y3 are indicators of industrialization (e.g., gross national product), while Y4, ..., Y7 are indicators of democratization (e.g., expert assessments of freedom of press). Each variable is a function of its parents with a corresponding additive error term: ε_i for each Y_i, and ζ for the democratization level. For instance, PDL = f(IL) + ζ for some function f(·). (b) Dependence among latent variables is essential to obtain sparsity in the measurement structure. Here we depict how the graphical dependence structure would look if we regressed the observed variables on the independent latent variables of (a).

Let X be our set of latent variables and X_i ∈ X be a particular latent variable. Let X_{Pi} be the set of parents of X_i in G. The latent structure in our SEM is given by the following generative model: if the parent set of X_i is not empty,

    X_i = f_i(X_{Pi}) + ζ_i,   where ζ_i ~ N(0, v_{ζi})        (1)

Here N(m, v) is the Gaussian distribution with mean m and variance v. If X_i has no parents (i.e., it is an exogenous latent variable, in SEM terminology), it is given a mixture of Gaussians marginal.[1]

[1] For simplicity of presentation, in this paper we adopt a finite mixture of Gaussians marginal for the exogenous variables. However, introducing a Dirichlet process mixture of Gaussians marginal is conceptually straightforward.

The measurement model, i.e., the model that describes the distribution of observations Y given latent variables X, is as follows. For each Y_j ∈ Y with parent set X_{Pj}, we have

    Y_j = λ_{j0} + X_{Pj}^T Λ_j + ε_j,   where ε_j ~ N(0, v_{εj})        (2)

Error terms {ε_j} are assumed to be mutually independent and independent of all latent variables in X. Moreover, Λ_j is a vector of linear coefficients Λ_j = [λ_{j1} ... λ_{j|X_{Pj}|}]^T. Following SEM terminology, we say that Y_j is an indicator of the latent variables in X_{Pj}.

An example is shown in Figure 1(a). Following the notation of Bollen (1989), squares represent observed variables and circles represent latent variables. SEMs are graphical models with an emphasis on sparse models where: 1. latent variables are dependent according to a directed graph model; 2. observed variables measure (i.e., are children of) very few latent variables. Although sparse latent variable models have been the object of study in machine learning and statistics (e.g., Wood et al. (2006); Zou et al. (2006)), not much has been done on exploring nonparametric models with a dependent latent structure (a loosely related exception being dynamic systems, where filtering is the typical application). Figure 1(b) illustrates how modeling can be affected by discarding the structure among latents.[2]

[2] Another consequence of modeling latent dependencies is reducing the number of parameters of the model: a SEM with a linear measurement model can be seen as a type of module network (Segal et al., 2005), where the observed children of a particular latent X_i share the same nonlinearities propagated from X_{Pi}. In the context of Figure 1, each indicator Y_i ∈ {Y_4, ..., Y_7} has a conditional expected value of λ_{i0} + λ_{i1} f_2(X_1) for a given X_1: the function f_2(·) is shared among the indicators of X_2.
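To make Equations (1) and (2) concrete, the following minimal Python sketch simulates data from the graph of Figure 1(a). The specific non-linear function f, the mixture components for IL, and all loadings and noise variances below are illustrative placeholders, not values used anywhere in this paper.

```python
# Minimal sketch of the generative model in Equations (1)-(2) for the graph of
# Figure 1(a): IL -> PDL among latents, with linear-Gaussian indicators.
# The function f and all parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
N = 200

# Latent structure, Equation (1): exogenous IL from a Gaussian mixture,
# PDL = f(IL) + zeta with Gaussian noise.
def f(il):                       # hypothetical non-linear structural function
    return np.tanh(2.0 * il)

mix = rng.integers(0, 2, size=N)             # two-component mixture for IL
il = np.where(mix == 0, rng.normal(-1.0, 0.5, N), rng.normal(1.0, 0.5, N))
pdl = f(il) + rng.normal(0.0, 0.3, N)        # zeta ~ N(0, v_zeta)

# Measurement model, Equation (2): Y_j = lambda_j0 + lambda_j1 * parent + eps_j.
def indicators(latent, loadings, intercepts, noise_sd):
    eps = rng.normal(0.0, noise_sd, size=(latent.size, len(loadings)))
    return intercepts + np.outer(latent, loadings) + eps

Y_ind = indicators(il,  loadings=[1.0, 0.8, 1.2], intercepts=[0.0, 0.1, -0.2], noise_sd=0.5)
Y_dem = indicators(pdl, loadings=[1.0, 0.9, 1.1, 0.7], intercepts=[0.0, 0.2, -0.1, 0.3], noise_sd=0.5)
print(Y_ind.shape, Y_dem.shape)   # (200, 3) indicators of IL, (200, 4) of PDL
```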

2.1 Identifiability Conditions

Latent variable models might be unidentifiable. In the context of Bayesian inference, this is less of a theoretical issue than a computational one: unidentifiable models might lead to poor mixing in MCMC, as discussed in Section 5. Moreover, in many applications the latent embedding of the data points is itself of interest, or the latent regression functions are relevant for causal inference purposes. In such applications, an unidentifiable model is of limited interest. In this section, we show how to derive sufficient conditions for identifiability.

Consider the case where a latent variable X_i has at least three unique indicators Y_i ≡ {Y_iα, Y_iβ, Y_iγ}, in the sense that no element of Y_i has any parent in G other than X_i. It is known that in this case (Bollen, 1989) the parameters of the structural equations for each element of Y_i are identifiable (i.e., the linear coefficients and the error term variance) up to the scale and sign of the latent variable. This can be resolved by setting the linear structural equation of (say) Y_iα to Y_iα = X_i + ε_iα. The distribution of the error terms is then identifiable. The distribution of X_i follows from a deconvolution between the observed distribution of an element of Y_i and the identified distribution of the error term.[3]

[3] Notice that if the distribution of the error terms is non-Gaussian, identification is easier: we only need two unique indicators Y_iα and Y_iβ. Since ε_iα, ε_iβ and X_i are mutually independent, identification follows from known results derived in the literature of overcomplete independent component analysis (Hoyer et al., 2008b).

Identifiability of the joint distribution of X can be resolved by multivariate deconvolution under extra assumptions. For instance, Masry (2003) describes one setup for the problem in the context of kernel density estimation (with a known joint distribution of the error terms, but an unknown joint of Y).

Assumptions for the identifiability of the functions f_i(·), given the identifiability of the joint of X, have been discussed in the literature of error-in-variables regression (Fan and Truong, 1993; Carroll et al., 2004). Error-in-variables regression is a special case of our problem, where X_i is observed but X_{Pi} is not. However, since we have Y_iα = X_i + ε_iα, this is equivalent to an error-in-variables regression Y_iα = f_i(X_{Pi}) + ε_iα + ζ_i, where the compound error term ε_iα + ζ_i is still independent of X_{Pi}. It can be shown that such identifiability conditions can be exploited in order to identify causal directionality among latent variables.

3 THE MODEL: PRIORS

3.1 Gaussian Process Priors

Let X_i be an arbitrary latent variable in the graph, with latent parents X_{Pi}. We will use X^(d) to represent the dth sample from the distribution of the random vector X, and X_i^(d) to index its ith component. For instance, X_{Pi}^(d) is the dth sample of the parents of X_i. A training set of size N is represented as {Z^(1), ..., Z^(N)}, where Z is the set of all variables. Lower-case x represents fixed values of latent variables, and x^(1:N) represents a whole set {x^(1), ..., x^(N)}.

For each X_{Pi}, the corresponding Gaussian process prior for the function values f_i^(1:N) ≡ {f_i^(1), ..., f_i^(N)} is

    f_i^(1:N) | x_{Pi}^(1:N) ~ N(0, K_i)

where K_i is an N × N kernel matrix (Rasmussen and Williams, 2006), as determined by x_{Pi}^(1:N). Each corresponding x_i^(d) is given by f_i^(d) + ζ_i^(d), as in Equation (1). We refer to the resulting model as GPSEM-LV.[4]

[4] To see how the Gaussian process networks of Friedman and Nachman (2000) are a special case of GPSEM-LV, imagine a model where each latent variable is measured without error. That is, each X_i has at least one observed child Y_i such that Y_i = X_i. The measurement model is still linear, but each structural equation among latent variables can be equivalently written in terms of the observed variables: i.e., X_i = f_i(X_{Pi}) + ζ_i is equivalent to Y_i = f_i(Y_{Pi}) + ζ_i, as in Friedman and Nachman.

MCMC can be used to sample from the posterior distribution over latent variables and functions. However, each sampling step in this model costs O(N^3), making sampling very slow when N is at the order of hundreds, and essentially undoable when N is in the thousands. As an alternative, we introduce a multilayered representation adapted from the pseudo-inputs model of Snelson and Ghahramani (2006). The goal is to reduce the sampling cost down to O(M^2 N), where M ≪ N is the number of pseudo-inputs.

3.2 A Sparse Formulation

Following Snelson and Ghahramani (2006), we augment the model for each X_i with a set of M pseudo-inputs x̄_i^(1:M) ≡ {x̄_i^(1), ..., x̄_i^(M)} and corresponding pseudo-functions f̄_i^(1:M) ≡ {f̄_i^(1), ..., f̄_i^(M)}. It is important to notice that each pseudo-input x̄_i^(d), d = 1, ..., M, has the same dimensionality as X_{Pi}. The motivation for this is that x̄_i works as an alternative training set, with the original prior predictive means and variances being recovered if M = N and x̄_i = x_{Pi}. The resulting prior over function values and pseudo-functions is

    p(f_i^(1:N), f̄_i^(1:M) | x̄_i^(1:M), x_{Pi}^(1:N)) = N(f_i^(1:N); K_{i;NM} K_{i;M}^{-1} f̄_i, V_i) × N(f̄_i; 0, K_{i;M})        (3)

where K_{i;NM} is an N × M matrix with each (j, k) element given by the kernel function k_i(x_{Pi}^(j), x̄_i^(k)). Similarly, K_{i;M} is an M × M matrix where element (j, k) is k_i(x̄_i^(j), x̄_i^(k)). Let k_{i;dM} denote the dth row of K_{i;NM}. The matrix V_i is a diagonal matrix with entry v_{i;dd} given by v_{i;dd} = k_i(x_{Pi}^(d), x_{Pi}^(d)) − k_{i;dM} K_{i;M}^{-1} k_{i;dM}^T. This implies that all latent function values {f_i^(1), ..., f_i^(N)} are conditionally independent given the pseudo-inputs and pseudo-functions.
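As an illustration of the quantities entering Equation (3), the sketch below builds K_{i;NM}, K_{i;M} and the diagonal of V_i for a single latent variable with one parent, and then evaluates the conditional moments of f_i given a draw of the pseudo-functions. The squared exponential kernel with unit amplitude and length scale, and the placeholder inputs, are assumptions made only for this sketch.

```python
# Sketch of the sparse (pseudo-input) quantities around Equation (3):
# K_NM, K_M, the rows k_dM, the diagonal of V_i, and the implied conditional
# moments of f_i given pseudo-functions. Data and pseudo-inputs are placeholders.
import numpy as np

def se_kernel(A, B, a=1.0, b=1.0, jitter=0.0):
    """Squared exponential kernel a*exp(-|x-x'|^2 / (2b)) with optional nugget."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    K = a * np.exp(-0.5 * d2 / b)
    if jitter > 0.0 and A.shape == B.shape and np.allclose(A, B):
        K = K + jitter * np.eye(A.shape[0])
    return K

rng = np.random.default_rng(1)
N, M, P = 150, 20, 1                        # data points, pseudo-inputs, parents
x_parents = rng.normal(size=(N, P))         # x_{Pi}^{(1:N)} (placeholder values)
x_bar = rng.uniform(-2, 2, size=(M, P))     # pseudo-inputs xbar_i^{(1:M)}

K_M  = se_kernel(x_bar, x_bar, jitter=1e-4)          # M x M
K_NM = se_kernel(x_parents, x_bar)                    # N x M
K_M_inv = np.linalg.inv(K_M)

# Diagonal of V_i: prior variance minus the part explained by the pseudo-inputs.
k_diag = np.array([se_kernel(x[None, :], x[None, :])[0, 0] for x in x_parents])
v_dd = k_diag - np.einsum('nm,mk,nk->n', K_NM, K_M_inv, K_NM)

# Conditional prior of f_i^(d) given pseudo-functions fbar: mean k_dM K_M^{-1} fbar
# and variance v_dd; these are the two ingredients reused throughout Section 4.
fbar = rng.multivariate_normal(np.zeros(M), K_M)      # a draw from N(0, K_M)
cond_mean = K_NM @ K_M_inv @ fbar
print(cond_mean.shape, v_dd.min() >= -1e-8)
```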
3.3 Pseudo-inputs: A Fully Bayesian Formulation

The density function implied by (3) replaces the standard Gaussian process prior. In the context of Snelson and Ghahramani (2006), input and output variables are observed, and as such Snelson and Ghahramani optimize x̄_i^(1:M) by maximizing the marginal likelihood of the model. This is practical but sometimes prone to overfitting, since pseudo-inputs are in fact free parameters, and the pseudo-inputs model is best seen as a variation of the Gaussian process prior rather than an approximation to it (Titsias, 2009). In our setup, there is limited motivation to optimize the pseudo-inputs, since the inputs themselves are random variables. For instance, we show in the next section that the cost of sampling pseudo-inputs is no greater than the cost of sampling latent variables, while avoiding cumbersome optimization techniques to choose pseudo-input values. Instead, we put a prior on the pseudo-inputs and extend the sampling procedure. By conditioning on the data, a good placement for the pseudo-inputs can be learned, since X_{Pi} and x̄_i are dependent in the posterior. Moreover, this naturally provides protection against overfitting.

A simple choice of priors for pseudo-inputs is as follows: each pseudo-input x̄_i^(d), d = 1, ..., M, is given a N(μ_i^d, Σ_i^d) prior, independent of all other random variables. A partially informative (empirical) prior can be easily defined in the case where, for each X_k, we have the freedom of choosing a particular indicator Y_q with fixed structural equation Y_q = X_k + ε_q (see Section 2.1), implying E[X_k] = E[Y_q]. This means that if X_k is a parent of X_i, we set the respective entry in μ_i^d (recall that μ_i^d is a vector with an entry for every parent of X_i) to the empirical mean of Y_q. Each prior covariance matrix Σ_i^d is set to be diagonal with a common variance.

Alternatively, we would like to spread the pseudo-inputs a priori: other things being equal, pseudo-inputs that are too close to each other can be wasteful given their limited number. One prior, inspired by space-filling designs from the experimental design literature (Santner et al., 2003), is

    p(x̄_i^(1:M)) ∝ det(D_i)

the determinant of a kernel matrix D_i computed over the pseudo-inputs. We use a squared exponential covariance function with a characteristic length scale of 0.1 (Rasmussen and Williams, 2006), and a "nugget" constant that adds 10^{-4} to each diagonal term. This prior has support over a [−L, L]^{|X_{Pi}|} hypercube. We set L to be three times the largest standard deviation of the observed variables in the training data. This is the pseudo-input prior we adopt in our experiments, where we center all observed variables at their empirical means.
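A minimal sketch of this space-filling prior, written as a log-density up to an additive constant, is given below. The exact parameterization of the length scale and the placeholder inputs are assumptions of the sketch; the final comparison only checks that well-spread pseudo-inputs receive higher prior support than nearly coincident ones.

```python
# Sketch of the space-filling pseudo-input prior p(xbar) proportional to det(D),
# where D is a squared exponential kernel matrix (length scale 0.1, 1e-4 nugget)
# over the pseudo-inputs, restricted to the [-L, L] hypercube.
import numpy as np

def log_space_filling_prior(x_bar, L, length_scale=0.1, nugget=1e-4):
    if np.any(np.abs(x_bar) > L):             # outside the hypercube: zero density
        return -np.inf
    d2 = ((x_bar[:, None, :] - x_bar[None, :, :]) ** 2).sum(-1)
    D = np.exp(-0.5 * d2 / length_scale**2) + nugget * np.eye(x_bar.shape[0])
    sign, logdet = np.linalg.slogdet(D)
    return logdet                               # log det(D), up to a constant

rng = np.random.default_rng(2)
spread  = rng.uniform(-2, 2, size=(20, 1))      # well spread pseudo-inputs
clumped = spread * 0.01                         # nearly coincident pseudo-inputs
print(log_space_filling_prior(spread, L=3.0) > log_space_filling_prior(clumped, L=3.0))
```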

3.4 Other Priors

We adopt standard priors for the parametric components of this model: independent Gaussians for each coefficient λ_{ij}, inverse gamma priors for the variances of the error terms, and a Dirichlet prior for the distribution of the mixture indicators of the exogenous variables.

4 INFERENCE

We use a Metropolis-Hastings scheme to sample from our space of latent variables and parameters. Similarly to Gibbs sampling, we sample blocks of random variables while conditioning on the remaining variables. When the corresponding conditional distributions are canonical, we sample directly from them. Otherwise, we use mostly standard random walk proposals.

Conditioned on the latent variables, sampling the parameters of the measurement model is identical to the case of classical Bayesian linear regression. We omit the expressions for simplicity. The same can be said of the sampling scheme for the posterior variances of each ζ_i. Sampling the mixture distribution parameters for the exogenous variables is also identical to the standard Bayesian case of Gaussian mixture models, and is also omitted. We describe the remaining stages of the sampler for the sparse model. The sampler for the model with full Gaussian process priors is simpler and analogous.

4.1 Sampling Latent Functions

In principle, one can analytically marginalize the pseudo-functions f̄_i^(1:M). However, keeping an explicit sample of the pseudo-functions is advantageous when sampling latent variables X_i^(d), d = 1, ..., N: for each child X_c of X_i, only the corresponding factor for the conditional density of f_c^(d) needs to be computed (at an O(M) cost), since function values are independent given latent parents and pseudo-functions. This issue does not arise in the fully-observed case of Snelson and Ghahramani (2006), who do marginalize the pseudo-functions.

Pseudo-functions and functions {f̄_i^(1:M), f_i^(1:N)} are jointly Gaussian given all other random variables and data. The conditional distribution of f̄_i^(1:M) given everything except itself and {f_i^(1), ..., f_i^(N)} is Gaussian with covariance matrix

    S̄_i ≡ ( K_{i;M}^{-1} + K_{i;M}^{-1} K_{i;NM}^T (V_i + v_{ζi} I)^{-1} K_{i;NM} K_{i;M}^{-1} )^{-1}

where V_i is defined in Section 3.2 and I is an N × N identity matrix. The total cost of computing this matrix is O(NM^2 + M^3) = O(NM^2). The corresponding mean is

    S̄_i K_{i;M}^{-1} K_{i;NM}^T (V_i + v_{ζi} I)^{-1} x_i^(1:N)

where x_i^(1:N) is a column vector of length N.

Given that f̄_i^(1:M) is sampled according to this multivariate Gaussian, we can now sample {f_i^(1), ..., f_i^(N)} in parallel, since this becomes a mutually independent set with univariate Gaussian marginals. The conditional variance of f_i^(d) is v_i' ≡ 1/(1/v_{i;dd} + 1/v_{ζi}), where v_{i;dd} is defined in Section 3.2. The corresponding mean is v_i' (f_μ^(d)/v_{i;dd} + x_i^(d)/v_{ζi}), where f_μ^(d) = k_{i;dM} K_{i;M}^{-1} f̄_i.

In Section 5, we also sample from the posterior distribution of the hyperparameters Θ_i of the kernel function used by K_{i;M} and K_{i;NM}. Plain Metropolis-Hastings is used to sample these hyperparameters, using a uniform proposal in [αΘ_i, (1/α)Θ_i] for 0 < α < 1.
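The two Gaussian conditionals of this subsection can be sketched as follows, with K_{i;M}, K_{i;NM} and the diagonal of V_i taken as given (filled with placeholder values here) and v_zeta denoting the structural noise variance. This is a sketch under those assumptions, not the paper's implementation.

```python
# Sketch of the Section 4.1 updates: draw the pseudo-functions fbar_i from their
# Gaussian conditional (covariance S_i), then draw each f_i^(d) independently
# from its univariate Gaussian conditional.
import numpy as np

def sample_latent_functions(x_i, K_M, K_NM, v_dd, v_zeta, rng):
    N, M = K_NM.shape
    K_M_inv = np.linalg.inv(K_M)
    Gamma_inv = 1.0 / (v_dd + v_zeta)                 # (V_i + v_zeta I)^{-1}, diagonal
    A = K_M_inv @ K_NM.T                              # M x N
    S_inv = K_M_inv + (A * Gamma_inv) @ A.T           # inverse covariance of fbar
    S = np.linalg.inv(S_inv)
    mean_fbar = S @ (A @ (Gamma_inv * x_i))
    fbar = rng.multivariate_normal(mean_fbar, S)

    # f_i^(d) | fbar, x_i^(d): precision 1/v_dd + 1/v_zeta, as in the text.
    f_mu = K_NM @ K_M_inv @ fbar                      # prior conditional means
    v_prime = 1.0 / (1.0 / v_dd + 1.0 / v_zeta)
    mean_f = v_prime * (f_mu / v_dd + x_i / v_zeta)
    f = rng.normal(mean_f, np.sqrt(v_prime))
    return fbar, f

rng = np.random.default_rng(3)
N, M = 150, 20
K_M = np.eye(M); K_NM = rng.normal(scale=0.1, size=(N, M))   # placeholder matrices
v_dd = np.full(N, 0.5); x_i = rng.normal(size=N)
fbar, f = sample_latent_functions(x_i, K_M, K_NM, v_dd, v_zeta=0.25, rng=rng)
print(fbar.shape, f.shape)
```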
4.2 Sampling Pseudo-inputs and Latent Variables

We sample each pseudo-input x̄_i^(d) one at a time, d = 1, 2, ..., M. Recall that x̄_i^(d) is a vector with as many entries as the number of parents of X_i. In our implementation, we propose all entries of the new x̄_i^(d)' simultaneously using a Gaussian random walk proposal centered at x̄_i^(d), with the same variance in each dimension and no correlation structure. For problems where the number of parents of X_i is larger than in our examples (i.e., four or more parents), other proposals might be justified.

Let π̄_i^(\d)(x̄_i^(d)) be the conditional prior for x̄_i^(d) given x̄_i^(\d), where (\d) ≡ {1, 2, ..., d−1, d+1, ..., M}. Given a proposed x̄_i^(d)', we accept the new value with probability min{1, l_i(x̄_i^(d)')/l_i(x̄_i^(d))}, where

    l_i(x̄_i^(d)) = π̄_i^(\d)(x̄_i^(d)) × p(f̄_i^(d) | f̄_i^(\d), x̄_i) × ∏_{d'=1}^{N} v_{i;d'd'}^{-1/2} exp( −(f_i^(d') − k_{i;d'M} K_{i;M}^{-1} f̄_i)^2 / (2 v_{i;d'd'}) )

and p(f̄_i^(d) | f̄_i^(\d), x̄_i) is the conditional density that follows from Equation (3). Fast submatrix updates of K_{i;M} and K_{i;NM} K_{i;M}^{-1} are required in order to calculate l_i(·) at an O(NM) cost, which can be done by standard Cholesky updates (Seeger, 2004). The total cost is therefore O(NM^2) for a full sweep over all pseudo-inputs.

The conditional density p(f̄_i^(d) | f̄_i^(\d), x̄_i) is known to be sharply peaked for moderate sizes of M (at the order of hundreds) (Titsias et al., 2009), which may cause mixing problems for the Markov chain. One way to mitigate this effect is to also propose a value f̄_i^(d)' jointly with x̄_i^(d)', which is possible at no additional cost. We propose the pseudo-function using the conditional p(f̄_i^(d) | f̄_i^(\d), x̄_i). The Metropolis-Hastings acceptance probability for this variation is then simplified to min{1, l_i'(x̄_i^(d)')/l_i'(x̄_i^(d))}, where

    l_i'(x̄_i^(d)) = π̄_i^(\d)(x̄_i^(d)) × ∏_{d'=1}^{N} v_{i;d'd'}^{-1/2} exp( −(f_i^(d') − k_{i;d'M} K_{i;M}^{-1} f̄_i)^2 / (2 v_{i;d'd'}) )
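The sketch below illustrates a single pseudo-input update under the simplified acceptance ratio l_i'(·), with the pseudo-function proposed from its Gaussian conditional. The squared exponential kernel with unit parameters, the flat prior over a cube, and the step size are placeholder assumptions; unlike a production implementation, it rebuilds the kernel matrices from scratch rather than using the Cholesky updates mentioned above.

```python
# Sketch of one Metropolis-Hastings update for a single pseudo-input xbar_i^(d)
# (Section 4.2), proposing the pseudo-function fbar_i^(d) jointly from its
# Gaussian conditional and accepting with the simplified ratio l_i'.
import numpy as np

def se(A, B, b=1.0):
    return np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / b)

def log_l_prime(x_parents, x_bar, fbar, f, log_prior):
    """log of l_i': log prior of the pseudo-inputs plus the sparse-GP data term."""
    K_M = se(x_bar, x_bar) + 1e-4 * np.eye(len(x_bar))
    K_NM = se(x_parents, x_bar)
    K_M_inv = np.linalg.inv(K_M)
    v_dd = 1.0 + 1e-4 - np.einsum('nm,mk,nk->n', K_NM, K_M_inv, K_NM)
    resid = f - K_NM @ K_M_inv @ fbar
    return log_prior(x_bar) - 0.5 * np.sum(np.log(v_dd) + resid**2 / v_dd)

def update_pseudo_input(d, x_parents, x_bar, fbar, f, log_prior, step, rng):
    x_new, fbar_new = x_bar.copy(), fbar.copy()
    x_new[d] = x_bar[d] + rng.normal(0.0, step, size=x_bar.shape[1])
    # Propose fbar^(d) from its conditional under N(0, K_M) at the new location.
    K_M = se(x_new, x_new) + 1e-4 * np.eye(len(x_new))
    keep = [j for j in range(len(x_new)) if j != d]
    w = np.linalg.solve(K_M[np.ix_(keep, keep)], K_M[keep, d])
    cond_mean = w @ fbar[keep]
    cond_var = K_M[d, d] - K_M[d, keep] @ w
    fbar_new[d] = rng.normal(cond_mean, np.sqrt(max(cond_var, 1e-12)))
    # Accept with probability min(1, l'(new) / l'(old)).
    log_ratio = (log_l_prime(x_parents, x_new, fbar_new, f, log_prior)
                 - log_l_prime(x_parents, x_bar, fbar, f, log_prior))
    if np.log(rng.uniform()) < log_ratio:
        return x_new, fbar_new, True
    return x_bar, fbar, False

rng = np.random.default_rng(4)
xp = rng.normal(size=(80, 1)); xb = rng.uniform(-2, 2, size=(10, 1))
fb = rng.normal(size=10); fv = rng.normal(size=80)
flat_prior = lambda x: 0.0 if np.all(np.abs(x) < 3) else -np.inf
print(update_pseudo_input(0, xp, xb, fb, fv, flat_prior, step=0.1, rng=rng)[2])
```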

Finally, consider the proposal for the latent variables X_i^(d). For each latent variable X_i, the set of latent variable instantiations {X_i^(1), X_i^(2), ..., X_i^(N)} is mutually independent given the remaining variables. We propose each new latent variable value x_i^(d)' in parallel, and accept or reject it based on a Gaussian random walk proposal centered at the current value x_i^(d). We accept the move with probability min{1, h_{X_i}(x_i^(d)')/h_{X_i}(x_i^(d))} where, if X_i is not an exogenous variable in the graph,

    h_{X_i}(x_i^(d)) = exp( −(x_i^(d) − f_i^(d))^2 / (2 v_{ζi}) ) × ∏_{X_c ∈ X_{Ci}} p(f_c^(d) | f̄_c, x̄_c, x_i^(d)) × ∏_{Y_c ∈ Y_{Ci}} p(y_c^(d) | x_{Pc}^(d))

where X_{Ci} is the set of latent children of X_i in the graph, and Y_{Ci} is the corresponding set of observed children.

The conditional p(f_c^(d) | f̄_c, x̄_c, x_i^(d)'), which follows from (3), is a non-linear function of x_i^(d), but crucially it does not depend on any x_i^(·) variable except that of point d. The evaluation of this factor costs O(M^2). As such, sampling all latent values for X_i takes O(NM^2). The case where X_i is an exogenous variable is analogous, given that we also sample the mixture component indicators of such variables.
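The sketch below illustrates this parallel update for one non-exogenous X_i with a single latent child and a block of observed children with linear-Gaussian measurements. All inputs, loadings and noise variances are placeholders; for an exogenous X_i, the first factor of h would instead be its mixture-of-Gaussians density.

```python
# Sketch of the parallel random-walk update for x_i^(1:N): each point d is
# proposed independently and accepted with probability min(1, h(x')/h(x)).
import numpy as np

def se(A, B, b=1.0):
    return np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / b)

def log_h(x_i, f_i, v_zeta_i, xbar_c, fbar_c, f_c, Y, lam0, lam1, v_eps):
    # term 1: exp(-(x_i - f_i)^2 / (2 v_zeta_i)), for a non-exogenous X_i
    out = -0.5 * (x_i - f_i) ** 2 / v_zeta_i
    # term 2: p(f_c^(d) | fbar_c, xbar_c, x_i^(d)) under the sparse GP of child c
    K_M = se(xbar_c, xbar_c) + 1e-4 * np.eye(len(xbar_c))
    K_NM = se(x_i[:, None], xbar_c)
    K_M_inv = np.linalg.inv(K_M)
    mu = K_NM @ K_M_inv @ fbar_c
    v = 1.0 + 1e-4 - np.einsum('nm,mk,nk->n', K_NM, K_M_inv, K_NM)
    out += -0.5 * (np.log(v) + (f_c - mu) ** 2 / v)
    # term 3: observed children Y (N x J), each Y_j = lam0_j + lam1_j * x_i + eps_j
    out += -0.5 * (((Y - lam0 - np.outer(x_i, lam1)) ** 2) / v_eps).sum(axis=1)
    return out                                   # one value per data point d

def update_latents(x_i, step, rng, **kw):
    prop = x_i + rng.normal(0.0, step, size=x_i.shape)
    log_ratio = log_h(prop, **kw) - log_h(x_i, **kw)
    accept = np.log(rng.uniform(size=x_i.shape)) < log_ratio
    return np.where(accept, prop, x_i)

rng = np.random.default_rng(6)
kw = dict(f_i=np.zeros(100), v_zeta_i=1.0,
          xbar_c=rng.uniform(-2, 2, size=(10, 1)), fbar_c=rng.normal(size=10),
          f_c=rng.normal(size=100), Y=rng.normal(size=(100, 3)),
          lam0=np.zeros(3), lam1=np.ones(3), v_eps=1.0)
print(update_latents(rng.normal(size=100), step=0.2, rng=rng, **kw).shape)
```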
Given a sample of An evaluation of the MCMC procedure is done by running 150 points from this model, we set the structural equations and comparing 5 independent chains, each starting from a for Y and Y to have a zero intercept and unit slope for 1 4 different point. Following Lee (2007), we evaluate conver- identifiability purposes. Observed data for Y against Y is 1 4 gence using the EPSR statistic (Gelman and Rubin, 1992), shown in Figure 2(a), which suggests a noisy quadratic re- which compares the variability of a given marginal pos- lationship (plotted in 2(b), but unknown to the model). We terior within each chain and between chains. We calcu- run a GPSEM-LV model with 50 pseudo-inputs. The ex- (d) (d) late this statistic for all latent variables {X1,X2,X3,X4} pected posterior value of each latent pair {X1 ,X2 } for across all 333 data points. d =1,..., 150 is plotted in Figure 2(c). It is clear that we were able to reproduce the original non-linear functional A comparison is done against a variant of the model where relationship given noisy data using a pseudo-inputs model. the measurement model is not sparse: instead, each ob- served variable has all latent variables as parents, and no For comparison, the output of the Gaussian process latent coefficients are fixed. The differences are noticeable and variable model (GPLVM, Lawrence, 2005) with two hid- illustrated in Figure 3. Box-plots of EPSR for the 4 latent den variables is shown in Figure 2(d). GPLVM here as- variables are shown in Figure 4. It is difficult to interpret sumes that the marginal distribution of each latent variable or trust an embedding that is strongly dependent on the ini- is a standard Gaussian, but the measurement model is non- tialization procedure, as it is the case for the unidentifiable parametric. In theory, GPLVM is as flexible as GPSEM- model. As discussed by Palomo et al. (2007), identifiability LV in terms of representing observed joints. However, it might not be a fundamental issue for Bayesian inference, does not learn functional relationships among latent vari- but it is an important practical issue in SEMs. ables, which is often of central interest in SEM applications (Bollen, 1989). Moreover, since no marginal dependence EPSR distribution, consumer data among latent variables is allowed, the model adapts itself to find (unidentifiable) functional relationships between the 3 exogenous latent variables of the true model and the ob- servables, analogous to the case illustrated by Figure 1(b). 2.5 As a result, despite GPLVM being able to depict, as ex- 2 pected, some quadratic relationship (up to a rotation), it is EPSR, 333 points noisier than the one given by GPSEM-LV. 1.5

5.2 MCMC and Identifiability

We now explore the effect of enforcing identifiability constraints on the MCMC procedure. We consider the Consumer dataset, a study with 333 university students in Greece (Bartholomew et al., 2008). The aim of the study was to identify the factors that affect willingness to pay more to consume environmentally friendly products. We selected 16 indicators of environmental beliefs and attitudes, measuring a total of 4 hidden variables.[7] For simplicity, we will call these variables X_1, ..., X_4. The structure among latents is X_1 → X_2, X_1 → X_3, X_2 → X_3, X_2 → X_4. Full details are given by Bartholomew et al. (2008).

[7] There was one latent variable marginally independent of everything else. We eliminated it and its two indicators, as well as the REC latent variable, which had only one indicator.

All observed variables have a single latent parent in the corresponding DAG. As discussed in Section 2.1, the corresponding measurement model is identifiable by fixing the structural equation for one indicator of each variable to have a zero intercept and unit slope (Bartholomew et al., 2008). If the assumptions described in the references of Section 2.1 hold, then the latent functions are also identifiable. We normalized the dataset before running the MCMC inference algorithm.

An evaluation of the MCMC procedure is done by running and comparing 5 independent chains, each starting from a different point. Following Lee (2007), we evaluate convergence using the EPSR statistic (Gelman and Rubin, 1992), which compares the variability of a given marginal posterior within each chain and between chains. We calculate this statistic for all latent variables {X_1, X_2, X_3, X_4} across all 333 data points.

A comparison is done against a variant of the model where the measurement model is not sparse: instead, each observed variable has all latent variables as parents, and no coefficients are fixed. The differences are noticeable and are illustrated in Figure 3. Box-plots of the EPSR statistic for the 4 latent variables are shown in Figure 4. It is difficult to interpret or trust an embedding that is strongly dependent on the initialization procedure, as is the case for the unidentifiable model. As discussed by Palomo et al. (2007), identifiability might not be a fundamental issue for Bayesian inference, but it is an important practical issue in SEMs.

Figure 4: Boxplots for the EPSR distribution across each of the 333 data points of each latent variable (X_1 to X_4), for the Consumer data. Boxes represent the distribution for the non-sparse model. A value less than 1.1 is considered acceptable evidence of convergence (Lee, 2007), but this essentially never happens. For the sparse model, all EPSR statistics were under 1.03.
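For reference, the EPSR statistic can be computed from a set of independent chains as follows; this is the usual Gelman and Rubin (1992) construction applied to one scalar quantity at a time (e.g., a single latent value), and the example chains below are synthetic.

```python
# Sketch of the EPSR (estimated potential scale reduction) statistic used in
# Section 5.2, computed from several independent chains for one scalar quantity.
import numpy as np

def epsr(chains):
    """chains: array of shape (n_chains, n_draws) for a single scalar."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)               # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()         # within-chain variance
    var_plus = (n - 1) / n * W + B / n
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(5)
mixed = rng.normal(0.0, 1.0, size=(5, 2000))                     # well mixed
stuck = rng.normal(np.arange(5)[:, None], 1.0, size=(5, 2000))   # init-dependent
print(round(epsr(mixed), 3), round(epsr(stuck), 3))   # about 1.0 vs. well above 1.1
```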


Figure 2: (a) Plot of observed variables Y_1 and Y_4, generated by adding standard Gaussian noise to two latent variables X_1 and X_2, where X_2 = 4X_1^2 + ζ_2, with ζ_2 also a standard Gaussian; 150 data points were generated. (b) Plot of the corresponding latent variables, which are not recorded in the data. (c) The posterior expected values of the 150 latent variable pairs according to GPSEM-LV. (d) The posterior modes of the 150 pairs according to GPLVM.


Figure 3: An illustration of the behavior of five independent chains for X_2^(10) and X_4^(200) using two models for the Consumer data: the original (sparse) model (Bartholomew et al., 2008), and an (unidentifiable) alternative where each observed variable is an indicator of all latent variables. In the unidentifiable model, there is no clear pattern across the independent chains. Our model is robust to initialization, while the alternative unidentifiable approach cannot be easily interpreted.
5.3 Predictive Verification of the Sparse Model

We evaluate how well the sparse GPSEM-LV model performs compared against two parametric SEMs and GPLVM. The linear structural equation model is the standard SEM, where each latent variable is given by a linear combination of its parents with additive Gaussian noise. Latent variables without parents are given the same mixture of Gaussians model as our GPSEM-LV implementation. The quadratic model includes all quadratic and linear terms, plus first-order interactions, among the parents of any given latent variable. This is perhaps the most common non-linear SEM used in practice (Bollen and Paxton, 1998; Lee, 2007). GPLVM is fit with 50 active points and the rbf kernel with automatic relevance determination (Lawrence, 2005). Each sparse GPSEM model uses 50 pseudo-points.

We performed a 5-fold cross-validation study where the average predictive log-likelihood on the respective test sets is reported. Three datasets are used. The first is the Consumer dataset, described in the previous section.

The second is the Abalone data (Asuncion and Newman, 2007), where we postulate two latent variables, "Size" and "Weight." Size has as indicators the length, diameter and height of each abalone specimen, while Weight has as indicators the four weight variables. We direct the relationship among latent variables as Size → Weight.

The third is the Housing dataset (Asuncion and Newman, 2007; Harrison and Rubinfeld, 1978), which includes indicators about features of suburbs in Boston that are relevant for the housing market. Following the original study (Harrison and Rubinfeld, 1978, Table IV), we postulate three latent variables: "Structural," corresponding to the structure of each residence; "Neighborhood," corresponding to an index of neighborhood attractiveness; and "Accessibility," corresponding to an index of accessibility within Boston.[8] The corresponding 11 non-binary observed variables that are associated with the given latent concepts are used as indicators. The "Neighborhood" concept was refined into two, "Neighborhood I" and "Neighborhood II," due to the fact that three of its original indicators have very similar (and highly skewed) marginal distributions, which were very dissimilar from the others.[9] The structure among latent variables is given by a fully connected network directed according to the order {Accessibility, Structural, Neighborhood II, Neighborhood I}. Harrison and Rubinfeld (1978) provide full details on the meaning of the indicators. We note that it is well known that the Housing dataset poses stability problems to density estimation due to discontinuities in the variable RAD, one of the indicators of accessibility (Friedman and Nachman, 2000). In order to get more stable results, we use a subset of the data (374 points) where RAD < 24.

[8] The analysis by (Harrison and Rubinfeld, 1978, Table IV) also included a fourth latent concept of "Air pollution," which we removed due to the absence of one of its indicators in the electronic data file that is available.

[9] The final set of indicators, using the nomenclature of the UCI repository documentation file, is as follows: "Structural" has as indicators RM and AGE; "Neighborhood I" has as indicators CRIM, ZN and B; "Neighborhood II" has as indicators INDUS, TAX, PTRATIO and LSTAT; "Accessibility" has as indicators DIS and RAD. See (Asuncion and Newman, 2007) for detailed information about these indicators. Following Harrison and Rubinfeld, we log-transformed some of the variables: INDUS, DIS, RAD and TAX.

The need for non-linear SEMs is well illustrated by Figure 5, where fantasy samples of latent variables are generated from the predictive distributions of two models.

Figure 5: Scatterplots of 2,000 fantasy samples taken from the predictive distributions of sparse GPSEM-LV models (left: Weight against Size for Abalone; right: Neighborhood I against Accessibility for Housing). In contrast, GPLVM would generate spherical Gaussians.

We also evaluate how the non-sparse GPSEM-LV behaves compared to the sparse alternative. Notice that while Consumer and Housing each have approximately 300 training points in each cross-validation fold, Abalone has over 3,000 points. For the non-sparse GPSEM, we subsampled all of the Abalone training folds down to 300 samples.

Results are presented in Table 1. Each dataset was chosen to represent a particular type of problem. The data in Consumer is highly linear. In particular, it is important to point out that the GPSEM-LV model is able to behave as a standard structural equation model if necessary, while the quadratic polynomial model shows some overfitting. The Abalone study is known for having clear functional relationships among variables, as also discussed by Friedman and Nachman (2000). In this case, there is a substantial difference between the non-linear models and the linear one, although GPLVM seems suboptimal in this scenario, where observed variables can be easily clustered into groups. Finally, functional relationships among variables in Housing are not as clear (Friedman and Nachman, 2000), with multimodal residuals. GPSEM still shows an advantage, but all SEMs are suboptimal compared to GPLVM. One explanation is that the DAG on which the models rely is not adequate. Structure learning might be necessary to make the most out of nonparametric SEMs.

Although the results suggest that the sparse model behaved better than the non-sparse one (which was true of some cases found by Snelson and Ghahramani, 2006, due to heteroscedasticity effects), such results should be interpreted with care. Abalone had to be subsampled in the non-sparse case. Mixing is harder in the non-sparse model since all datapoints {X_i^(1), ..., X_i^(N)} are dependent. While we believe that with larger sample sizes and denser latent structures the non-sparse model should be the best, large sample sizes are too expensive to process and, in many SEM applications, latent variables have very few parents.

It is also important to emphasize that the wallclock sampling time for the non-sparse model was an order of magnitude larger than for the sparse case with M = 50. The sparse pseudo-inputs model was faster even considering that 3,000 training points were used by the sparse model in the Abalone experiment, against 300 points by the non-sparse alternative.
Table 1: Average predictive log-likelihood in a 5-fold cross-validation setup. The five methods are the GPSEM-LV model with 50 pseudo-inputs (GPS), GPSEM-LV with standard Gaussian process priors (GP), the linear and quadratic structural equation models (LIN and QDR), and the Gaussian process latent variable model (GPL) of Lawrence (2005), a nonparametric factor analysis model. For Abalone, GP uses a subsample of the training data. The p-values given by a paired Wilcoxon signed-rank test, measuring the significance of positive differences between sparse GPSEM-LV and the quadratic model, are 0.03 (for Consumer), 0.34 (Abalone) and 0.09 (Housing).

                          Consumer                              Abalone                                Housing
         GPS      GP      LIN      QDR      GPL   |  GPS    GP     LIN    QDR    GPL   |  GPS      GP       LIN      QDR      GPL
Fold 1  -20.66  -21.17  -20.67  -21.20  -22.11    | -1.96  -2.08  -2.75  -2.00  -3.04  | -13.92  -14.10  -14.46  -14.11  -11.94
Fold 2  -21.03  -21.15  -21.06  -21.08  -22.22    | -1.90  -2.97  -2.52  -1.92  -3.41  | -15.07  -17.70  -16.20  -15.12  -12.98
Fold 3  -20.86  -20.88  -20.84  -20.90  -22.33    | -1.91  -5.50  -2.54  -1.93  -3.65  | -13.66  -15.75  -14.86  -14.69  -12.58
Fold 4  -20.79  -21.09  -20.78  -20.93  -22.03    | -1.77  -2.96  -2.30  -1.80  -3.40  | -13.30  -15.98  -14.05  -13.90  -12.84
Fold 5  -21.26  -21.76  -21.27  -21.75  -22.72    | -3.85  -4.56  -4.67  -3.84  -4.80  | -13.80  -14.46  -14.67  -13.71  -11.87

6 RELATED WORK

Non-linear factor analysis has been studied for decades in the psychometrics literature.[10] A review is provided by Yalcin and Amemiya (2001). However, most of the classic work is based on simple parametric models. A modern approach based on Gaussian processes is the Gaussian process latent variable model of Lawrence (2005). By construction, factor analysis cannot be used in applications where one is interested in learning functions relating latent variables, such as in causal inference. For embedding, factor analysis is easier to use and more robust to model misspecification than SEM analysis. Conversely, it does not benefit from well-specified structures and might be harder to interpret. Bollen (1989) discusses the interplay between factor analysis and SEM. Practical non-linear structural equation models are discussed by Lee (2007), but none of such approaches rely on nonparametric methods. Gaussian process latent structures appear mostly in the context of dynamical systems (e.g., Ko and Fox (2009)). However, the connection is typically among data points only, not among variables within a data point, and on-line filtering is the target application.

[10] Another instance of the "whatever you do, somebody in psychometrics already did it long before" law: http://www.stat.columbia.edu/∼cook/movabletype/archives/2009/01/a longstanding.html

7 CONCLUSION

The goal of graphical modeling is to exploit the structure of real-world problems, but the latent structure is often ignored. We introduced a new nonparametric approach for SEMs by extending a sparse Gaussian process prior into a fully Bayesian procedure. Although a standard MCMC algorithm worked reasonably well, it is possible as future work to study ways of improving mixing times. This can be particularly relevant in extensions to ordinal variables, where the sampling of thresholds will likely make mixing more difficult. Since the bottleneck of the procedure is the sampling of the pseudo-inputs, one might consider a hybrid approach where a subset of the pseudo-inputs is fixed and determined prior to sampling using a cheap heuristic. New ways of deciding pseudo-input locations based on a given measurement model will be required. Evaluation with larger datasets (at least a few hundred variables) remains an open problem. Finally, finding ways of determining the graphical structure is also a promising area of research.

Acknowledgements

We thank Patrick Hoyer and Ed Snelson for several useful discussions, and Irini Moustaki for the consumer data.

References

T. Asparouhov and B. Muthén. Exploratory structural equation modeling. Structural Equation Modeling, 16, 2009.

A. Asuncion and D. J. Newman. UCI machine learning repository, 2007.

D. Bartholomew, F. Steele, I. Moustaki, and J. Galbraith. Analysis of Multivariate Social Science Data. Chapman & Hall, 2008.

K. Bollen. Structural Equations with Latent Variables. John Wiley & Sons, 1989.

K. Bollen and P. Paxton. Interactions of latent variables in structural equation models. Structural Equation Modeling, 5:267-293, 1998.

R. Carroll, D. Ruppert, C. Crainiceanu, T. Tosteson, and M. Karagas. Nonlinear and nonparametric regression and instrumental variables. Journal of the American Statistical Association, 99, 2004.

J. Fan and Y. Truong. Nonparametric regression with errors-in-variables. Annals of Statistics, 21:1900-1925, 1993.

N. Friedman and I. Nachman. Gaussian process networks. Uncertainty in Artificial Intelligence (UAI), 2000.

A. Gelman and D. Rubin. Inference from iterative simulation using multiple sequences. Statistical Science, 7:457-472, 1992.

D. Harrison and D. Rubinfeld. Hedonic prices and the demand for clean air. Journal of Environmental Economics and Management, 5:81-102, 1978.

P. Hoyer, D. Janzing, J. Mooij, J. Peters, and B. Schölkopf. Nonlinear causal discovery with additive noise models. Neural Information Processing Systems (NIPS), 2008a.

P. Hoyer, S. Shimizu, A. Kerminen, and M. Palviainen. Estimation of causal effects using linear non-Gaussian causal models with hidden variables. International Journal of Approximate Reasoning, 49, 2008b.

J. Ko and D. Fox. GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models. Autonomous Robots, 2009.

N. D. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6:1783-1816, 2005.

S.-Y. Lee. Structural Equation Modeling: A Bayesian Approach. Wiley, 2007.

E. Masry. Deconvolving multivariate kernel density estimates from contaminated associated observations. IEEE Transactions on Information Theory, 49:2941-2952, 2003.

J. Palomo, D. Dunson, and K. Bollen. Bayesian structural equation modeling. In S.-Y. Lee (ed.), Handbook of Latent Variable and Related Models, pages 163-188, 2007.

J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.

C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

T. Santner, B. Williams, and W. Notz. The Design and Analysis of Computer Experiments. Springer-Verlag, 2003.

M. Seeger. Low rank updates for the Cholesky decomposition. Technical report, 2004.

E. Segal, D. Pe'er, A. Regev, D. Koller, and N. Friedman. Learning module networks. Journal of Machine Learning Research, 6, 2005.

R. Silva, R. Scheines, C. Glymour, and P. Spirtes. Learning the structure of linear latent variable models. Journal of Machine Learning Research, 7, 2006.

E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. Neural Information Processing Systems (NIPS), 18, 2006.

P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction and Search. Cambridge University Press, 2000.

M. Titsias. Variational learning of inducing variables in sparse Gaussian processes. AISTATS, 2009.

M. Titsias, N. Lawrence, and M. Rattray. Efficient sampling for Gaussian process inference using control variables. Neural Information Processing Systems (NIPS), 2009.

F. Wood, T. Griffiths, and Z. Ghahramani. A non-parametric Bayesian method for inferring hidden causes. Uncertainty in Artificial Intelligence (UAI), 2006.

I. Yalcin and Y. Amemiya. Nonlinear factor analysis as a statistical method. Statistical Science, 16:275-294, 2001.

H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, pages 265-286, 2006.