Statistical Science
2012, Vol. 27, No. 4, 519–537
DOI: 10.1214/12-STS391
© Institute of Mathematical Statistics, 2012

Sparse Nonparametric Graphical Models

John Lafferty, Han Liu and Larry Wasserman

John Lafferty is Professor, Department of Statistics and Department of Computer Science, University of Chicago, 5734 S. University Avenue, Chicago, Illinois 60637, USA (e-mail: [email protected]). Han Liu is Assistant Professor, Department of Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey 08544, USA (e-mail: [email protected]). Larry Wasserman is Professor, Department of Statistics and Machine Learning Department, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA (e-mail: [email protected]).

Abstract. We present some nonparametric methods for graphical modeling. In the discrete case, where the data are binary or drawn from a finite alphabet, Markov random fields are already essentially nonparametric, since the cliques can take only a finite number of values. Continuous data are different. The Gaussian graphical model is the standard parametric model for continuous data, but it makes distributional assumptions that are often unrealistic. We discuss two approaches to building more flexible graphical models. One allows arbitrary graphs and a nonparametric extension of the Gaussian; the other uses kernel density estimation and restricts the graphs to trees and forests. Examples of both methods are presented. We also discuss possible future research directions for nonparametric graphical modeling.

Key words and phrases: Kernel density estimation, Gaussian copula, high-dimensional inference, undirected graphical model, oracle inequality, consistency.

1. INTRODUCTION

This paper presents two methods for constructing nonparametric graphical models for continuous data. In the discrete case, where the data are binary or drawn from a finite alphabet, Markov random fields or log-linear models are already essentially nonparametric, since the cliques can take only a finite number of values. Continuous data are different. The Gaussian graphical model is the standard parametric model for continuous data, but it makes distributional assumptions that are typically unrealistic. Yet few practical alternatives to the Gaussian graphical model exist, particularly for high-dimensional data. We discuss two approaches to building more flexible graphical models that exploit sparsity. These two approaches are at different extremes in the array of choices available. One allows arbitrary graphs, but makes a distributional restriction through the use of copulas; this is a semiparametric extension of the Gaussian. The other approach uses kernel density estimation and restricts the graphs to trees and forests; in this case the model is fully nonparametric, at the expense of structural restrictions. We describe two-step estimation methods for both approaches. We also outline some statistical theory for the methods, and compare them in some examples.

This article is in part a digest of two recent research articles where these methods first appeared, Liu, Lafferty and Wasserman (2009) and Liu et al. (2011). The methods we present here are relatively simple, and many more possibilities remain for nonparametric graphical modeling. But as we hope to demonstrate, a little nonparametricity can go a long way.

2. TWO FAMILIES OF NONPARAMETRIC GRAPHICAL MODELS

The graph of a random vector is a useful way of exploring the underlying distribution. If X = (X1, ..., Xd) is a random vector with distribution P, then the undirected graph G = (V, E) corresponding to P consists of a vertex set V and an edge set E, where V has d elements, one for each variable Xi. The edge (i, j) is excluded from E if and only if Xi is independent of Xj given the other variables X\{i,j} ≡ (Xs : 1 ≤ s ≤ d, s ≠ i, j), written

(2.1) Xi ⊥⊥ Xj | X\{i,j}.
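As a small illustration of this correspondence between graph and distribution (our sketch, not from the paper): in the multivariate Normal case discussed below, the conditional independence in (2.1) holds exactly when the corresponding entry of the precision matrix Ω = Σ⁻¹ is zero, so the edge set can be read off Ω directly. The function name and tolerance here are our own choices.

```python
import numpy as np

def graph_from_precision(omega, tol=1e-8):
    """Return the edge set {(i, j) : i < j, |Omega_ij| > tol}.

    For a Gaussian, a zero in the precision matrix Omega encodes the
    conditional independence Xi _||_ Xj | X\{i,j} of equation (2.1).
    """
    d = omega.shape[0]
    return {(i, j) for i in range(d) for j in range(i + 1, d)
            if abs(omega[i, j]) > tol}

# A chain 0 - 1 - 2: X0 and X2 are conditionally independent given X1,
# so the (0, 2) entry of the precision matrix is zero and that edge is absent.
omega = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.5],
                  [0.0, 0.5, 1.0]])
print(graph_from_precision(omega))  # edges (0, 1) and (1, 2) only
```

For continuous non-Gaussian data, the two families described next replace this parametric step with more flexible estimates while keeping the same graph-reading logic.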
The general form for a (strictly positive) probability density encoded by an undirected graph G is

(2.2) p(x) = (1/Z(f)) exp( Σ_{C ∈ Cliques(G)} f_C(x_C) ),

where the sum is over all cliques, or fully connected subsets of vertices of the graph. In general, this is what we mean by a nonparametric graphical model. It is the graphical model analog of the general nonparametric regression model. Model (2.2) has two main ingredients, the graph G and the functions {f_C}. However, without further assumptions, it is much too general to be practical. The main difficulty in working with such a model is the normalizing constant Z(f), which cannot, in general, be efficiently computed or approximated.

In the spirit of nonparametric estimation, we can seek to impose structure on either the graph or the functions f_C in order to get a flexible and useful family of models. One approach parallels the ideas behind sparse additive models for regression. Specifically, we replace the random variable X = (X1, ..., Xd) by the transformed random variable f(X) = (f1(X1), ..., fd(Xd)), and assume that f(X) is multivariate Gaussian. This results in a nonparametric extension of the Normal that we call the nonparanormal distribution. The nonparanormal depends on the univariate functions {fj}, and a mean μ and covariance matrix Σ, all of which are to be estimated from data. While the resulting family of distributions is much richer than the standard parametric Normal (the paranormal), the independence relations among the variables are still encoded in the precision matrix Ω = Σ⁻¹, as we show below.

The second approach is to force the graphical structure to be a tree or forest, where each pair of vertices is connected by at most one path. Thus, we relax the distributional assumption of normality, but we restrict the allowed family of undirected graphs. The complexity of the model is then regulated by selecting the edges to include, using cross validation.

Figure 1 summarizes the tradeoffs made by these two families of models. The nonparanormal can be thought of as an extension of additive models for regression to graphical modeling. This requires estimating the univariate marginals; in the copula approach, this is done by estimating the functions fj(x) = μj + σj Φ⁻¹(Fj(x)), where Fj is the distribution function for variable Xj. After estimating each fj, we transform to (assumed) jointly Normal via Z = (f1(X1), ..., fd(Xd)) and then apply methods for Gaussian graphical models to estimate the graph. In this approach, the univariate marginals are fully nonparametric, and the sparsity of the model is regulated through the inverse covariance matrix, as for the graphical lasso, or "glasso" (Banerjee, El Ghaoui and d'Aspremont, 2008; Friedman, Hastie and Tibshirani, 2007).¹ The model is estimated in a two-stage procedure; first the functions fj are estimated, and then the inverse covariance matrix is estimated.

¹Throughout the paper we use the term graphical lasso, or glasso, coined by Friedman, Hastie and Tibshirani (2007), to refer to the solution obtained by ℓ1-regularized log-likelihood under the Gaussian graphical model. This estimator goes back at least to Yuan and Lin (2007), and an iterative lasso algorithm for doing the optimization was first proposed by Banerjee, El Ghaoui and d'Aspremont (2008). In our experiments we use the R packages glasso (Friedman, Hastie and Tibshirani, 2007) and huge to implement this algorithm.

                      Nonparanormal                   Forest densities
Univariate marginals  nonparametric                   nonparametric
Bivariate marginals   determined by Gaussian copula   nonparametric
Graph                 unrestricted                    acyclic

FIG. 1. Comparison of properties of the nonparanormal and forest-structured densities.

The high-level relationship between linear regression models, Gaussian graphical models and their extensions to additive and high-dimensional models is summarized in Figure 2.

Assumptions    Dimension   Regression              Graphical models
Parametric     low         linear model            multivariate Normal
               high        lasso                   graphical lasso
Nonparametric  low         additive model          nonparanormal
               high        sparse additive model   sparse nonparanormal

FIG. 2. Comparison of regression and graphical models. The nonparanormal extends additive models to the graphical model setting. Regularizing the inverse covariance leads to an extension to high dimensions, which parallels sparse additive models for regression.

In the forest graph approach, we restrict the graph to be acyclic, and estimate the bivariate marginals p(xi, xj) nonparametrically. In light of equation (4.1), this yields the full nonparametric family of graphical models having acyclic graphs. Here again, the estimation procedure is two-stage; first the marginals are estimated, and then the graph is estimated. Sparsity is regulated through the edges (i, j) that are included in the forest.

Clearly these are just two tractable families within the very large space of possible nonparametric graphical models specified by equation (2.2). Many interesting research possibilities remain for novel nonparametric graphical models that make different assumptions; we discuss some possibilities in a concluding section.

We now discuss details of these two model families, beginning with the nonparanormal.

3. THE NONPARANORMAL

We say that a random vector X = (X1, ..., Xd)^T has a nonparanormal distribution and write

X ∼ NPN(μ, Σ, f)

in case there exist functions {fj}_{j=1}^d such that f(X) ∼ N(μ, Σ), where f(X) = (f1(X1), ..., fd(Xd)). When the fj's are monotone and differentiable, the joint probability density function of X is given by

(3.1) p(x) = (2π)^{−d/2} |Σ|^{−1/2} exp{ −(1/2)(f(x) − μ)^T Σ⁻¹ (f(x) − μ) } ∏_{j=1}^d |f′j(xj)|.

The form of the density in (3.1) implies that the conditional independence graph of the nonparanormal is encoded in Ω = Σ⁻¹, as for the parametric Normal, since the density factors with respect to the graph of Ω, and therefore obeys the global Markov property of the graph.

In fact, this is true for any choice of identification restrictions; thus it is not necessary to estimate μ or σ to estimate the graph, as the following result shows.

LEMMA 3.1. Define

(3.4) hj(x) = Φ⁻¹(Fj(x)),

and let Λ be the covariance matrix of h(X).
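The quantity in Lemma 3.1 suggests a simple plug-in estimate: transform each coordinate by hj(x) = Φ⁻¹(Fj(x)) using a rescaled empirical distribution function, then take the sample covariance of the transformed data. The sketch below (our illustration; the function name, the n/(n+1) rescaling, and stopping short of the glasso step are our choices, not the paper's implementation) computes that covariance estimate, which in practice would then be handed to a graphical lasso solver to estimate the graph.

```python
import numpy as np
from statistics import NormalDist

def npn_covariance(X):
    """X: (n, d) data matrix. Estimate the covariance of h(X), where
    h_j(x) = Phi^{-1}(F_j(x)) as in (3.4), with F_j replaced by the
    empirical CDF rescaled by n/(n+1) to keep it strictly inside (0, 1)."""
    n, d = X.shape
    inv_cdf = NormalDist().inv_cdf  # Phi^{-1}
    H = np.empty_like(X, dtype=float)
    for j in range(d):
        ranks = np.argsort(np.argsort(X[:, j])) + 1  # ranks 1..n
        F = ranks / (n + 1.0)                        # rescaled empirical CDF
        H[:, j] = [inv_cdf(u) for u in F]
    return np.cov(H, rowvar=False)

# A lognormal marginal is handled automatically: the rank transform is
# invariant to the monotone distortion exp(.) of the first coordinate.
rng = np.random.default_rng(0)
Z = rng.standard_normal((500, 2))
X = np.column_stack([np.exp(Z[:, 0]), Z[:, 1]])
Lam = npn_covariance(X)
print(Lam.shape)  # (2, 2)
```

Because only ranks enter the transform, the estimate is unchanged by any monotone distortion of the marginals, which is exactly why μ and σ need not be estimated to recover the graph.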
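The graph step of the forest approach from Section 2 can also be sketched compactly. Given estimated pairwise edge weights (for instance, estimated mutual informations between pairs of variables; how the weights are produced is left abstract here), a maximum-weight spanning forest can be grown greedily while rejecting any edge that would create a cycle. This is our illustration under those assumptions, not the paper's procedure; in particular, the `min_weight` cutoff stands in for the cross-validated edge selection the text describes.

```python
def max_weight_forest(d, weights, min_weight=0.0):
    """weights: dict {(i, j): w} of pairwise edge weights on d vertices.
    Returns the edges of a maximum-weight spanning forest, keeping only
    edges with weight above min_weight (a stand-in for cross validation)."""
    parent = list(range(d))  # union-find over vertices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    forest = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        if w <= min_weight:
            break  # remaining edges are too weak to include
        ri, rj = find(i), find(j)
        if ri != rj:  # adding the edge keeps the graph acyclic
            parent[ri] = rj
            forest.append((i, j))
    return forest

# Four variables; the weak (2, 3) edge falls below the cutoff, so vertex 3
# is left isolated and the result is a forest rather than a spanning tree.
w = {(0, 1): 0.9, (1, 2): 0.7, (0, 2): 0.6, (2, 3): 0.05}
print(max_weight_forest(4, w, min_weight=0.1))  # [(0, 1), (1, 2)]
```

Restricting to acyclic graphs is what makes this greedy step exact: on a forest, the joint density factors into univariate and bivariate marginals, so edge weights can be scored pairwise.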