Statistical Science 2012, Vol. 27, No. 4, 519–537
DOI: 10.1214/12-STS391
© Institute of Mathematical Statistics, 2012

Sparse Nonparametric Graphical Models

John Lafferty, Han Liu and Larry Wasserman

Abstract. We present some nonparametric methods for graphical modeling. In the discrete case, where the data are binary or drawn from a finite alphabet, Markov random fields are already essentially nonparametric, since the cliques can take only a finite number of values. Continuous data are different. The Gaussian graphical model is the standard parametric model for continuous data, but it makes distributional assumptions that are often unrealistic. We discuss two approaches to building more flexible graphical models. One allows arbitrary graphs and a nonparametric extension of the Gaussian; the other uses kernel density estimation and restricts the graphs to trees and forests. Examples of both methods are presented. We also discuss possible future research directions for nonparametric graphical modeling.

Key words and phrases: Kernel density estimation, Gaussian copula, high-dimensional inference, undirected graphical model, oracle inequality, consistency.

John Lafferty is Professor, Department of Statistics and Department of Computer Science, University of Chicago, 5734 S. University Avenue, Chicago, Illinois 60637, USA (e-mail: [email protected]). Han Liu is Assistant Professor, Department of Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey 08544, USA (e-mail: [email protected]). Larry Wasserman is Professor, Department of Statistics and Machine Learning Department, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA (e-mail: [email protected]).

1. INTRODUCTION

This paper presents two methods for constructing nonparametric graphical models for continuous data. In the discrete case, where the data are binary or drawn from a finite alphabet, Markov random fields or log-linear models are already essentially nonparametric, since the cliques can take only a finite number of values. Continuous data are different. The Gaussian graphical model is the standard parametric model for continuous data, but it makes distributional assumptions that are typically unrealistic. Yet few practical alternatives to the Gaussian graphical model exist, particularly for high-dimensional data.

We discuss two approaches to building more flexible graphical models that exploit sparsity. These two approaches are at different extremes in the array of choices available. One allows arbitrary graphs, but makes a distributional restriction through the use of copulas; this is a semiparametric extension of the Gaussian. The other approach uses kernel density estimation and restricts the graphs to trees and forests; in this case the model is fully nonparametric, at the expense of structural restrictions. We describe two-step estimation methods for both approaches. We also outline some statistical theory for the methods, and compare them in some examples.

This article is in part a digest of two recent research articles where these methods first appeared, Liu, Lafferty and Wasserman (2009) and Liu et al. (2011). The methods we present here are relatively simple, and many more possibilities remain for nonparametric graphical modeling. But as we hope to demonstrate, a little nonparametricity can go a long way.

2. TWO FAMILIES OF NONPARAMETRIC GRAPHICAL MODELS

The graph of a random vector is a useful way of exploring the underlying distribution. If X = (X_1, ..., X_d) is a random vector with distribution P, then the undirected graph G = (V, E) corresponding to P consists of a vertex set V and an edge set E, where V has d elements, one for each variable X_i. The edge (i, j) is excluded from E if and only if X_i is independent of X_j given the other variables X_{\{i,j\}} ≡ (X_s : 1 ≤ s ≤ d, s ≠ i, j), written

(2.1)   X_i ⊥ X_j | X_{\{i,j\}}.


                        Nonparanormal                    Forest densities
Univariate marginals    nonparametric                    nonparametric
Bivariate marginals     determined by Gaussian copula    nonparametric
Graph                   unrestricted                     acyclic

FIG. 1. Comparison of properties of the nonparanormal and forest-structured densities.

The general form for a (strictly positive) probability density encoded by an undirected graph G is

(2.2)   p(x) = \frac{1}{Z(f)} \exp\Bigl( \sum_{C \in \mathrm{Cliques}(G)} f_C(x_C) \Bigr),

where the sum is over all cliques, or fully connected subsets of vertices of the graph. In general, this is what we mean by a nonparametric graphical model. It is the graphical model analog of the general nonparametric regression model. Model (2.2) has two main ingredients, the graph G and the functions {f_C}. However, without further assumptions, it is much too general to be practical. The main difficulty in working with such a model is the normalizing constant Z(f), which cannot, in general, be efficiently computed or approximated.

In the spirit of nonparametric estimation, we can seek to impose structure on either the graph or the functions f_C in order to get a flexible and useful family of models. One approach parallels the ideas behind sparse additive models for regression. Specifically, we replace the random variable X = (X_1, ..., X_d) by the transformed random variable f(X) = (f_1(X_1), ..., f_d(X_d)), and assume that f(X) is multivariate Gaussian. This results in a nonparametric extension of the Normal that we call the nonparanormal distribution. The nonparanormal depends on the univariate functions {f_j}, and a mean μ and covariance matrix Σ, all of which are to be estimated from data. While the resulting family of distributions is much richer than the standard parametric Normal (the paranormal), the independence relations among the variables are still encoded in the precision matrix Ω = Σ^{-1}, as we show below.

The second approach is to force the graphical structure to be a tree or forest, where each pair of vertices is connected by at most one path. Thus, we relax the distributional assumption of normality, but we restrict the allowed family of undirected graphs. The complexity of the model is then regulated by selecting the edges to include, using cross validation.

Figure 1 summarizes the tradeoffs made by these two families of models. The nonparanormal can be thought of as an extension of additive models for regression to graphical modeling. This requires estimating the univariate marginals; in the copula approach, this is done by estimating the functions f_j(x) = μ_j + σ_j Φ^{-1}(F_j(x)), where F_j is the distribution function for variable X_j. After estimating each f_j, we transform to (assumed) jointly Normal via Z = (f_1(X_1), ..., f_d(X_d)) and then apply methods for Gaussian graphical models to estimate the graph. In this approach, the univariate marginals are fully nonparametric, and the sparsity of the model is regulated through the inverse covariance matrix, as for the graphical lasso, or "glasso" (Banerjee, El Ghaoui and d'Aspremont, 2008; Friedman, Hastie and Tibshirani, 2007).^1 The model is estimated in a two-stage procedure: first the functions f_j are estimated, and then the inverse covariance matrix Ω is estimated. The high-level relationship between linear regression models, Gaussian graphical models and their extensions to additive and high-dimensional models is summarized in Figure 2.

^1 Throughout the paper we use the term graphical lasso, or glasso, coined by Friedman, Hastie and Tibshirani (2007), to refer to the solution obtained by ℓ1-regularized log-likelihood under the Gaussian graphical model. This estimator goes back at least to Yuan and Lin (2007), and an iterative lasso algorithm for doing the optimization was first proposed by Banerjee, El Ghaoui and d'Aspremont (2008). In our experiments we use the R packages glasso (Friedman, Hastie and Tibshirani, 2007) and huge to implement this algorithm.

In the forest graph approach, we restrict the graph to be acyclic, and estimate the bivariate marginals p(x_i, x_j) nonparametrically. In light of equation (4.1), this yields the full nonparametric family of graphical models having acyclic graphs. Here again, the estimation procedure is two-stage: first the marginals are estimated, and then the graph is estimated. Sparsity is regulated through the edges (i, j) that are included in the forest.

Clearly these are just two tractable families within the very large space of possible nonparametric graphical models specified by equation (2.2). Many interesting research possibilities remain for novel nonparametric graphical models that make different assumptions; we discuss some possibilities in a concluding section.
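To make the clique form (2.2) concrete, it may help to recall how the Gaussian graphical model itself fits this template. The following display is a standard identity, not taken from the paper; Ω = Σ^{-1} is the precision matrix, and Z(f) absorbs all constants (including the term in μᵀΩμ):

```latex
% Multivariate Normal as a special case of (2.2):
% quadratic vertex potentials, product edge potentials.
\log p(x)
  = \underbrace{\sum_{j=1}^{d} \Bigl( -\tfrac{1}{2}\,\Omega_{jj}\, x_j^2
      + (\Omega\mu)_j\, x_j \Bigr)}_{\text{vertex terms } f_j(x_j)}
    \;\underbrace{-\,\sum_{j<k} \Omega_{jk}\, x_j x_k}_{\text{edge terms } f_{jk}(x_j, x_k)}
    \;-\; \log Z(f).
```

The edge potential f_{jk} vanishes exactly when Ω_{jk} = 0, which is the sense in which the graph of a Gaussian is encoded in the precision matrix; the nonparanormal keeps this structure after transforming each coordinate.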

Assumptions     Dimension   Regression              Graphical models
Parametric      low         linear model            multivariate Normal
                high        lasso                   graphical lasso
Nonparametric   low         additive model          nonparanormal
                high        sparse additive model   sparse nonparanormal

FIG. 2. Comparison of regression and graphical models. The nonparanormal extends additive models to the graphical model setting. Regularizing the inverse covariance leads to an extension to high dimensions, which parallels sparse additive models for regression.

We now discuss details of these two model families, beginning with the nonparanormal.

3. THE NONPARANORMAL

We say that a random vector X = (X_1, ..., X_d)^T has a nonparanormal distribution, and write X ~ NPN(μ, Σ, f), in case there exist functions {f_j}_{j=1}^d such that Z ≡ f(X) ~ N(μ, Σ), where f(X) = (f_1(X_1), ..., f_d(X_d)). When the f_j's are monotone and differentiable, the joint probability density function of X is given by

(3.1)   p_X(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\Bigl( -\frac{1}{2} (f(x) - \mu)^T \Sigma^{-1} (f(x) - \mu) \Bigr) \prod_{j=1}^{d} |f_j'(x_j)|,

where the product term is a Jacobian.

Note that the density in (3.1) is not identifiable: we could scale each function by a constant, and scale the diagonal of Σ in the same way, and not change the density. To make the family identifiable we demand that f_j preserves marginal means and variances:

(3.2)   μ_j = E(Z_j) = E(X_j)   and   σ_j^2 ≡ Σ_{jj} = Var(Z_j) = Var(X_j).

These conditions depend only on diag(Σ), not on the full covariance matrix.

Now, let F_j(x) denote the marginal distribution function of X_j. Since the component f_j(X_j) is Gaussian, we have that

F_j(x) = P(X_j ≤ x) = P(Z_j ≤ f_j(x)) = \Phi\Bigl( \frac{f_j(x) - μ_j}{σ_j} \Bigr),

which implies that

(3.3)   f_j(x) = μ_j + σ_j Φ^{-1}(F_j(x)).

The form of the density in (3.1) implies that the conditional independence graph of the nonparanormal is encoded in Ω = Σ^{-1}, as for the parametric Normal, since the density factors with respect to the graph of Ω, and therefore obeys the global Markov property of the graph.

In fact, this is true for any choice of identification restrictions; thus it is not necessary to estimate μ or σ to estimate the graph, as the following result shows.

LEMMA 3.1. Define

(3.4)   h_j(x) = Φ^{-1}(F_j(x)),

and let Λ be the covariance matrix of h(X). Then X_j ⊥ X_k | X_{\{j,k\}} if and only if Λ^{-1}_{jk} = 0.

PROOF. We can rewrite the covariance matrix as

Σ_{jk} = Cov(Z_j, Z_k) = σ_j σ_k Cov(h_j(X_j), h_k(X_k)).

Hence Σ = DΛD and Σ^{-1} = D^{-1} Λ^{-1} D^{-1}, where D is the diagonal matrix with diag(D) = σ. The zero pattern of Λ^{-1} is therefore identical to the zero pattern of Σ^{-1}. □

Figure 3 shows three examples of 2-dimensional nonparanormal densities. The component functions are taken from three different families of monotonic functions: one using power transforms, one using logistic transforms and another using sinusoids,

f_α(x) = sign(x) |x|^α,
g_α(x) = x + \frac{1}{1 + \exp(-α(x - ⌊x⌋ - 1/2))},
h_α(x) = x + \frac{\sin(αx)}{α}.

The covariance in each case is Σ = (1 0.5; 0.5 1), and the mean is μ = (0, 0). It can be seen how the concavity and number of modes of the density can change with different nonlinearities. Clearly the nonparanormal family is much richer than the Normal family.

FIG. 3. Densities of three 2-dimensional nonparanormals. The left plots have component functions of the form f_α(x) = sign(x)|x|^α, with α_1 = 0.9 and α_2 = 0.8. The center plots have component functions of the form g_α(x) = x + 1/(1 + exp(−α(x − ⌊x⌋ − 1/2))), with α_1 = 10 and α_2 = 5, where x − ⌊x⌋ is the fractional part. The right plots have component functions of the form h_α(x) = x + sin(αx)/α, with α_1 = 5 and α_2 = 10. In each case μ = (0, 0) and Σ = (1 0.5; 0.5 1).
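The following minimal sketch (our illustration, not the authors' code) draws samples from a 2-dimensional nonparanormal with the sinusoidal components h_α used in the right-hand plots of Figure 3: draw Z ~ N(μ, Σ) and set X_j = h_{α_j}^{-1}(Z_j), inverting the monotone transform numerically.

```python
import numpy as np
from scipy.optimize import brentq

def h(x, alpha):
    # monotone component function h_alpha(x) = x + sin(alpha x)/alpha
    return x + np.sin(alpha * x) / alpha

def h_inv(z, alpha):
    # h'(x) = 1 + cos(alpha x) >= 0, so h is monotone and invertible by bisection;
    # |h(x) - x| <= 1/alpha, so the root lies well inside [z - 2, z + 2]
    return brentq(lambda x: h(x, alpha) - z, z - 2.0, z + 2.0)

rng = np.random.default_rng(0)
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
alphas = (5.0, 10.0)                       # alpha_1 = 5, alpha_2 = 10, as in Figure 3
Z = rng.multivariate_normal(mu, Sigma, size=1000)
X = np.array([[h_inv(z, a) for z, a in zip(row, alphas)] for row in Z])
```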

The assumption that f(X) = (f_1(X_1), ..., f_d(X_d)) is Normal leads to a semiparametric model where only one-dimensional functions need to be estimated. But the monotonicity of the functions f_j, which map onto ℝ, enables computational tractability of the nonparanormal. For more general functions f, the normalizing constant for the density

p_X(x) ∝ \exp\Bigl( -\frac{1}{2} (f(x) - \mu)^T \Sigma^{-1} (f(x) - \mu) \Bigr)

cannot be computed in closed form.

3.1 Connection to Copulæ

If F_j is the distribution of X_j, then U_j = F_j(X_j) is uniformly distributed on (0, 1). Let C denote the joint distribution function of U = (U_1, ..., U_d), and let F denote the distribution function of X. Then we have that

(3.5)   F(x_1, ..., x_d) = P(X_1 ≤ x_1, ..., X_d ≤ x_d)
(3.6)   = P(F_1(X_1) ≤ F_1(x_1), ..., F_d(X_d) ≤ F_d(x_d))
(3.7)   = P(U_1 ≤ F_1(x_1), ..., U_d ≤ F_d(x_d))
(3.8)   = C(F_1(x_1), ..., F_d(x_d)).

This is known as Sklar's theorem (Sklar, 1959), and C is called a copula. If c is the density function of C, then

(3.9)   p(x_1, ..., x_d) = c(F_1(x_1), ..., F_d(x_d)) \prod_{j=1}^{d} p(x_j),

where p(x_j) is the marginal density of X_j. For the nonparanormal we have

(3.10)   F(x_1, ..., x_d) = Φ_{μ,Σ}( Φ^{-1}(F_1(x_1)), ..., Φ^{-1}(F_d(x_d)) ),

where Φ_{μ,Σ} is the multivariate Gaussian cdf and Φ is the univariate standard Gaussian cdf.

The Gaussian copula is usually expressed in terms of the correlation matrix, which is given by R = diag(σ)^{-1} Σ diag(σ)^{-1}. Note that the univariate marginal density for a Normal can be written as p(x_j) = φ(u_j)/σ_j, where u_j = (x_j − μ_j)/σ_j. The multivariate Normal density can thus be expressed as

(3.11)   p_{μ,Σ}(x_1, ..., x_d) = \frac{1}{(2\pi)^{d/2} |R|^{1/2} \prod_{j=1}^{d} σ_j} \exp\Bigl( -\frac{1}{2} u^T R^{-1} u \Bigr)
(3.12)   = \frac{1}{|R|^{1/2}} \exp\Bigl( -\frac{1}{2} u^T (R^{-1} - I) u \Bigr) \prod_{j=1}^{d} \frac{φ(u_j)}{σ_j}.

Since the distribution F_j of the jth variable satisfies F_j(x_j) = Φ((x_j − μ_j)/σ_j) = Φ(u_j), we have that (x_j − μ_j)/σ_j = Φ^{-1}(F_j(x_j)). The Gaussian copula density is thus

(3.13)   c(F_1(x_1), ..., F_d(x_d)) = \frac{1}{|R|^{1/2}} \exp\Bigl( -\frac{1}{2} Φ^{-1}(F(x))^T (R^{-1} - I) Φ^{-1}(F(x)) \Bigr),

where Φ^{-1}(F(x)) = (Φ^{-1}(F_1(x_1)), ..., Φ^{-1}(F_d(x_d))). This is seen to be equivalent to (3.1) using the chain rule and the identity

(3.14)   (Φ^{-1})'(η) = \frac{1}{φ(Φ^{-1}(η))}.

3.2 Estimation

Let X^{(1)}, ..., X^{(n)} be a sample of size n, where X^{(i)} = (X_1^{(i)}, ..., X_d^{(i)})^T ∈ ℝ^d. We design a two-step estimation procedure where first the functions f_j are estimated, and then the inverse covariance matrix Ω is estimated, after transforming to approximately Normal.

In light of (3.4) we define

ĥ_j(x) = Φ^{-1}(F̂_j(x)),

where F̂_j is an estimator of F_j. A natural candidate for F̂_j is the marginal empirical distribution function

F̂_j(t) ≡ \frac{1}{n} \sum_{i=1}^{n} 1\{X_j^{(i)} ≤ t\}.

However, in this case ĥ_j(x) blows up at the largest and smallest values of X_j^{(i)}. For the high-dimensional setting where n is small relative to d, an attractive alternative is to use a truncated or Winsorized^2 estimator,

(3.15)   F̃_j(x) = δ_n            if F̂_j(x) < δ_n,
                  F̂_j(x)         if δ_n ≤ F̂_j(x) ≤ 1 − δ_n,
                  1 − δ_n         if F̂_j(x) > 1 − δ_n,

where δ_n is a truncation parameter. There is a bias–variance tradeoff in choosing δ_n; increasing δ_n increases the bias while it decreases the variance.

^2 After Charles P. Winsor, the statistician whom John Tukey credited with his conversion from topology to statistics (Mallows, 1990).

Given this estimate of the distribution of variable X_j, we then estimate the transformation function f_j by

(3.16)   f̃_j(x) ≡ μ̂_j + σ̂_j h̃_j(x),

where h̃_j(x) = Φ^{-1}(F̃_j(x)), and μ̂_j and σ̂_j are the sample mean and standard deviation:

μ̂_j ≡ \frac{1}{n} \sum_{i=1}^{n} X_j^{(i)}   and   σ̂_j = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (X_j^{(i)} − μ̂_j)^2 }.

Now, let S_n(f̃) be the sample covariance matrix of f̃(X^{(1)}), ..., f̃(X^{(n)}); that is,

(3.17)   S_n(f̃) ≡ \frac{1}{n} \sum_{i=1}^{n} \bigl( f̃(X^{(i)}) − μ_n(f̃) \bigr) \bigl( f̃(X^{(i)}) − μ_n(f̃) \bigr)^T,   μ_n(f̃) ≡ \frac{1}{n} \sum_{i=1}^{n} f̃(X^{(i)}).

We then estimate Ω using S_n(f̃). For instance, the maximum likelihood estimator is Ω̂_n^{MLE} = S_n(f̃)^{-1}. The ℓ1-regularized estimator is

(3.18)   Ω̂_n = \arg\min_{Ω} \bigl\{ \mathrm{tr}(Ω S_n(f̃)) − \log |Ω| + λ ‖Ω‖_1 \bigr\},

where λ is a regularization parameter and ‖Ω‖_1 = \sum_{j=1}^{d} \sum_{k=1}^{d} |Ω_{jk}|. The estimated graph is then Ê_n = {(j, k) : Ω̂_{jk} ≠ 0}.

Thus we use a two-step procedure to estimate the graph:

(1) Replace the observations, for each variable, by their respective Normal scores, subject to a Winsorized truncation.
(2) Apply the graphical lasso to the transformed data to estimate the undirected graph.

The first step is noniterative and computationally efficient. The truncation parameter δ_n is chosen to be

(3.19)   δ_n = \frac{1}{4 n^{1/4} \sqrt{\pi \log n}}

and does not need to be tuned. As will be shown in Theorem 3.1, such a choice makes the nonparanormal amenable to theoretical analysis.

3.3 Statistical Properties of S_n(f̃)

The main technical result is an analysis of the covariance of the Winsorized estimator above. In particular, we show that under appropriate conditions,

\max_{j,k} | S_n(f̃)_{jk} − S_n(f)_{jk} | = O_P\Bigl( \sqrt{ \frac{\log d + \log^2 n}{n^{1/2}} } \Bigr),

where S_n(f̃)_{jk} denotes the (j, k) entry of the matrix S_n(f̃). This result allows us to leverage the significant body of theory on the graphical lasso (Rothman et al., 2008; Ravikumar et al., 2009), which we apply in step two.

THEOREM 3.1. Suppose that d = n^ξ, and let f̃ be the Winsorized estimator defined in (3.16) with δ_n = \frac{1}{4 n^{1/4} \sqrt{\pi \log n}}. Define

C(M, ξ) ≡ \frac{48}{\sqrt{\pi ξ}} (\sqrt{2} M − 1)(M + 2)

for M, ξ > 0. Then for any ε ≥ C(M, ξ) \sqrt{ \frac{\log d + \log^2 n}{n^{1/2}} } and sufficiently large n, we have

P\Bigl( \max_{j,k} | S_n(f̃)_{jk} − S_n(f)_{jk} | > ε \Bigr) ≤ \frac{c_1 d}{(n ε^2)^{2ξ}} + \frac{c_2 d}{n^{Mξ−1}} + c_3 \exp\Bigl( − \frac{c_4 n^{1/2} ε^2}{\log d + \log^2 n} \Bigr),

where c_1, c_2, c_3, c_4 are positive constants.

The proof of this result involves a detailed Gaussian tail analysis, and is given in Liu, Lafferty and Wasserman (2009).

Using Theorem 3.1 and the results of Rothman et al. (2008), it can then be shown that the precision matrix is estimated at the following rates in the Frobenius norm and the ℓ2-operator norm:

‖Ω̂_n − Ω_0‖_F = O_P\Bigl( \sqrt{ \frac{(s + d) \log d + \log^2 n}{n^{1/2}} } \Bigr)

and

‖Ω̂_n − Ω_0‖_2 = O_P\Bigl( \sqrt{ \frac{s \log d + \log^2 n}{n^{1/2}} } \Bigr),

where

s ≡ \mathrm{Card}\bigl\{ (i, j) ∈ \{1, ..., d\} × \{1, ..., d\} \mid Ω_0(i, j) ≠ 0, i ≠ j \bigr\}

is the number of nonzero off-diagonal elements of the true precision matrix.

Using the results of Ravikumar et al. (2009), it can also be shown, under appropriate conditions, that the sparsity pattern of the precision matrix is estimated accurately with high probability. In particular, the nonparanormal estimator Ω̂_n satisfies

P( G(Ω̂_n, Ω_0) ) ≥ 1 − o(1),

where G(Ω̂_n, Ω_0) is the event

\bigl\{ \mathrm{sign}(Ω̂_n(j, k)) = \mathrm{sign}(Ω_0(j, k)), ∀ j, k ∈ \{1, ..., d\} \bigr\}.

We refer to Liu, Lafferty and Wasserman (2009) for the details of the conditions and proofs. These O_P(n^{−1/4}) rates are slower than the O_P(n^{−1/2}) rates obtainable for the graphical lasso. However, in more recent work (Liu et al., 2012) we use estimators based on Spearman's rho and Kendall's tau statistics to obtain the parametric rate.

4. FOREST DENSITY ESTIMATION

We now describe a very different, but equally flexible and useful approach. Rather than assuming a transformation to normality and an arbitrary undirected graph, we restrict the graph to be a tree or forest, but allow arbitrary nonparametric distributions.

Let p*(x) be a probability density with respect to Lebesgue measure μ(·) on ℝ^d, and let X^{(1)}, ..., X^{(n)} be n independent identically distributed ℝ^d-valued data vectors sampled from p*(x), where X^{(i)} = (X_1^{(i)}, ..., X_d^{(i)}). Let X_j denote the range of X_j, and let X = X_1 × ··· × X_d.

A graph is a forest if it is acyclic. If F is a d-node undirected forest with vertex set V_F = {1, ..., d} and edge set E_F ⊂ {1, ..., d} × {1, ..., d}, the number of

edges satisfies |E_F| ≤ d − 1. We say that a probability density function p(x) is supported by a forest F if it can be written as

(4.1)   p_F(x) = \prod_{(i,j) \in E_F} \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)} \prod_{k \in V_F} p(x_k),

where each p(x_i, x_j) is a bivariate density and each p(x_k) is a univariate density. For a given forest F, the best forest-structured approximation to p* (in Kullback–Leibler divergence) uses the true bivariate and univariate marginals, and the optimal forest maximizes the total mutual information over its edges,

(4.8)   F^* ∈ \arg\max_{F} \sum_{(i,j) \in E_F} I(X_i; X_j),

where I(X_i; X_j) = \int p^*(x_i, x_j) \log \frac{p^*(x_i, x_j)}{p^*(x_i)\, p^*(x_j)} \, dx_i \, dx_j is the mutual information between X_i and X_j. This is the classical Chow–Liu criterion, and the maximizing tree can be found with a maximum weight spanning tree algorithm using the mutual informations as edge weights.

Of course, the above procedure is not practical, since the true density p*(x) is unknown. Suppose the data are divided into two sets D_1 and D_2 of sizes n_1 and n_2; D_1 is used to estimate the densities, and D_2 is held out to select the forest. We replace the population mutual information I(X_i; X_j) in (4.8) by a plug-in estimate Î_n(X_i; X_j), defined as

(4.12)   Î_n(X_i; X_j) = \int_{X_i × X_j} \hat p_n(x_i, x_j) \log \frac{\hat p_n(x_i, x_j)}{\hat p_n(x_i)\, \hat p_n(x_j)} \, dx_i \, dx_j,

where \hat p_n(x_i, x_j) and \hat p_n(x_i) are kernel density estimates. On D_1, the bivariate kernel density estimate is

(4.13)   \hat p_{n_1}(x_i, x_j) = \frac{1}{n_1} \sum_{s \in D_1} \frac{1}{h_2^2} K\Bigl( \frac{X_i^{(s)} − x_i}{h_2} \Bigr) K\Bigl( \frac{X_j^{(s)} − x_j}{h_2} \Bigr),

with K a kernel and h_2 > 0 as the bandwidth parameter. The univariate kernel density estimate \hat p_{n_1}(x_k) for X_k is

(4.14)   \hat p_{n_1}(x_k) = \frac{1}{n_1} \sum_{s \in D_1} \frac{1}{h_1} K\Bigl( \frac{X_k^{(s)} − x_k}{h_1} \Bigr),

where h_1 > 0 is the univariate bandwidth.

We assume that the data lie in a d-dimensional unit cube X = [0, 1]^d. To calculate the empirical mutual information Î_{n_1}(X_i; X_j), we need to numerically evaluate a two-dimensional integral. To do so, we calculate the kernel density estimates on a grid of points. We choose m evaluation points on each dimension, x_{1i} < x_{2i} < ··· < x_{mi} for the ith variable, and approximate the integral by the corresponding sum over this grid.

The Chow–Liu algorithm then greedily builds a tree from the matrix M̂_{n_1} of empirical mutual informations Î_{n_1}(X_i; X_j):

Initialize E^{(0)} ← ∅.
For k = 1, ..., d − 1:
  (1) Set (i^{(k)}, j^{(k)}) ← arg max_{(i,j)} M̂_{n_1}(i, j) such that E^{(k−1)} ∪ {(i^{(k)}, j^{(k)})} does not contain a cycle;
  (2) E^{(k)} ← E^{(k−1)} ∪ {(i^{(k)}, j^{(k)})}.
Output: tree F̂^{(d−1)} with edge set E^{(d−1)}.
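A minimal sketch of this plug-in step (not the authors' implementation): it estimates each pairwise mutual information (4.12) on a grid and finds the maximum weight spanning tree by negating the weights, since scipy provides only a minimum spanning tree. Note that scipy's gaussian_kde picks its own bandwidth (Scott's rule) and merely stands in for the fixed-bandwidth product kernels in (4.13) and (4.14):

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.sparse.csgraph import minimum_spanning_tree

def mutual_information(xi, xj, m=50):
    # evaluate univariate and bivariate KDEs on an m x m grid and sum, approximating (4.12)
    gi = np.linspace(xi.min(), xi.max(), m)
    gj = np.linspace(xj.min(), xj.max(), m)
    pi = gaussian_kde(xi)(gi)                          # stands in for p_hat(x_i)
    pj = gaussian_kde(xj)(gj)
    GI, GJ = np.meshgrid(gi, gj, indexing="ij")
    pij = gaussian_kde(np.vstack([xi, xj]))(np.vstack([GI.ravel(), GJ.ravel()]))
    pij = pij.reshape(m, m)
    dA = (gi[1] - gi[0]) * (gj[1] - gj[0])             # grid cell area
    ratio = np.clip(pij / np.outer(pi, pj), 1e-12, None)
    return np.sum(pij * np.log(ratio)) * dA

def chow_liu_tree(X):
    # build the mutual information matrix, then take a maximum weight spanning tree
    n, d = X.shape
    M = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            M[i, j] = mutual_information(X[:, i], X[:, j])
    T = minimum_spanning_tree(-M)                      # negate: max weight spanning tree
    return list(zip(*T.nonzero()))                     # edges of the full tree F^(d-1)
```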

Let \hat p_{n_2}(x_i, x_j) and \hat p_{n_2}(x_k) be defined as in (4.13) and (4.14), but now evaluated solely based on the held-out data in D_2. For a density p_F that is supported by a forest F, we define the held-out negative log-likelihood risk as

(4.16)   R̂_{n_2}(p_F) = −\Bigl( \sum_{(i,j) \in E_F} \int_{X_i × X_j} \hat p_{n_2}(x_i, x_j) \log \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)} \, dx_i \, dx_j + \sum_{k \in V_F} \int_{X_k} \hat p_{n_2}(x_k) \log p(x_k) \, dx_k \Bigr).

The selected forest is then F̂_{n_1}^{(k̂)}, where

(4.17)   k̂ = \arg\min_{k \in \{0, ..., d−1\}} R̂_{n_2}( \hat p_{F̂_{n_1}^{(k)}} )

and where \hat p_{F̂_{n_1}^{(k)}} is computed using the density estimate \hat p_{n_1} constructed on D_1.

We can also estimate k̂ as

(4.18)   k̂ = \arg\max_{k \in \{0, ..., d−1\}} \frac{1}{n_2} \sum_{s \in D_2} \log\Bigl( \prod_{(i,j) \in E_{F̂^{(k)}}} \frac{\hat p_{n_1}(X_i^{(s)}, X_j^{(s)})}{\hat p_{n_1}(X_i^{(s)})\, \hat p_{n_1}(X_j^{(s)})} \prod_{k \in V_{F̂^{(k)}}} \hat p_{n_1}(X_k^{(s)}) \Bigr)

(4.19)   = \arg\max_{k \in \{0, ..., d−1\}} \frac{1}{n_2} \sum_{s \in D_2} \sum_{(i,j) \in E_{F̂^{(k)}}} \log \frac{\hat p_{n_1}(X_i^{(s)}, X_j^{(s)})}{\hat p_{n_1}(X_i^{(s)})\, \hat p_{n_1}(X_j^{(s)})}.

This optimization can be efficiently carried out by iterating over the d − 1 edges in F̂_{n_1}^{(d−1)}.

Once k̂ is obtained, the final forest-based kernel density estimate is given by

(4.20)   \hat p_n(x) = \prod_{(i,j) \in Ê^{(k̂)}} \frac{\hat p_{n_1}(x_i, x_j)}{\hat p_{n_1}(x_i)\, \hat p_{n_1}(x_j)} \prod_{k} \hat p_{n_1}(x_k).

Another alternative is to compute a maximum weight spanning forest, using Kruskal's algorithm, but with held-out edge weights

(4.21)   \hat w_{n_2}(i, j) = \frac{1}{n_2} \sum_{s \in D_2} \log \frac{\hat p_{n_1}(X_i^{(s)}, X_j^{(s)})}{\hat p_{n_1}(X_i^{(s)})\, \hat p_{n_1}(X_j^{(s)})}.

In fact, asymptotically (as n_2 → ∞) this gives an optimal tree-based estimator constructed in terms of the kernel density estimates \hat p_{n_1}.
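The pruning step (4.19) only needs the cumulative held-out weights of the edges in the order Chow–Liu added them. A minimal sketch, assuming edges_in_order is that ordered edge list and heldout_weight(i, j) is a hypothetical helper returning ŵ_{n_2}(i, j) of (4.21), computed from D_2 with densities fitted on D_1:

```python
import numpy as np

def select_forest(edges_in_order, heldout_weight):
    # w[k-1] is the held-out weight of the k-th edge added by Chow-Liu
    w = np.array([heldout_weight(i, j) for (i, j) in edges_in_order])
    gains = np.concatenate(([0.0], np.cumsum(w)))   # cumulative gain for k = 0, ..., d-1
    k_hat = int(np.argmax(gains))                   # eq. (4.19)
    return edges_in_order[:k_hat]                   # edge set of the selected forest
```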

4.2 Statistical Properties

The statistical properties of the forest density estimator can be analyzed under the same type of assumptions that are made for classical kernel density estimation. In particular, assume that the univariate and bivariate densities lie in a Hölder class with exponent β. Under this assumption the minimax rate of convergence in the squared error loss is O(n^{−β/(β+1)}) for bivariate densities and O(n^{−2β/(2β+1)}) for univariate densities. Technical assumptions on the kernel yield L_∞ concentration results for kernel density estimation (Giné and Guillou, 2002).

Choose the bandwidths h_1 and h_2 to be used in the one-dimensional and two-dimensional kernel density estimates according to

(4.22)   h_1 ≍ \Bigl( \frac{\log n}{n} \Bigr)^{1/(1+2β)},

(4.23)   h_2 ≍ \Bigl( \frac{\log n}{n} \Bigr)^{1/(2+2β)}.

This choice of bandwidths ensures the optimal rate of convergence. Let P_d^{(k)} be the family of d-dimensional densities that are supported by forests with at most k edges. Then

(4.24)   P_d^{(0)} ⊂ P_d^{(1)} ⊂ ··· ⊂ P_d^{(d−1)}.

Due to this nesting property,

(4.25)   \inf_{q_F ∈ P_d^{(0)}} R(q_F) ≥ \inf_{q_F ∈ P_d^{(1)}} R(q_F) ≥ ··· ≥ \inf_{q_F ∈ P_d^{(d−1)}} R(q_F).

This means that a full spanning tree would generally be selected if we had access to the true distribution. However, with access only to finite data to estimate the densities (\hat p_{n_1}), the optimal procedure is to use fewer than d − 1 edges. The following result analyzes the excess risk resulting from selecting the forest based on the held-out risk R̂_{n_2}.

THEOREM 4.1. Let \hat p_{F̂_d^{(k)}} be the estimate with |E_{F̂}^{(k)}| = k edges, obtained after the first k iterations of the Chow–Liu algorithm. Then under (omitted) technical assumptions on the densities and kernel, for any 1 ≤ k ≤ d − 1,

(4.26)   R( \hat p_{F̂_d^{(k)}} ) − \inf_{q_F ∈ P_d^{(k)}} R(q_F) = O_P\Bigl( k \sqrt{ \frac{\log n + \log d}{n^{β/(1+β)}} } + d \sqrt{ \frac{\log n + \log d}{n^{2β/(1+2β)}} } \Bigr)

and

(4.27)   R( \hat p_{F̂_d^{(k̂)}} ) − \min_{0 ≤ k ≤ d−1} R( \hat p_{F̂_d^{(k)}} ) = O_P\Bigl( (k^* + k̂) \sqrt{ \frac{\log n + \log d}{n^{β/(1+β)}} } + d \sqrt{ \frac{\log n + \log d}{n^{2β/(1+2β)}} } \Bigr),

where k̂ = \arg\min_{0 ≤ k ≤ d−1} R̂_{n_2}( \hat p_{F̂_d^{(k)}} ) and k^* = \arg\min_{0 ≤ k ≤ d−1} R( \hat p_{F̂_d^{(k)}} ).

The main work in proving this result lies in establishing bounds such as

(4.28)   \sup_{F ∈ F_d^{(k)}} | R(\hat p_F) − R̂_{n_2}(\hat p_F) | = O_P( φ_n(k) + ψ_n(d) ),

where R̂_{n_2} is the held-out risk, under the notation

(4.29)   φ_n(k) = k \sqrt{ \frac{\log n + \log d}{n^{β/(β+1)}} },

(4.30)   ψ_n(d) = d \sqrt{ \frac{\log n + \log d}{n^{2β/(1+2β)}} }.

For the proof of this and related results, see Liu et al. (2011).

Using this, one easily obtains

R( \hat p_{F̂_d^{(k̂)}} ) − R( \hat p_{F̂_d^{(k^*)}} )
(4.31)   = R( \hat p_{F̂_d^{(k̂)}} ) − R̂_{n_2}( \hat p_{F̂_d^{(k̂)}} ) + R̂_{n_2}( \hat p_{F̂_d^{(k̂)}} ) − R( \hat p_{F̂_d^{(k^*)}} )
(4.32)   = O_P( φ_n(k̂) + ψ_n(d) ) + R̂_{n_2}( \hat p_{F̂_d^{(k̂)}} ) − R( \hat p_{F̂_d^{(k^*)}} )
(4.33)   ≤ O_P( φ_n(k̂) + ψ_n(d) ) + R̂_{n_2}( \hat p_{F̂_d^{(k^*)}} ) − R( \hat p_{F̂_d^{(k^*)}} )
(4.34)   = O_P( φ_n(k̂) + φ_n(k^*) + ψ_n(d) ),

where (4.33) follows from the fact that k̂ is the minimizer of R̂_{n_2}(·). This result allows the dimension d to increase at a rate o( \sqrt{n^{2β/(1+2β)}} / \log n ) and the number of edges k to increase at a rate o( \sqrt{n^{β/(1+β)}} / \log n ), with the excess risk still decreasing to zero asymptotically.

Note that the minimax rate for 2-dimensional kernel density estimation under our stated conditions is n^{−β/(β+1)}. The rate above is essentially the square root of this rate, up to logarithmic factors. This is because a higher order kernel is used, which may result in negative values. Once we correct these negative values, the resulting estimated density will no longer integrate to one. The slower rate is due to a very simple truncation technique used to correct the higher-order kernel density estimator when estimating the mutual information. Current work is investigating a different version of the higher order kernel density estimator with more careful correction techniques, for which it is possible to achieve the optimal minimax rate.

In theory the bandwidths are chosen as in (4.22) and (4.23), assuming β is known. In our experiments presented below, the bandwidth h_k for the 2-dimensional kernel density estimator is chosen according to the Normal reference rule

(4.35)   h_k = 1.06 · \min\Bigl\{ \hat σ_k, \frac{\hat q_{k,0.75} − \hat q_{k,0.25}}{1.34} \Bigr\} · n^{−1/(2β+2)},

where \hat σ_k is the sample standard deviation of {X_k^{(s)}}_{s ∈ D_1}, and \hat q_{k,0.75}, \hat q_{k,0.25} are the 75% and 25% sample quantiles of {X_k^{(s)}}_{s ∈ D_1}, with β = 2. See Wasserman (2006) for a discussion of this choice of bandwidth.
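A small sketch of the Normal reference rule (4.35); with β = 2 the exponent −1/(2β + 2) is −1/6:

```python
import numpy as np

def normal_reference_bandwidth(x):
    # h_k = 1.06 * min(sigma_hat, IQR / 1.34) * n^(-1/6), eq. (4.35) with beta = 2
    n = len(x)
    sigma = np.std(x)
    q75, q25 = np.percentile(x, [75, 25])
    return 1.06 * min(sigma, (q75 - q25) / 1.34) * n ** (-1.0 / 6.0)
```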

5. EXAMPLES

5.1 Gene–Gene Interaction Graphs

The nonparanormal and Gaussian graphical model can construct very different graphs. Here we consider a data set based on Affymetrix GeneChip microarrays for the plant Arabidopsis thaliana (Wille et al., 2004) (see Figure 4). The sample size is n = 118. The expression levels for each chip are pre-processed by log-transformation and standardization. A subset of 40 genes from the isoprenoid pathway is chosen for analysis.

FIG. 4. Arabidopsis thaliana is a small flowering plant; it was the first plant genome to be sequenced, and its roughly 27,000 genes and 35,000 proteins have been actively studied. Here we consider a data set based on Affymetrix GeneChip microarrays with sample size n = 118, for which d = 40 genes have been selected for analysis.

While these data are often treated as multivariate Gaussian, the nonparanormal and the glasso give very different graphs over a wide range of regularization parameters, suggesting that the nonparametric method could lead to different biological conclusions.

The regularization paths of the two methods are compared in Figure 5. To generate the paths, we select 50 regularization parameters on an evenly spaced grid in the interval [0.16, 1.2]. Although the paths for the two methods look similar, there are some subtle differences. In particular, variables become nonzero in a different order.

Figure 6 compares the estimated graphs for the two methods at several values of the regularization parameter λ in the range [0.16, 0.37]. For each λ, we show the estimated graph from the nonparanormal in the first column. In the second column we show the graph obtained by scanning the full regularization path of the glasso fit and finding the graph having the smallest symmetric difference with the nonparanormal graph. The symmetric difference graph is shown in the third column. The closest glasso fit is different, with edges selected by the glasso not selected by the nonparanormal, and vice-versa. The estimated transformation functions for several genes are shown in Figure 7, and they exhibit non-Gaussian behavior.

Since the graphical lasso typically results in a large parameter bias as a consequence of the ℓ1 regularization, it sometimes makes sense to use the refit glasso, which is a two-step procedure: in the first step, a sparse inverse covariance matrix is obtained by the graphical lasso; in the second step, a Gaussian model is refit without ℓ1 regularization, but enforcing the sparsity pattern obtained in the first step.

Figure 8 compares forest density estimation to the graphical lasso and refit glasso. It can be seen that the forest-based kernel density estimator has better generalization performance. This is not surprising, given that the true distribution of the data is not Gaussian. (Note that since we do not directly compute the marginal univariate densities in the nonparanormal, we are unable to compute likelihoods under this model.)

FIG. 5. Regularization paths of both methods on the microarray data set. Although the paths for the two methods look similar, there are some subtle differences.

FIG. 6. The nonparanormal estimated graph for three values of λ = 0.2448, 0.2661, 0.30857 (left column), the closest glasso estimated graph from the full path (middle) and the symmetric difference graph (right).

FIG. 7. Estimated transformation functions for four genes in the microarray data set, indicating non-Gaussian marginals. The corresponding genes are among the nodes appearing in the symmetric difference graphs above.

FIG. 8. Results on microarray data. Top: held-out log-likelihood of the forest density estimator (black step function), glasso (red stars) and refit glasso (blue circles). Bottom: estimated graphs using the forest-based estimator (left) and the glasso (right), using the same node layout.

The held-out log-likelihood curve for forest density estimation achieves a maximum when there are only 35 edges in the model. In contrast, the held-out log-likelihood curves of the glasso and refit glasso achieve maxima when there are around 280 edges and 100 edges, respectively, while their predictive estimates are still inferior to those of the forest-based kernel density estimator. Figure 8 also shows the estimated graphs for the forest-based kernel density estimator and the graphical lasso. The graphs are automatically selected based on held-out log-likelihood, and are clearly different.

5.2 Graphs for Equities Data

For the examples in this section we collected stock price data from Yahoo! Finance (finance.yahoo.com). The daily closing prices were obtained for 452 stocks that were consistently in the S&P 500 index between January 1, 2003 and January 1, 2011. This gave us altogether 2015 data points; each data point corresponds to the vector of closing prices on a trading day. With S_{t,j} denoting the closing price of stock j on day t, we consider the variables X_{t,j} = log(S_{t,j} / S_{t−1,j}) and build graphs over the indices j. We simply treat the instances X_t as independent replicates, even though they form a time series.

The data contain many outliers; the reasons for these outliers include splits in a stock, which increases the number of shares. We Winsorize (or truncate) every stock so that its data points are within three times the mean absolute deviation from the sample average. The importance of this Winsorization is shown below; see the "snake graph" in Figure 10. For the following results we use the subset of the data between January 1, 2003 and January 1, 2008, before the onset of the "financial crisis." It is interesting to compare to results that include data after 2008, but we omit these for brevity.
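A minimal sketch of this preprocessing (ours, not the authors' code): compute daily log returns and Winsorize each stock at three mean absolute deviations from its sample average. Here prices is assumed to be a (T × d) array of closing prices:

```python
import numpy as np

def preprocess_returns(prices):
    X = np.log(prices[1:] / prices[:-1])          # X_{t,j} = log(S_{t,j} / S_{t-1,j})
    mean = X.mean(axis=0)
    mad = np.abs(X - mean).mean(axis=0)           # mean absolute deviation per stock
    return np.clip(X, mean - 3 * mad, mean + 3 * mad)   # truncate outliers at 3 MADs
```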

Target Corp. (Consumer Discr.): Big Lots, Inc. (Consumer Discr.), Costco Co. (Consumer Staples), Family Dollar Stores (Consumer Discr.), Kohl's Corp. (Consumer Discr.), Lowe's Cos. (Consumer Discr.), Macy's Inc. (Consumer Discr.), Wal-Mart Stores (Consumer Staples)

Yahoo Inc. (Information Tech.): Amazon.com Inc. (Consumer Discr.), eBay Inc. (Information Tech.), NetApp (Information Tech.)

FIG. 9. Example neighborhoods in a forest graph for two stocks, Yahoo Inc. and Target Corp. The corresponding GICS industries are shown in parentheses. (Consumer Discr. is short for Consumer Discretionary, and Information Tech. is short for Information Technology.)

The 452 stocks are categorized into 10 Global Industry Classification Standard (GICS) sectors: Consumer Discretionary (70 stocks), Consumer Staples (35 stocks), Energy (37 stocks), Financials (74 stocks), Health Care (46 stocks), Industrials (59 stocks), Information Technology (64 stocks), Materials (29 stocks), Telecommunications Services (6 stocks) and Utilities (32 stocks). In the graphs shown below, the nodes are colored according to the GICS sector of the corresponding stock. It is expected that stocks from the same GICS sector should tend to be clustered together, since such stocks tend to interact more with each other. This is indeed the case; for example, Figure 9 shows examples of the neighbors of two stocks, Yahoo Inc. and Target Corp., in the forest density graph.

Figures 10(a)–(c) show graphs estimated using the glasso, nonparanormal and forest density estimator on the data from January 1, 2003 to January 1, 2008. There are altogether n = 1257 data points and d = 452 dimensions. To estimate the glasso graph, we somewhat arbitrarily set the regularization parameter to λ = 0.55, which results in a graph that has 1316 edges, about 3 neighbors per node, and good clustering structure. The resulting graph is shown in Figure 10(a). The corresponding nonparanormal graph is shown in Figure 10(b). The regularization is chosen so that it too has 1316 edges. Only nodes that have neighbors in one of the graphs are shown; the remaining nodes are disconnected.

Since our dataset contains n = 1257 data points, we directly apply the forest density estimator on the whole dataset to obtain a full spanning tree of d − 1 = 451 edges. This estimator turns out to be very sensitive to outliers, since it exploits kernel density estimates as building blocks. In Figure 10(d) we show the estimated forest density graph on the stock data when outliers are not trimmed by Winsorization. In this case the graph is anomalous, with a snake-like character that weaves in and out of the 10 GICS sectors. Intuitively, the outliers make the two-dimensional densities appear like thin "pancakes," and densities with similar orientations are clustered together. To address this, we trim the outliers by Winsorizing at 3 MADs, as described above. Figure 10(c) shows the estimated forest graph, restricted to the same stocks shown for the graphs in (a) and (b). The resulting graph has good clustering with respect to the GICS sectors.

Figures 11(a)–(d) display the differences and edges common to the glasso, nonparanormal and forest graphs. Figure 11(a) shows the symmetric difference between the estimated glasso and nonparanormal graphs, and Figure 11(b) shows the common edges. Figure 11(c) shows the symmetric difference between the nonparanormal and forest graphs, and Figure 11(d) shows the common edges.

We refrain from drawing any hard conclusions about the effectiveness of the different methods based on these plots; how these graphs are used will depend on the application. These results serve mainly to highlight how very different inferences about the independence relations can arise from moving from a Gaussian model to a semiparametric model to a fully nonparametric model with restricted graphs.

6. RELATED WORK

There is surprisingly little work on structure learning of nonparametric graphical models in high dimensions. One piece of related work is sparse log-density smoothing spline ANOVA models, introduced by Jeon and Lin (2006). In such a model the log-density function is decomposed as the sum of a constant term, one-dimensional functions (main effects), two-dimensional

FIG. 10. Graphs built on S&P 500 stock data from Jan. 1, 2003 to Jan. 1, 2008. The graphs are estimated using (a) the glasso, (b) the nonparanormal and (c) forest density estimation. The nodes are colored according to their GICS sector categories. Nodes that have zero neighbors in both the glasso and nonparanormal graphs are not shown. Figure (d) shows the maximum weight spanning tree that results if the data are not Winsorized to trim outliers.

FIG. 11. Visualizations of the differences and similarities between the estimated graphs. The symmetric difference between the glasso and nonparanormal graphs is shown in (a), and the edges common to the graphs are shown in (b). Similarly, the symmetric difference between the nonparanormal and forest density estimate is shown in (c), and the common edges are shown in (d).

functions (two-way interactions) and so on:

(6.1)   \log p(x) = f(x) ≡ c + \sum_{j=1}^{d} f_j(x_j) + \sum_{j<k} f_{jk}(x_j, x_k) + ···.

Finding the junction tree structure that maximizes the likelihood is usually computationally expensive. Schwaighofer et al. (2007) propose a forward–backward strategy for nonparametric structure learning. However, such a greedy procedure does not guarantee that the global optimal solution is found.

REFERENCES

LIU, H., HAN, F., YUAN, M., LAFFERTY, J. and WASSERMAN, L. (2012). The nonparanormal SKEPTIC: High-dimensional semiparametric Gaussian copula graphical models. In Proceedings of the 29th International Conference on Machine Learning (ICML-12). ACM, New York.

MALLOWS, C. L., ed. (1990). The Collected Works of John W. Tukey. Vol. VI: More Mathematical: 1938–1984. Wadsworth & Brooks/Cole Advanced Books & Software, Pacific Grove, CA. MR1057793

MEINSHAUSEN, N. and BÜHLMANN, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34 1436–1462. MR2278363

RAVIKUMAR, P., WAINWRIGHT, M. J. and LAFFERTY, J. D. (2010). High-dimensional Ising model selection using ℓ1-regularized logistic regression. Ann. Statist. 38 1287–1319. MR2662343

RAVIKUMAR, P., WAINWRIGHT, M., RASKUTTI, G. and YU, B. (2009). Model selection in Gaussian graphical models: High-dimensional consistency of ℓ1-regularized MLE. In Advances in Neural Information Processing Systems 22. MIT Press, Cambridge, MA.

ROTHMAN, A. J., BICKEL, P. J., LEVINA, E. and ZHU, J. (2008). Sparse permutation invariant covariance estimation. Electron. J. Stat. 2 494–515. MR2417391

SCHWAIGHOFER, A., DEJORI, M., TRESP, V. and STETTER, M. (2007). Structure learning with nonparametric decomposable models. In Proceedings of the 17th International Conference on Artificial Neural Networks (ICANN'07). Elsevier, New York.

SKLAR, M. (1959). Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris 8 229–231. MR0125600

WASSERMAN, L. (2006). All of Nonparametric Statistics. Springer, New York. MR2172729

WILLE, A., ZIMMERMANN, P., VRANOVÁ, E., FÜRHOLZ, A., LAULE, O., BLEULER, S., HENNIG, L., PRELIĆ, A., VON ROHR, P., THIELE, L., ZITZLER, E., GRUISSEM, W. and BÜHLMANN, P. (2004). Sparse Gaussian graphical modelling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology 5 R92.

YUAN, M. and LIN, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika 94 19–35. MR2367824

ZHOU, S., LAFFERTY, J. and WASSERMAN, L. (2010). Time varying undirected graphs. Mach. Learn. 80 295–329.