Nonparametric Graph Estimation

Han Liu
Department of Operations Research and Financial Engineering, Princeton University
http://www.princeton.edu/~hanliu

Acknowledgements: Fang Han (JHU Biostatistics), John Lafferty (Chicago CS/Statistics), Larry Wasserman (CMU Statistics/ML), Tuo Zhao (JHU CS)

High-Dimensional Data Analysis

The dimensionality d increases with the sample size n. The total error decomposes as

    approximation error + estimation error + computing error

This setting is well studied under linear and Gaussian models. The theme of this talk: a little nonparametricity goes a long way.

Graph Estimation Problem

Infer conditional independence among d variables X_1, …, X_d from observational data: n samples arranged as an n × d data matrix whose rows are the observations x^1, …, x^n and whose columns correspond to the variables. The target is a graph G = (V, E) satisfying

    (X_i, X_j) ∉ E ⇔ X_i ⊥ X_j | the rest

Applications: density estimation, computing, visualization, ...

Desired Statistical Properties

Characterize the performance of an estimator f̂ using different criteria, relative to the oracle estimator f° over the model class F and the true function f*:

    Persistency:   Risk(f̂) − Risk(f°) = o_P(1)
    Consistency:   Distance(f̂, f*) = o_P(1)
    Sparsistency:  P(graph(f̂) ≠ graph(f*)) = o(1)
    Minimax optimality

Outline

    Nonparanormal
    Forest density estimation
    Summary

Gaussian Graphical Models

For X ~ N_d(μ, Σ) with Ω = Σ⁻¹,

    Ω_jk = 0 ⇔ X_j ⊥ X_k | the rest   (Lauritzen 96)

glasso, the graphical lasso (Yuan and Lin 06; Banerjee 08; Friedman et al. 08), minimizes the negative Gaussian log-likelihood plus an L1 regularizer, where Ŝ is the sample covariance:

    min_{Ω ≻ 0}  tr(ŜΩ) − log|Ω| + λ Σ_{j,k} |Ω_jk|

Neighborhood selection is an alternative (Meinshausen and Bühlmann 06).

CLIME, the constrained L1-minimization method (Cai et al. 2011):

    min_Ω  Σ_{j,k} |Ω_jk|   subject to   ‖ŜΩ − I‖_max ≤ λ

gDantzig, the graphical Dantzig selector (Yuan 2010).

Computation and Theory

Computing is scalable up to thousands of dimensions:

    glasso (Hastie et al.): Fortran, scales to d < 3000, very fast
    huge (Zhao and Liu): C, scales to d < 6000, 3x faster

Theory: persistency, consistency, sparsistency, optimal rates, ...
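The equivalence Ω_jk = 0 ⇔ X_j ⊥ X_k | the rest is easy to check numerically. A minimal sketch assuming numpy is available; the chain-graph precision matrix and all variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Precision matrix of a 4-node chain graph 1-2-3-4:
# off-diagonal nonzeros only on the chain edges.
Omega = np.eye(4) + np.diag([0.4] * 3, k=1) + np.diag([0.4] * 3, k=-1)
Sigma = np.linalg.inv(Omega)

# Sample n observations from N(0, Sigma).
n = 50_000
X = rng.multivariate_normal(np.zeros(4), Sigma, size=n)

# Invert the sample covariance: entries off the chain, e.g. (1,3),
# shrink toward zero, reflecting conditional independence.
S = np.cov(X, rowvar=False)
Omega_hat = np.linalg.inv(S)
print(np.round(Omega_hat, 2))
```

With n large, the off-chain entries of Omega_hat are near zero while the chain entries stay near 0.4, which is exactly the graph-recovery signal glasso and CLIME exploit under sparsity.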
A key result for the analysis: for the sample covariance Ŝ and population covariance Σ,

    ‖Ŝ − Σ‖_max = O_P(√(log d / n))

Many Real Data Are Non-Gaussian

A normal Q-Q plot of one typical gene from the Arabidopsis data (Wille et al. 04; n = 118, d = 39) shows a clear departure from normality. Can we relax the Gaussian assumption without losing statistical and computational efficiency?

The Nonparanormal

Gaussian ⇒ Gaussian copula (the nonparanormal).

Definition (Liu, Lafferty, Wasserman 09). A random vector X = (X_1, …, X_d) is nonparanormal, written X ~ NPN_d(Σ, {f_j}_{j=1}^d), if

    f(X) = (f_1(X_1), …, f_d(X_d)) ~ N_d(0, Σ)

where the f_j are strictly monotone and diag(Σ) = 1. Taking f_j(t) = (t − μ_j)/σ_j recovers arbitrary Gaussian distributions.

Visualization: bivariate nonparanormal densities under different transformations.

Basic Properties

The graph is encoded in the inverse correlation matrix. Let X ~ NPN_d(Σ, {f_j}_{j=1}^d) and Ω = Σ⁻¹; then

    p_X(x) = (2π)^{−d/2} |Ω|^{1/2} exp{ −(1/2) f(x)ᵀ Ω f(x) } ∏_{j=1}^d f_j′(x_j)

    ⇒ Ω_ij = 0 ⇔ X_i ⊥ X_j | the rest

The likelihood is not jointly convex, so how do we estimate the parameters?

Estimating the Transformation Functions

Directly estimate {f_j}_{j=1}^d without worrying about Ω. Let F_j(t) = P(X_j ≤ t) be the CDF of X_j. Since f_j is strictly monotone and f_j(X_j) ~ N(0, 1),

    F_j(t) = P(X_j ≤ t) = P(f_j(X_j) ≤ f_j(t)) = Φ(f_j(t))  ⇒  f_j(t) = Φ⁻¹(F_j(t))

Normal-score transformation:

    F̂_j(t) = (1/(n+1)) Σ_{i=1}^n I(x_j^i ≤ t)

Estimating the Inverse Correlation Matrix

Nonparanormal algorithm (Liu, Han, Lafferty, Wasserman 12):

Step 1: compute the Spearman rank correlation matrix R̂^ρ.
Step 2: transform R̂^ρ into Σ̂^ρ via

    (∗)  Σ̂^ρ_jk = 2 sin( (π/6) R̂^ρ_jk )

Σ̂^ρ provides a good estimate of Σ.
Step 3: plug Σ̂^ρ into glasso / CLIME / gDantzig to obtain Ω̂^ρ and the graph.

The same procedure was independently proposed by (Xue and Zou 12).

Nonparanormal Theory

Theorem (Liu, Han, Lafferty, Wasserman 12). Let X ~ NPN_d(Σ, f) and Ω = Σ⁻¹. Given whatever conditions on
Σ and Ω secure the consistency and sparsistency of glasso / CLIME / gDantzig under the Gaussian model, the nonparanormal estimator is also consistent and sparsistent, with exactly the same parametric rates of convergence.

⇒ The nonparanormal is a safe replacement for the Gaussian model.

Proof of the Theorem

The key is to show that

    ‖Σ̂^ρ − Σ‖_max = O_P(√(log d / n))

For the Gaussian distribution, Kruskal (1948) relates Pearson's population correlation Σ_jk to Spearman's population rank correlation R^ρ_jk:

    Σ_jk = 2 sin( (π/6) R^ρ_jk )

Because rank correlations are invariant under monotone transformations, the identity also holds for the nonparanormal distribution. By the theory of U-statistics,

    ‖Σ̂^ρ − Σ‖_max ≲ ‖R̂^ρ − R^ρ‖_max = O_P(√(log d / n))

Empirical Results

For non-Gaussian data, the nonparanormal strongly outperforms glasso. Sampling x^i ~ NPN_d(Σ, f) with n = 200, d = 40, and a nonlinear transformation f, and comparing false negatives and false positives against the true graph, the nonparanormal recovers the oracle graph (the one obtained by picking the best tuning parameter along the path) far more reliably than glasso.

Nonparanormal: Efficiency Loss

For Gaussian data, the nonparanormal loses almost no efficiency. Computationally there is no extra cost. Statistically, for samples x^1, …, x^n ~ N_d(0, Σ) with n = 80 and d = 100, the ROC curves for graph recovery (1 − FN against 1 − FP) show almost no efficiency loss.

Arabidopsis Data

The nonparanormal behaves differently from glasso on the Arabidopsis data: the regularization paths differ, and some estimated transformations f̂_j are highly nonlinear (e.g. for MECPS). The nonlinear transformations cause the estimated graphs to differ across regularization levels λ_1, λ_2, λ_3.

Scientific Implications

Cross-pathway interactions? The nonparanormal suggests links between the MEP pathway (MECPS) and the MVA pathway (HMGR1, HMGR2) that glasso misses. Whether such interactions exist is still open in the current biological literature (Hou et al. 2010).

Tradeoff

The nonparanormal allows unrestricted graphs and more flexible distributions. But what if the true distribution is not nonparanormal? We can trade structural flexibility for greater nonparametricity.

Forest Densities

Gaussian copula ⇒ fully nonparametric distributions. A forest is an acyclic graph.
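Before moving on to forests: Steps 1 and 2 of the nonparanormal algorithm above amount to a few lines of code. A minimal sketch assuming numpy and scipy are available; npn_correlation is an illustrative name, and Step 3 (plugging Σ̂^ρ into glasso / CLIME / gDantzig) is omitted:

```python
import numpy as np
from scipy.stats import rankdata

def npn_correlation(X):
    """Steps 1-2 of the nonparanormal algorithm: Spearman rank
    correlation matrix, then the sine transform 2*sin(pi/6 * R)."""
    ranks = np.apply_along_axis(rankdata, 0, X)   # column-wise ranks
    R = np.corrcoef(ranks, rowvar=False)          # Spearman's rho
    Sigma_hat = 2.0 * np.sin(np.pi / 6.0 * R)     # Kruskal's identity
    np.fill_diagonal(Sigma_hat, 1.0)
    return Sigma_hat

# Latent Gaussian with correlation 0.6, observed through strictly
# monotone transforms; the rank-based estimate still recovers 0.6.
rng = np.random.default_rng(1)
Z = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=5000)
X = np.column_stack([np.exp(Z[:, 0]), Z[:, 1] ** 3])
print(npn_correlation(X)[0, 1])  # close to 0.6
```

Because only ranks enter the computation, the estimate is unchanged by any strictly monotone marginal transformation, which is exactly why the Gaussian theory carries over.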
A distribution is supported on a forest F = (V, E_F) if

    p_F(x) = ∏_{(i,j)∈E_F} [ p(x_i, x_j) / (p(x_i) p(x_j)) ] · ∏_{k∈V} p(x_k)

A forest density estimator combines an estimated forest F̂ = (V, E_F̂) with estimated bivariate and univariate marginals p̂(x_i, x_j) and p̂(x_k). Advantages: visualization, computing, distributional flexibility, inference.

Some Previous Work

Most existing work on forests concerns discrete distributions: Chow and Liu (1968), Bach and Jordan (2003), Tan et al. (2010), Chechetka and Guestrin (2007). Our focus is statistical properties in high dimensions.

Estimation

Find the forest

    F^(k) = argmin_F KL( p(x) ‖ p_F(x) )   subject to   |E_F| ≤ k

the projection of the true density p(x) onto forests with at most k edges. This is a maximum-weight forest problem (Kruskal 56):

    F^(k) = argmax_F Σ_{(i,j)∈E_F} I(p_ij)   subject to   |E_F| ≤ k

where the edge weight is the mutual information

    I(p_ij) = ∫ p(x_i, x_j) log[ p(x_i, x_j) / (p(x_i) p(x_j)) ] dx_i dx_j

In practice the marginals p̂(x_i, x_j) and p̂(x_k) are clipped kernel density estimates.

Forest Density Estimation Algorithm

1. Sort the edges according to empirical mutual information I(p̂_ij).
2. Greedily pick edges such that no cycles are formed.
3. Output the obtained forest after k edges have been added.

Assumptions for Forest Graph Estimation

(A1) The bivariate marginals p(x_j, x_k) belong to a 2nd-order Hölder class.
(A2) p(x) has bounded support (e.g. [0,1]^d) and κ_1 ≤ min_{j,k} p(x_j, x_k) ≤ max_{j,k} p(x_j, x_k) ≤ κ_2.
(A3) The p(x_j, x_k) have vanishing partial derivatives on the boundaries.
(A4) For a "crucial" set of edges, the mutual informations are sufficiently distinct from each other, securing enough signal-to-noise ratio for correct structure recovery (Tan, Anandkumar, Willsky 11).

Forest Density Estimation Theory

Recall F^(k) = argmin_{F: |E_F| ≤ k} KL(p(x) ‖ p_F(x)), let P^(k) be the class of densities supported by forests with at most k edges, and let p_{F^(k)} be the oracle density estimator.

Theorem: Oracle Sparsistency (Liu et al. 12). For graph estimation, assume the parametric scaling (log d)/n → 0, and let the 1d and 2d KDEs use the same undersmoothed bandwidth ĥ ≍ n^{−1/4}. Then

    sup_p P( F̂^(k) ≠ F^(k) ) = o(1)
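The greedy procedure in the algorithm above is Kruskal's algorithm with an edge budget. A minimal sketch with a union-find structure, assuming the pairwise mutual informations have already been estimated; max_weight_forest and the toy weights are illustrative:

```python
def max_weight_forest(d, edge_weights, k):
    """Greedy (Kruskal-style) maximum-weight forest with at most k
    edges: sort edges by weight (here: estimated mutual information),
    accept an edge unless it would close a cycle.
    edge_weights: dict mapping (i, j) pairs to weights."""
    parent = list(range(d))

    def find(u):                      # union-find with path compression
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    forest = []
    for (i, j), w in sorted(edge_weights.items(), key=lambda e: -e[1]):
        ri, rj = find(i), find(j)
        if ri != rj:                  # no cycle: accept the edge
            parent[ri] = rj
            forest.append((i, j))
            if len(forest) == k:
                break
    return forest

# Toy example: 4 variables with pairwise mutual informations.
mi = {(0, 1): 0.9, (1, 2): 0.7, (0, 2): 0.6, (2, 3): 0.2, (0, 3): 0.1}
print(max_weight_forest(4, mi, k=3))  # [(0, 1), (1, 2), (2, 3)]
```

With k = d − 1 and all weights positive this reduces to the Chow-Liu tree; smaller k gives the pruned forests the theory analyzes.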
Proof of the Sparsistency Result

The key is to bound the error between the estimated and population mutual informations:

    |I(p̂_jk) − I(p_jk)| ≤ |I(p̂_jk) − E I(p̂_jk)| + |E I(p̂_jk) − I(p_jk)|

where the first term is stochastic and the second is bias. McDiarmid's inequality controls the stochastic term:

    P( stochastic ≥ t ) ≤ c_1 exp(−c_2 n t²)

The bias term is controlled by

    bias ≲ ∫ |E p̂_jk(x) − p_jk(x)| dx + ∫ E[ p̂_jk(x) − p_jk(x) ]² dx

with integrated bias IBias(p̂_jk) ≍ h² and integrated MSE IMSE(p̂_jk) ≍ h⁴ + 1/(nh²).

Consistency

Theorem: Oracle Consistency (Liu et al. 12). For density estimation, set the bandwidths of the 1d and 2d KDEs at the KDE-optimal rates h_1 ≍ n^{−1/5} and h_2 ≍ n^{−1/6}. Then

    sup_p E ‖p̂_{F̂^(k)} − p_{F^(k)}‖_1 ≤ C ( k / n^{2/3} + d / n^{4/5} )

where the k/n^{2/3} term comes from the bivariate KDEs and the d/n^{4/5} term from the univariate KDEs; this rate is minimax optimal. The proof uses Pinsker's inequality and the decomposability of the forest density in terms of the KL divergence.

Arabidopsis Data

[Figure: held-out log-likelihood as a function of graph size (number of edges, log scale) for NPN, FDE, and glasso on the Arabidopsis data.]

Forest Graphs on the Arabidopsis Data

The FDE forest graph again connects the MEP pathway (MECPS) with the MVA pathway (HMGR1, HMGR2): forest density estimation is consistent with the nonparanormal.

Nonparanormal vs.
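As an aside on the proof machinery above: the quantity being controlled, I(p̂_jk), is a plug-in mutual information estimate. A minimal sketch assuming scipy is available; it uses scipy's default bandwidth rather than the undersmoothed ĥ ≍ n^{−1/4} the theory prescribes, and replaces the clipped KDE of the paper with a simple numerical floor:

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_mutual_info(x, y):
    """Plug-in estimate of I(X;Y) = E log[ p(x,y) / (p(x) p(y)) ],
    averaging the log density ratio of Gaussian KDEs over the sample."""
    joint = gaussian_kde(np.vstack([x, y]))
    px, py = gaussian_kde(x), gaussian_kde(y)
    ratio = joint(np.vstack([x, y])) / (px(x) * py(y))
    return float(np.mean(np.log(np.clip(ratio, 1e-12, None))))

rng = np.random.default_rng(2)
x = rng.standard_normal(1000)
y_dep = 0.8 * x + 0.6 * rng.standard_normal(1000)  # correlated with x
y_ind = rng.standard_normal(1000)                   # independent of x
print(kde_mutual_info(x, y_dep), kde_mutual_info(x, y_ind))
```

The dependent pair yields a clearly positive estimate while the independent pair's estimate is near zero; these estimated weights are what the greedy forest construction ranks.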