
Mini-course - Probabilistic Graphical Models: A Geometric, Algebraic and Combinatorial Perspective

Caroline Uhler

Lecture 1: Graphical Models and Markov Properties

CIMI Workshop on Computational Aspects of Geometry, Toulouse

November 6, 2019

Applications of graphical models

Probabilistic models that capture the statistical dependencies between variables of interest in the form of a network

Used throughout the natural sciences, social sciences, and economics for modeling interactions

Undirected graphical models encode partial correlations, while directed graphical models can be used to represent causality

Figure: (a) Weather forecasting; (b) Gene regulation


Graphical models

Motivation: Provide an economical representation of a joint distribution using local relationships between variables

Origins of graphical models can be traced back to three communities:

Statistical physics: undirected graphs represent the distribution over a large system of interacting particles [Gibbs, 1902]

Genetics: directed graphs model inheritance in natural species [Wright, 1921]

Statistics: graphs represent interactions in multi-dimensional contingency tables [Bartlett, 1935]

Graphical models combine graph theory with probability theory into a powerful framework for multivariate statistical modeling [Lauritzen, 1996]

Algebraic, geometric and combinatorial questions arise naturally when studying graphical models

Overview of mini-course

(1) Introduction to graphical models - Markov properties

(2) Gaussian graphical models - Maximum likelihood estimation

(3) Covariance models with linear structure - Parameter estimation and structure learning

(4) Causal inference - Structure discovery

References: Graphical models

Bartlett, M. S. (1935). Contingency table interactions. J. Royal Stat. Soc. Suppl., 2:248-252.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. J. Royal Stat. Soc., 36:192-236.
Gibbs, J. W. (1902). Elementary Principles in Statistical Mechanics. Yale University Press.
Lauritzen, S. L. (1996). Graphical Models. Clarendon Press.
Moussouris, J. (1974). Gibbs and Markov random systems with constraints. J. Stat. Phys., 10:11-33.
Pearl, J. (1988). Fusion, propagation and structuring in belief networks. Artificial Intelligence, 29:357-370.
Shachter, R. D. (1998). The rational pastime (for determining irrelevance and requisite information in belief networks and influence diagrams). UAI.
Verma, T. & Pearl, J. (1990). Equivalence and synthesis of causal models. UAI.
Verma, T. & Pearl, J. (1992). An algorithm for deciding if a set of observed independencies has a causal explanation. UAI.
Wright, S. (1921). Correlation and causation. J. Agricult. Res., 20:557-585.

Mini-course - Probabilistic Graphical Models: A Geometric, Algebraic and Combinatorial Perspective

Caroline Uhler

Lecture 2: Gaussian Graphical Models

CIMI Workshop on Computational Aspects of Geometry, Toulouse

November 6, 2019

Overview

This lecture is based on a book chapter I wrote for the Handbook of Graphical Models, edited by M. Drton, S. Lauritzen, M. Maathuis and M. Wainwright:

C. Uhler, “Gaussian graphical models: An algebraic and geometric perspective”, available at arXiv:1707.04345

The goal of this lecture is to give an introduction to Gaussian graphical models and to show that algebraic, geometric and combinatorial questions arise naturally when studying them

Gaussian graphical models

Goal: Characterize the relationships among a large number of variables

Visualize interactions by a graph

Gaussian graphical models: used throughout the natural sciences, social sciences, and economics for modeling interactions among nodes, for continuous multivariate data

Figure: (a) Gene regulation network (Novarino et al., Science 343, 2014); (b) Athens stock exchange (Garos & Argyrakis, Physica A, 2007); (c) Wind speed forecasting



Gaussian Distribution

A random vector X ∈ R^p follows a multivariate Gaussian distribution with mean µ ∈ R^p and covariance matrix Σ ∈ S^p_{≻0} if it has density

f_{µ,Σ}(x) = (2π)^{−p/2} det(Σ)^{−1/2} exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) )

What can you say about the space of covariance matrices? What does the space of 3 × 3 correlation matrices look like?
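A quick NumPy sketch of this density together with a Monte Carlo sanity check (the particular µ, Σ, and sample size are arbitrary choices, not from the slides):

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Density f_{mu,Sigma}(x) of the multivariate Gaussian, as on the slide."""
    p = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    return (2 * np.pi) ** (-p / 2) * np.linalg.det(Sigma) ** (-0.5) * np.exp(-0.5 * quad)

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                   # symmetric positive definite

# The empirical covariance of many samples should recover Sigma.
X = rng.multivariate_normal(mu, Sigma, size=200_000)
S_hat = np.cov(X, rowvar=False)
print(np.max(np.abs(S_hat - Sigma)))             # small
```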



Gaussian Graphical Model

G = (V, E) undirected graph with vertices V = {1, ..., p} and edges E

K_G = { K ∈ S^p_{≻0} | K_ij = 0 for all i ≠ j with (i, j) ∉ E }

A Gaussian random vector X ∈ R^p is a Gaussian graphical model on G if

X ∼ N(µ, Σ) with Σ^{−1} ∈ K_G.

Question: Interpretation of missing edges in G?



Marginals and Conditionals of a Gaussian

Theorem. Let X ∼ N_p(µ, Σ) and partition X into two components X_A ∈ R^a and X_B ∈ R^b such that a + b = p. Let µ and Σ be partitioned accordingly, i.e.,

µ = (µ_A, µ_B)  and  Σ = [ Σ_{A,A}  Σ_{A,B} ; Σ_{B,A}  Σ_{B,B} ].

Then:

(a) the marginal distribution of X_A is N(µ_A, Σ_{A,A});

(b) the conditional distribution of X_A | X_B = x_B is N(µ_{A|B}, Σ_{A|B}), where

µ_{A|B} = µ_A + Σ_{A,B} Σ_{B,B}^{−1} (x_B − µ_B)  and  Σ_{A|B} = Σ_{A,A} − Σ_{A,B} Σ_{B,B}^{−1} Σ_{B,A}.

Note: Let K = Σ^{−1}. Then by the Schur complement, Σ_{A|A^c} = (K_{A,A})^{−1}. Hence a missing edge in G means K_ij = 0, or equivalently, X_i ⊥⊥ X_j | X_{V∖{i,j}}.
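The Schur-complement identity in the note can be verified numerically (a small NumPy sketch with an arbitrary positive definite Σ; the sizes are my choice):

```python
import numpy as np

rng = np.random.default_rng(1)
p, a = 5, 2
M = rng.standard_normal((p, p))
Sigma = M @ M.T + p * np.eye(p)         # random symmetric positive definite Sigma
K = np.linalg.inv(Sigma)

A = slice(0, a)                          # indices of X_A
B = slice(a, p)                          # indices of X_B = X_{A^c}

# Conditional covariance via the theorem ...
Sigma_cond = Sigma[A, A] - Sigma[A, B] @ np.linalg.solve(Sigma[B, B], Sigma[B, A])
# ... and via the Schur-complement identity Sigma_{A|A^c} = (K_{A,A})^{-1}
Sigma_cond_schur = np.linalg.inv(K[A, A])

print(np.allclose(Sigma_cond, Sigma_cond_schur))  # True
```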



Two Main Problems

Given i.i.d. samples X^{(1)}, ..., X^{(n)} ∈ R^p from a Gaussian graphical model:

Learn the graph G → see tomorrow's lectures (e.g. graphical lasso)

Estimate the edge weights, i.e. the non-zero entries of Σ^{−1} → maximum likelihood estimation

These problems don't depend on the mean µ; w.l.o.g. assume µ = 0

The sample covariance matrix is given by

S = (1/n) ∑_{i=1}^{n} X^{(i)} (X^{(i)})^T ∈ S^p_{⪰0},  and rk(S) = n ≤ p with probability 1

The log-likelihood is given by

ℓ(Σ; S) ∝ −log det(Σ) − tr(S Σ^{−1})

What can be said about the log-likelihood function?
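A minimal illustration of these objects in NumPy (sizes and seed are arbitrary; note the rank deficiency of S when n < p):

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 6, 4                                  # fewer samples than variables
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)

# Sample covariance (the mean is assumed to be 0, as on the slide)
S = X.T @ X / n
print(np.linalg.matrix_rank(S))              # n = 4 < p, so S is singular

def log_likelihood(Sigma, S):
    """ell(Sigma; S), up to an additive constant."""
    sign, logdet = np.linalg.slogdet(Sigma)
    return -logdet - np.trace(S @ np.linalg.inv(Sigma))
```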

Parameter estimation for Gaussian graphical models

Given a graph G, the maximum likelihood estimator (MLE) K̂ := Σ̂^{−1} solves the following convex optimization problem:

maximize log det K − tr(SK)
subject to K ∈ K_G

Question: What is the MLE when G is the complete graph?
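For the complete graph the constraint set is all of S^p_{≻0}, and the first-order condition K^{−1} − S = 0 suggests K̂ = S^{−1} whenever S is invertible. A small numerical check that no symmetric perturbation of S^{−1} improves the (strictly concave) objective:

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 3, 50
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)
S = X.T @ X / n                         # invertible with probability 1 since n > p

def obj(K):
    """log det K - tr(SK), with -inf outside the pd cone."""
    sign, logdet = np.linalg.slogdet(K)
    return logdet - np.trace(S @ K) if sign > 0 else -np.inf

K_hat = np.linalg.inv(S)                # candidate maximizer

# Random symmetric perturbations never improve the objective
for _ in range(100):
    D = rng.standard_normal((p, p))
    D = (D + D.T) / 2
    assert obj(K_hat + 1e-3 * D) <= obj(K_hat)
print("K = S^{-1} is a local maximum")
```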



Parameter estimation for Gaussian graphical models

By strong duality, given a graph G, the MLE K̂ := Σ̂^{−1} solves the following equivalent convex optimization problems:

maximize log det K − tr(KS)  subject to  K_ij = 0 for all (i, j) ∉ E
minimize −log det Σ − p  subject to  Σ_ij = S_ij for (i, j) ∈ E or i = j

Theorem (Dempster, 1972). In a Gaussian graphical model on G, the MLE Σ̂ exists if and only if the partial sample covariance matrix S_G = (S_ij | (i, j) ∈ E or i = j) (the sufficient statistics) can be extended to a positive definite matrix. The MLE Σ̂ is then the unique such completion whose inverse satisfies (Σ̂^{−1})_ij = 0 for all i ≠ j with (i, j) ∉ E.

Existence of the MLE is equivalent to a positive definite matrix completion problem!
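Dempster's characterization can be illustrated with iterative proportional scaling (IPS), a classical algorithm for this problem; a minimal sketch on an assumed path graph with arbitrary simulated data (graph, sweep count, and seed are my choices):

```python
import numpy as np

def ips(S, cliques, sweeps=200):
    """Iterative proportional scaling for the Gaussian MLE on the graph
    given by its cliques. Returns the estimated K = Sigma^{-1}."""
    p = S.shape[0]
    K = np.eye(p)
    for _ in range(sweeps):
        for C in cliques:
            C = np.asarray(C)
            Sigma = np.linalg.inv(K)
            # Update K on the C x C block so that (K^{-1})_{C,C} = S_{C,C}
            K[np.ix_(C, C)] += (np.linalg.inv(S[np.ix_(C, C)])
                                - np.linalg.inv(Sigma[np.ix_(C, C)]))
    return K

rng = np.random.default_rng(4)
p, n = 4, 100
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)
S = X.T @ X / n

cliques = [[0, 1], [1, 2], [2, 3]]       # path graph 0 - 1 - 2 - 3 (chordal)
K_hat = ips(S, cliques)
Sigma_hat = np.linalg.inv(K_hat)

print(np.round(Sigma_hat - S, 6))  # zero on edges and the diagonal
print(np.round(K_hat, 6))          # zero in entries (0,2), (0,3), (1,3)
```

Note that the off-graph entries of K are never touched by the clique updates, so the zero pattern of Dempster's theorem holds exactly by construction, while the clique marginals of Σ̂ match S after convergence.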

Geometry

Geometric picture: concentration matrices vs. covariance matrices

S_G := π_G(S),  𝒮_G := π_G(S^p_{⪰0});  note that 𝒮_G = K_G^∨

fiber_G(S) := { Σ ∈ S^p_{≻0} | Σ_G = S_G }


Example: K_{2,3}



det(K) = λ1 · (λ1² − λ2² + λ2λ3 − λ3² + λ2λ4 + λ3λ4 − λ4²) · (λ1² − λ2² − λ2λ3 − λ3² − λ2λ4 − λ3λ4 − λ4²)

Example: K_{2,3} (figure)

Positive Definite (pd) Matrix Completion Problem

Necessary condition for the existence of a pd completion: all fully specified minors are pd. However, this is in general not sufficient:

S_G = (  1    0.9    ?   −0.9
        0.9    1    0.9    ?
         ?    0.9    1    0.9
       −0.9    ?    0.9    1  )

does not have a pd completion.

Theorem (Grone, Johnson, Sá & Wolkowicz, 1984). For a graph G the following statements are equivalent:

(a) A G-partial matrix M_G ∈ R^{|E*|} has a pd completion if and only if all completely specified submatrices in M_G are positive definite.

(b) G is chordal (also known as triangulated), i.e. every cycle of length 4 or larger has a chord.
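A quick numerical check of the 4-cycle counterexample above, by brute-force scan over the two free entries (purely illustrative; grid resolution is arbitrary):

```python
import numpy as np

def completion(x, y):
    """Fill the two unspecified entries of the partial matrix S_G above."""
    return np.array([[ 1.0, 0.9,  x,  -0.9],
                     [ 0.9, 1.0, 0.9,  y  ],
                     [  x,  0.9, 1.0, 0.9 ],
                     [-0.9,  y,  0.9, 1.0 ]])

# Scan candidate values for the free entries; any pd completion would need
# |x| < 1 and |y| < 1, since all 2x2 principal minors must be positive.
grid = np.linspace(-0.999, 0.999, 200)
best = max(np.linalg.eigvalsh(completion(x, y))[0] for x in grid for y in grid)
print(best)  # negative: no completion is positive definite
```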

Statistical Problem

Current statistical applications: number of variables ≫ number of observations

Example: genetic networks, where gene expression data from a few individuals is used to model interactions among a large number of genes

→ Gaussian graphical models widely used in this context

Problem: What is the minimum number of observations for existence of the MLE in a given Gaussian graphical model?

Example: K_{2,3}

What is the minimal rank n* such that

S_G = ( s11   ?   s13  s14  s15
         ?   s22  s23  s24  s25
        s13  s23  s33   ?    ?
        s14  s24   ?   s44   ?
        s15  s25   ?    ?   s55 )

can be completed to a positive definite matrix for any S ∈ S^p_{⪰0} of rank n ≥ n*?



Bounds

Let n*_G denote the minimal rank such that every S ∈ S^p_{⪰0} has a positive definite completion on G. Then:

n*_G ≥ maximal clique size of G
n*_G ≤ p

Theorem (Grone, Johnson, Sá & Wolkowicz, 1984). For chordal (i.e. triangulated) graphs, n*_G = maximal clique size of G.

Let G be non-chordal. Then:

n*_G ≥ maximal clique size of G
n*_G ≤ maximal clique size in a minimal chordal cover of G
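Chordality, which these bounds hinge on, can be tested efficiently; a small self-contained sketch using maximum cardinality search (pure Python; function and variable names are mine):

```python
def is_chordal(p, edges):
    """Chordality test via maximum cardinality search (MCS): a graph is
    chordal iff MCS produces a perfect elimination ordering."""
    adj = {v: set() for v in range(p)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    weight = {v: 0 for v in range(p)}
    order, numbered = [], set()
    for _ in range(p):
        # Pick an unnumbered vertex with the most numbered neighbors
        v = max((u for u in range(p) if u not in numbered), key=lambda u: weight[u])
        order.append(v)
        numbered.add(v)
        for u in adj[v] - numbered:
            weight[u] += 1
    # Tarjan-Yannakakis check that the MCS order yields a perfect
    # elimination ordering: earlier neighbors of v (minus the latest one w)
    # must all be adjacent to w.
    pos = {v: i for i, v in enumerate(order)}
    for v in order:
        earlier = {u for u in adj[v] if pos[u] < pos[v]}
        if earlier:
            w = max(earlier, key=lambda u: pos[u])
            if not (earlier - {w}) <= adj[w]:
                return False
    return True

print(is_chordal(4, [(0, 1), (1, 2), (2, 3), (3, 0)]))          # 4-cycle: False
print(is_chordal(4, [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]))  # with chord: True
```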


Geometric View (figure)

Elimination Criterion

Theorem (Uhler, 2012). Let I_n be the ideal of (n + 1) × (n + 1) minors of a symmetric p × p matrix of unknowns S. Let I_{G,n} be the elimination ideal obtained from I_n by eliminating all unknowns corresponding to non-edges in the graph. If

I_{G,n} = 0,

then n*_G ≤ n.

I_n corresponds to all symmetric matrices of rank ≤ n

Elimination corresponds to projection onto S_G

I_{G,n} = 0 means that the projection is full-dimensional

3 × 3 grid

Theorem (Uhler, 2012). When G is the 3 × 3 grid, then n*_G = 3.

First example of a graph for which n*_G < maximal clique size in a minimal chordal cover
Solves an open problem posed by Steffen Lauritzen

Theorem (Gross and Sullivant, 2018). For any grid, n*_G = 3. Furthermore, for any planar graph, n* ≤ 4.

Computing the MLE

Convex optimization problem; can be solved e.g. using interior point methods or coordinate descent algorithms (often faster)

There is a closed-form formula for the MLE ⇐⇒ G is chordal (Lauritzen, 1996)

ML-degree: maximal number of solutions to the likelihood equations

There is a rational formula for the MLE (in the entries of S) ⇐⇒ ML-degree is 1 ⇐⇒ G is chordal (Sturmfels & Uhler, 2010)

Conjecture. The p-cycle maximizes the ML-degree over all graphs on p nodes and has ML-degree (p − 3)·2^(p−2) + 1.
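The closed-form formula for chordal G can be illustrated with the clique-separator decomposition of decomposable models (a sketch of the formula from Lauritzen, 1996; the path graph and simulated data are arbitrary choices):

```python
import numpy as np

def chordal_mle(S, cliques, separators):
    """Closed-form MLE K = Sigma^{-1} for a chordal graph:
    K = sum over cliques C of [(S_CC)^{-1}]^0 - sum over separators T of [(S_TT)^{-1}]^0,
    where [.]^0 embeds a block into a p x p matrix padded with zeros."""
    p = S.shape[0]
    K = np.zeros((p, p))
    for C in cliques:
        K[np.ix_(C, C)] += np.linalg.inv(S[np.ix_(C, C)])
    for T in separators:
        K[np.ix_(T, T)] -= np.linalg.inv(S[np.ix_(T, T)])
    return K

rng = np.random.default_rng(5)
p, n = 4, 100
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)
S = X.T @ X / n

# Path 0 - 1 - 2 - 3: cliques {0,1},{1,2},{2,3}; separators {1},{2}
K_hat = chordal_mle(S, [[0, 1], [1, 2], [2, 3]], [[1], [2]])
Sigma_hat = np.linalg.inv(K_hat)

# The MLE matches S on all edges and the diagonal, and K_hat has the
# prescribed zeros on non-edges.
print(np.allclose(Sigma_hat[0, 1], S[0, 1]), np.allclose(K_hat[0, 3], 0))
```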



Alternative Approach: Sparsity Order of a Graph

S_G is pd completable if and only if ⟨S_G, X⟩ > 0 for all extremal X ∈ K_G

Knowledge of the extremal rays of K_G is useful for deciding pd completability

The sparsity order of a graph G is defined as

ord(G) = max{ rk(X) | X ∈ K_G extremal }

There should be strong connections between existence of the MLE, ML-degree and sparsity order of a graph, but these are still quite unclear (Solus, Uhler & Yoshida, 2016)

Sparsity Order of a Graph

ord(G) = 1 if and only if G is chordal (Agler et al., 1988)

If H is an induced subgraph of G, then ord(H) ≤ ord(G) (Agler et al., 1988)

If G is the clique sum of two graphs G1 and G2, then ord(G) = max{ord(G1), ord(G2)} (Helton et al., 1989)

ord(G) ≤ p − 2, with equality if and only if G is a p-cycle; the extremal ranks are 1 and p − 2 (Helton et al., 1989)

ord(K_{m,n}) = m² − m + 1 if n ≥ m² − m + 1, and ord(K_{m,n}) = n otherwise (Grone & Pierce, 1990); all ranks 1, ..., ord(K_{m,n}) are extremal

All graphs of order 2 have been characterized (Laurent, 2001)

Many many open problems...

References

Agler, Helton, McCullough & Rodman: Positive semidefinite matrices with a given sparsity pattern (Linear Algebra and its Applications 107, 1988)
Boyd & Vandenberghe: Convex Optimization (Cambridge University Press, 2004)
Dempster: Covariance selection (Biometrics 28, 1972)
Grone, Johnson, Sá & Wolkowicz: Positive definite completions of partial Hermitian matrices (Linear Algebra and its Applications 58, 1984)
Grone & Pierce: Extremal bipartite matrices (Linear Algebra and its Applications 131, 1990)
Gross & Sullivant: The maximum likelihood threshold of a graph (Bernoulli 24, 2018)
Helton, Pierce & Rodman: The ranks of extremal positive semidefinite matrices with given sparsity pattern (SIAM Journal on Matrix Analysis and Applications 10, 1989)
Laurent: On the sparsity order of a graph and its deficiency in chordality (Combinatorica 21, 2001)
Lauritzen: Graphical Models (Oxford University Press, 1996)
Solus, Uhler & Yoshida: Extremal positive semidefinite matrices whose sparsity pattern is given by graphs without K5 minors (Linear Algebra and its Applications 509, 2016)
Sturmfels & Uhler: Multivariate Gaussians, semidefinite matrix completion, and convex algebraic geometry (Annals of the Institute of Statistical Mathematics 62, 2010)
Uhler: Geometry of maximum likelihood estimation in Gaussian graphical models (Annals of Statistics 40, 2012)

Mini-course - Probabilistic Graphical Models: A Geometric, Algebraic and Combinatorial Perspective

Caroline Uhler

Lecture 3: Structure Learning in Undirected Graphical Models

CIMI Workshop on Computational Aspects of Geometry, Toulouse

November 7, 2019



(Undirected) Gaussian graphical models

X ∼ N(0, Σ), K := Σ^{−1}, p = nr. of variables, n = nr. of samples

Gaussian graphical model: (i, j) ∉ E if and only if K_ij = 0, if and only if X_i ⊥⊥ X_j | X_{V∖{i,j}}

Sample covariance matrix S is of rank min(n, p)

MLE:

K̂ = argmax { log det(K) − trace(SK) | K ≻ 0, K_ij = 0 ∀(i, j) ∉ E }

In general unbounded if n < p. Given a graph G, what is the minimal n such that this problem is bounded (i.e., the MLE exists)? → Geometric problem

Geometry

Geometric picture: concentration matrices vs. covariance matrices

π_G : projection onto the edge set;  S_G := π_G(S),  𝒮_G := π_G(S^p_{⪰0})

Note that 𝒮_G = K_G^∨

The MLE for S exists if and only if S_G ∈ int(𝒮_G)

Geometric Picture

The MLE exists for n samples if the projection of the manifold of rank-n psd matrices lies in the interior of the cone 𝒮_G [Uhler, arXiv:1707.04345]


Caroline Uhler Mini-course: Graphical Models Toulouse, Nov 2019 5 / 15 Structure learning in (undirected) graphical models

MLE: Kˆ = argmax log det(K) trace(SK) { − } Kˆ is dense even if n >> p

Graphical lasso: Kˆ = argmax log det(K) trace(SK) λ K λ { − − | |1} sparsistent for particular choice of λ (under certain assumptions) [Ravikumar, Wainwright, Raskutti & Yu, 2011]

Kˆλ is not monotone in λ: edges can disappear/appear for increasing λ [Fattahi & Sojoudi, 2019] Kˆλ is not invariant to rescaling

Additional approaches include: node-wise regression with the lasso (Meinshausen & B¨uhlmann,2006)

CLIME: constrained `1-based optimization (Cai, Liu & Luo, 2011) Algorithm with false discovery rate control (Liu, 2013) ROCKET: for heavy-tailed distributions (Foygel-Barber & Kolar, 2018) Conditional independence testing
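The first bullet can be checked numerically. Below is a minimal NumPy sketch, not from the slides: the chain graph and all parameter values are illustrative. It uses the stationarity condition of log det(K) − trace(SK), which gives the unpenalized MLE K̂ = S⁻¹, and shows that K̂ has no exact zeros even when the true precision matrix is sparse.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 1000

# True precision matrix of a chain graph 1-2-3-4-5 (sparse, tridiagonal).
K_true = 2.0 * np.eye(p)
for i in range(p - 1):
    K_true[i, i + 1] = K_true[i + 1, i] = -0.5

# Draw n samples from N(0, K_true^{-1}) and form the sample covariance S.
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(K_true), size=n)
S = X.T @ X / n

# Stationarity of log det(K) - trace(SK) gives K^{-1} = S, so the MLE is:
K_hat = np.linalg.inv(S)

# Every entry of K_hat is nonzero: the MLE is dense, even though the
# true graph has only 4 edges.
print(np.count_nonzero(np.abs(K_hat) < 1e-10))  # 0
```

This is exactly why the sparsity-inducing penalty of the graphical lasso (or one of the other approaches listed above) is needed for structure learning.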

Caroline Uhler Mini-course: Graphical Models Toulouse, Nov 2019 5 / 15 Motivation: Graphical models under positive dependence

How to model strong forms of positive dependence in data?

Caroline Uhler    Mini-course: Graphical Models    Toulouse, Nov 2019    6 / 15

Positive dependence and MTP2 distributions

A distribution (i.e. density function) p on X = ∏_{v∈V} X_v, with X_v ⊆ R
discrete or open, is multivariate totally positive of order 2 (MTP2) if

    p(x) p(y) ≤ p(x ∧ y) p(x ∨ y)   for all x, y ∈ X,

where ∧ and ∨ are applied coordinate-wise.

Theorem (Fortuin-Kasteleyn-Ginibre inequality, 1971; Karlin & Rinott, 1980)
MTP2 implies positive association, i.e. cov{φ(X), ψ(X)} ≥ 0 for any
non-decreasing functions φ, ψ : R^m → R.

Theorem (FLSUWZ, 2017)
If p(x) > 0 and MTP2, then p(x) is faithful to an undirected graph.
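On a finite state space the defining inequality can be checked exhaustively. The sketch below is not from the slides; the two example probability tables are hypothetical. It tests MTP2 for a distribution on {0,1}^m stored as an m-dimensional array:

```python
from itertools import product
import numpy as np

def is_mtp2(p):
    """Check p(x)p(y) <= p(x ∧ y) p(x ∨ y) for all x, y on the binary cube,
    where p is an m-dimensional array indexed by tuples in {0,1}^m."""
    states = list(product([0, 1], repeat=p.ndim))
    for x in states:
        for y in states:
            lo = tuple(min(a, b) for a, b in zip(x, y))  # x ∧ y (coordinate-wise min)
            hi = tuple(max(a, b) for a, b in zip(x, y))  # x ∨ y (coordinate-wise max)
            if p[x] * p[y] > p[lo] * p[hi] + 1e-12:
                return False
    return True

# For two binary variables, MTP2 reduces to p00 * p11 >= p01 * p10.
p_pos = np.array([[0.4, 0.1], [0.1, 0.4]])   # positively dependent
p_neg = np.array([[0.1, 0.4], [0.4, 0.1]])   # negatively dependent
print(is_mtp2(p_pos), is_mtp2(p_neg))  # True False
```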

Caroline Uhler Mini-course: Graphical Models Toulouse, Nov 2019 7 / 15 Sample distribution is MTP2! If you sample a correlation matrix uniformly 6 at random the probability of it being MTP2 is < 10− !

Gaussian MTP2 distributions

Theorem (Bølviken 1982, Karlin & Rinott, 1983)

A multivariate Gaussian distribution p(x; K) is MTP2 if and only if the inverse covariance matrix K is an M-matrix, that is K 0 for all u = v. uv ≤ 6

Caroline Uhler Mini-course: Graphical Models Toulouse, Nov 2019 8 / 15 Sample distribution is MTP2! If you sample a correlation matrix uniformly 6 at random the probability of it being MTP2 is < 10− !

Gaussian MTP2 distributions

Theorem (Bølviken 1982, Karlin & Rinott, 1983)

A multivariate Gaussian distribution p(x; K) is MTP2 if and only if the inverse covariance matrix K is an M-matrix, that is K 0 for all u = v. uv ≤ 6

Ex: 2016 Monthly correlations of global stock markets (InvestmentFrontier.com) Nasdaq Canada Europe UK Australia  1.000 0.606 0.731 0.618 0.613  Nasdaq 0.606 1.000 0.550 0.661 0.598 Canada   S = 0.731 0.550 1.000 0.644 0.569  Europe  0.618 0.661 0.644 1.000 0.615  UK 0.613 0.598 0.569 0.615 1.000 Australia

Caroline Uhler Mini-course: Graphical Models Toulouse, Nov 2019 8 / 15 Gaussian MTP2 distributions

Theorem (Bølviken 1982, Karlin & Rinott, 1983)

A multivariate Gaussian distribution p(x; K) is MTP2 if and only if the inverse covariance matrix K is an M-matrix, that is K 0 for all u = v. uv ≤ 6

Ex: 2016 monthly correlations of global stock markets (InvestmentFrontier.com) Nasdaq Canada Europe UK Australia  2.629 0.480 1.249 0.202 0.490 Nasdaq 0.480− 2.109 −0.039 −0.790 −0.459 Canada  − − − −  S−1 = 1.249 0.039 2.491 0.675 0.213 Europe  −0.202 −0.790 0.675− 2.378 −0.482 UK −0.490 −0.459 −0.213 0.482− 1.992 Australia − − − −

Sample distribution is MTP2! If you sample a correlation matrix uniformly 6 at random the probability of it being MTP2 is < 10− !
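The M-matrix condition can be verified directly from the correlation table above; here is a short NumPy check (the matrix is transcribed from the slide):

```python
import numpy as np

# 2016 monthly correlations: Nasdaq, Canada, Europe, UK, Australia (from the slide).
S = np.array([
    [1.000, 0.606, 0.731, 0.618, 0.613],
    [0.606, 1.000, 0.550, 0.661, 0.598],
    [0.731, 0.550, 1.000, 0.644, 0.569],
    [0.618, 0.661, 0.644, 1.000, 0.615],
    [0.613, 0.598, 0.569, 0.615, 1.000],
])

K = np.linalg.inv(S)  # inverse covariance (precision) matrix

# By Bolviken / Karlin-Rinott, the fitted Gaussian is MTP2 iff K is an
# M-matrix: positive diagonal and nonpositive off-diagonal entries.
off_diag = K[~np.eye(5, dtype=bool)]
print(bool(np.all(np.diag(K) > 0) and np.all(off_diag < 0)))  # True
```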

Caroline Uhler    Mini-course: Graphical Models    Toulouse, Nov 2019    8 / 15

MTP2 constraints are often implicit

X is MTP2 in:
      ferromagnetic Ising models
      Markov chains with MTP2 transitions
      order statistics of i.i.d. variables
      Brownian motion tree models

|X| is MTP2 in:
      Gaussian / binary tree models
      Gaussian / binary latent tree models
      binary latent class models
      single factor analysis models

(Figure: gene-regulation schematic - lamin, transcription factors, RNA
polymerase, RNA transcript)

Caroline Uhler    Mini-course: Graphical Models    Toulouse, Nov 2019    9 / 15

Negative dependence: What this talk is not about...

Analog of the FKG inequality does not hold: negative association, i.e.
cov{φ(X), ψ(X)} ≤ 0 for any non-decreasing functions φ, ψ, is not implied by
p(x)p(y) ≥ p(x ∧ y)p(x ∨ y) for all x, y.

See Pemantle (1999): Towards a Theory of Negative Association

Strongly Rayleigh measures: sufficient for conditionally negative association
[Borcea, Brändén & Liggett, 2009]

Recently used in various machine learning applications to enforce diversity,
e.g. recommender systems, neural network sparsification, matrix sketching,
diversity priors; see the NeurIPS 2018 tutorial by Jegelka & Sra

Caroline Uhler    Mini-course: Graphical Models    Toulouse, Nov 2019    10 / 15

ML Estimation for Gaussian MTP2 distributions

Let S be the sample covariance matrix. Then maximum likelihood estimation is
a convex optimization problem:

Primal (max-likelihood):
    maximize    log det(K) − trace(KS)
    subject to  K ⪰ 0,  K_uv ≤ 0 for all u ≠ v.

Dual (entropy):
    minimize    −log det(Σ) − p
    subject to  Σ ⪰ 0,  Σ_vv = S_vv,  Σ_uv ≥ S_uv for all u ≠ v.

Theorem (Slawski & Hein, 2015)
The MLE in a Gaussian MTP2 model exists with probability 1 when n ≥ 2.
New proof: 3 lines using ultrametrics [Lauritzen, U. & Zwiernik, 2019]

Theorem (Wang, Roy & U., 2019)
Graphical model inference by testing the signs of the empirical partial
correlation coefficients is consistent in the high-dimensional setting without
the need of any tuning parameter. With ℓ1-penalty, the resulting estimator is
monotone.

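The empirical partial correlations on which the sign test is based are obtained by rescaling the precision matrix: ρ_uv = −K_uv / √(K_uu K_vv). Below is a small sketch with hypothetical chain-graph data; note the estimator here is the plain unconstrained MLE, not the MTP2-constrained estimator of the theorem:

```python
import numpy as np

rng = np.random.default_rng(7)
p, n = 4, 50_000

# Hypothetical truth: chain 1-2-3-4, so (1,3), (1,4), (2,4) are non-edges.
K = np.array([[ 2.0, -0.6,  0.0,  0.0],
              [-0.6,  2.0, -0.6,  0.0],
              [ 0.0, -0.6,  2.0, -0.6],
              [ 0.0,  0.0, -0.6,  2.0]])
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(K), size=n)
K_hat = np.linalg.inv(X.T @ X / n)

# Empirical partial correlations: rho_uv = -K_uv / sqrt(K_uu * K_vv).
d = np.sqrt(np.diag(K_hat))
rho = -K_hat / np.outer(d, d)

# Chain edges have clearly positive partial correlation (about 0.3 here),
# while non-edges have partial correlation fluctuating near 0.
print(round(rho[0, 1], 1), round(abs(rho[0, 2]), 1))
```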
Caroline Uhler Mini-course: Graphical Models Toulouse, Nov 2019 11 / 15 Application: Portfolio selection [Agrawal, Roy & U., 2019]

Daily stock return data from the Center for Research in Security Prices (CRSP) between 1975-2015 (NYSE, AMEX & NASDAQ stock exchanges).

M              T                  Linear     Approximate    EW-TQ   MTP2
(nr. assets)   (lookback period)  Shrinkage  Factor Model
100      25      0.694   0.710   0.730   0.803
100      50      0.694   0.625   0.637   0.849
100     100      0.694   0.600   0.617   0.896
100     200      0.694   0.670   0.688   0.899
100     400      0.694   0.736   0.782   0.892
100    1260      0.694   0.831   0.834   0.890
200      50      0.757   0.742   0.726   0.853
200     100      0.757   0.719   0.716   0.829
200     200      0.757   0.812   0.800   0.885
200     400      0.757   0.864   0.870   0.886
200     800      0.757   0.967   0.961   0.970
200    1260      0.757   0.906   0.916   0.955
500     125      0.764   0.876   0.872   1.019
500     250      0.764   0.985   0.977   1.112
500     500      0.764   0.940   0.980   1.045
500    1000      0.764   0.918   0.978   1.061

Information ratio (ratio of average return to standard deviation of returns)
when weights are estimated based on the "full" Markowitz portfolio problem.

Caroline Uhler    Mini-course: Graphical Models    Toulouse, Nov 2019    12 / 15

Conclusions

Graphical models combine graph theory with probability theory into a powerful framework for multivariate statistical modeling

Total positivity constraints are often implicit and reflect real processes:
ferromagnetism, latent tree models

MTP2 implies faithfulness

MTP2 is well-suited for high-dimensional applications (also in non-parametric setting, see our recent work)

Explicit MTP2 constraints enhance interpretability of graphical models (induce sparsity without the need of a tuning parameter)

MTP2 distributions not only have broad applications (finance, psychology, genomics), but also lead to beautiful theory (exponential families, convexity, combinatorics, semialgebraic geometry)

Caroline Uhler Mini-course: Graphical Models Toulouse, Nov 2019 13 / 15 References: Graphical models

Bartlett (1935). Contingency table interactions.

Cai, Liu & Luo (2011). A constrained ℓ1 minimization approach to sparse precision matrix estimation.
Fattahi & Sojoudi (2019). Graphical lasso and thresholding: Equivalence and closed-form solutions.
Foygel-Barber & Kolar (2018). ROCKET: Robust confidence intervals via Kendall's tau for transelliptical graphical models.
Friedman, Hastie & Tibshirani (2007). Sparse inverse covariance estimation with the graphical lasso.
Gibbs (1902). Elementary Principles in Statistical Mechanics, Yale Univ. Press.
Lauritzen, S. L. (1996). Graphical Models, Clarendon Press.
Liu (2013). Gaussian graphical model estimation with false discovery rate control.
Meinshausen & Bühlmann (2006). High-dimensional graphs and variable selection with the lasso.
Ravikumar, Wainwright, Raskutti & Yu (2011). High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence.
Uhler (2017). Gaussian graphical models: An algebraic and geometric perspective.
Wright, S. (1921). Correlation and causation. J. Agricult. Res., 20:557-585.

Caroline Uhler    Mini-course: Graphical Models    Toulouse, Nov 2019    14 / 15

References: Total positivity

Agrawal, Roy & Uhler (2019). Covariance matrix estimation under total positivity for portfolio selection.
Bølviken (1982). Probability inequalities for the multivariate normal with non-negative partial correlations.
Fallat, Lauritzen, Sadeghi, Uhler, Wermuth & Zwiernik (2017). Total positivity in Markov structures.
Fortuin, Kasteleyn & Ginibre (1971). Correlation inequalities on some partially ordered sets.
Karlin & Rinott (1983). M-matrices as covariance matrices of multinormal distributions.
Lauritzen, Uhler & Zwiernik (2019). Maximum likelihood estimation in Gaussian models under total positivity.
Lauritzen, Uhler & Zwiernik (2019). Total positivity in structured binary distributions (arXiv:1905.00516).
Lebowitz (1972). Bounds on the correlations and analyticity properties of ferromagnetic Ising spin systems.
Slawski & Hein (2014). Estimation of positive definite M-matrices and structure learning for attractive Gaussian Markov random fields.
Wang, Roy & Uhler (2019). Learning high-dimensional Gaussian graphical models under total positivity without adjustment of tuning parameters (arXiv:1906.05159).

Caroline Uhler    Mini-course: Graphical Models    Toulouse, Nov 2019    15 / 15

Mini-course - Probabilistic Graphical Models: A Geometric, Algebraic and Combinatorial Perspective

Caroline Uhler

Lecture 4: Causal Structure Discovery

CIMI Workshop on Computational Aspects of Geometry Toulouse

November 7, 2019

Caroline Uhler    Mini-course: Graphical Models    Toulouse, Nov 2019    1 / 20

Causal inference

Framework for causal inference from observational data (structural equation
models) developed in the 1920s by J. Neyman and S. Wright
Skepticism amongst statisticians halted the developments for 50 years
Reemergence in the 1970s after major contributions by J. Pearl (CS),
J. Robins (epidemiology), D. Rubin (stats) & P. Spirtes (philosophy)

Interaction between genetics and causal inference could be particularly
beneficial:
      Geneticists can perform interventional experiments relatively easily
      Drop-seq and Perturb-seq: high-throughput (100,000-1 mio single-cell
      measurements on all 20,000 genes per experiment) observational and
      interventional single-cell RNA-seq data is now available

Unique data and challenges!

Caroline Uhler Mini-course: Graphical Models Toulouse, Nov 2019 2 / 20 Gene expression data - single-cell RNA-seq

Causal inference using both observational and interventional data

(Figure: a causal network, with observational and interventional expression
heatmaps of ~20,000 genes across ~100,000 single cells)

Perturb-seq: large-scale gene expression data after interventions is
available; high-throughput observational and interventional single-cell
RNA-seq data [Dixit et al., 2016]

Caroline Uhler Mini-course: Graphical Models Toulouse, Nov 2019 3 / 20

Structural equation models

Introduced by Sewall Wright in the 1920s

Represent causal relationships by a directed acyclic graph (DAG);
DAGs encode conditional independence relations

Each node is associated with a random variable; stochasticity is introduced
by independent noise variables ε_i:

    X1 = f1(X3, ε1)
    X2 = f2(X1, ε2)
    X3 = f3(ε3)
    X4 = f4(X2, X3, ε4)

A structural equation model also defines interventional distributions:

    Perfect (hard) intervention on X2:  X2 = c
    General intervention on X2:  X2 = f̃2(X1, ε̃2)

Caroline Uhler    Mini-course: Graphical Models    Toulouse, Nov 2019    4 / 20
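The structural equations above can be simulated directly. The sketch below fixes hypothetical linear choices for f1, ..., f4 (the slides leave them unspecified) and contrasts the observational mechanism for X2 with a perfect intervention do(X2 = c):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

def sample(do_x2=None):
    """Simulate the DAG X3 -> X1 -> X2 -> X4, X3 -> X4 with independent
    Gaussian noise; the linear structural functions are hypothetical."""
    eps = rng.normal(size=(4, n))
    x3 = eps[2]                              # X3 = f3(eps3)
    x1 = 0.8 * x3 + eps[0]                   # X1 = f1(X3, eps1)
    if do_x2 is None:
        x2 = 0.5 * x1 + eps[1]               # X2 = f2(X1, eps2)
    else:
        x2 = np.full(n, do_x2)               # perfect intervention: X2 = c
    x4 = 0.7 * x2 - 0.3 * x3 + eps[3]        # X4 = f4(X2, X3, eps4)
    return x1, x2, x3, x4

x1_obs, _, _, x4_obs = sample()
x1_do, _, _, x4_do = sample(do_x2=2.0)

# do(X2 = 2) shifts the downstream variable X4 (E[X4] moves by 0.7 * 2 = 1.4)
# but leaves the non-descendant X1 unchanged in distribution.
print(round(x4_do.mean() - x4_obs.mean(), 1))       # ≈ 1.4
print(round(abs(x1_do.mean() - x1_obs.mean()), 1))  # ≈ 0.0
```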

Markov equivalence classes on 3 nodes & talk overview

Markov equivalence: different DAGs can encode same conditional independence relations (through factorization of the joint distribution)

Interventional Markov equivalence classes?
      How do they depend on the type of intervention? Do perfect
      interventions provide smaller equivalence classes than imperfect
      interventions?
      Algorithms for learning the interventional Markov equivalence class?

Caroline Uhler    Mini-course: Graphical Models    Toulouse, Nov 2019    5 / 20

Hauser and Bühlmann (2012): characterized ℐ-Markov equivalence classes
under perfect interventions: an edge is orientable if it is
      orientable from observational data
      adjacent to an intervened node

Theorem (Yang, Katcoff & Uhler, ICML 2018)
The ℐ-Markov equivalence classes under perfect and imperfect interventions
are the same.

Proof: By introducing & providing a graphical criterion for the ℐ-Markov
property for ℐ-DAGs.

Definition 5 (Intervention graph). Let D = ([p], E) be a DAG with vertex set
[p] and edge set E, and I ⊂ [p] an intervention target. The intervention
graph of D is the DAG D^(I) = ([p], E^(I)), where
E^(I) := {(a, b) | (a, b) ∈ E, b ∉ I}.
For a causal model (D, f), an interventional density f(· | do(X_I = U_I))
obeys the Markov property of D^(I): the Markov property of the observational
density is inherited.

We are interested in causal inference from data sets originating from
multiple interventions, i.e. from an intervention setting
S = {(I_j, f̃_j)}_{j=1}^J, where I_j ⊂ [p] is an intervention target and f̃_j
a level density on X_{I_j}; the corresponding (multi)set of intervention
targets ℐ = {I_j}_{j=1}^J is called a family of targets. Interventional data
of sample size n consists of independent, but not identically distributed,
samples X^(1), ..., X^(n), where X^(i) is drawn from
f(· | do(X_{T^(i)} = U^(i))) with U^(i) ~ f̃^(i) and T^(i) the intervention
target under which X^(i) was produced; each target I ∈ ℐ appears at least
once. The data set can also contain observational data, namely if ∅ ∈ ℐ.

To be able to estimate at least the complete skeleton of a causal structure
(as in the observational case), the intervention experiment has to be based
on a conservative family of targets:

Definition 6 (Conservative family of targets). A family of targets ℐ is
called conservative if for all a ∈ [p], there is some I ∈ ℐ such that a ∉ I.

Interventional Markov equivalence class: Let ℐ be a set of intervention
targets.  Ex: Perfect interventions, ℐ = {∅, {4}, {3, 5}}

(Figure 1: A DAG D on vertices 1-7 and the corresponding intervention graphs
D^({4}) and D^({3,5}).)
Caroline Uhler    Mini-course: Graphical Models    Toulouse, Nov 2019    6 / 20

XX(1)X(1),X(1),X(2),X(2),...,X(2),...,X,...,X(n()nindependent,)(nindependent,) independent, (i)(i)(i) (i)(i)(i) ˜ XXX f f f dodoDdo(X(X(X ==U=U(iU)()i)(),Ui) ),U,U(i)(i)(i)f f˜(if)˜(,ii)(,ii) ,i=1=1=1,...,n,...,n,...,n , , , (3)(3)(3) ∼∼∼·|·|·| D DT (Ti)(Ti)(i) T T T T T T∼∼∼T T T andandand we we assumewe assume assume that that that each each each' target' target' targetI I I appearsappearsappears at at( least at( least( least once once once in in the in the the sequence sequence sequence. . . ∈∈I∈I I T T T 2.22.22.2 Interventional Interventional Interventional Markov Markov Markov Equivalence: Equivalence: Equivalence: New New New Concepts Concepts Concepts and and and Re Results Resultssults AnAnAn intervention intervention intervention at at some at some some target target targeta a a[p[]destroystheoriginalcausalinfluenceofothervariablesp[]destroystheoriginalcausalinfluenceofothervariablesp]destroystheoriginalcausalinfluenceofothervariables ∈∈∈ ofof theof the the system system system on onX onXa.InterventionaldatathereofcanhencenotbeusedtodetermXa.Interventionaldatathereofcanhencenotbeusedtodeterma.Interventionaldatathereofcanhencenotbeusedtodetermineineine the the the causal causal causal parentsparentsparents of ofX ofXa Xina ina thein the the (undisturbed) (undisturbed) (undisturbed) system. system. system. To To be To be able be able able to to estimate to estimate estimate at at least at least least the the t completehe complete complete skeleton skeleton skeleton of of of acausalstructure(asintheobservationalcase),anintervacausalstructure(asintheobservationalcase),anintervacausalstructure(asintheobservationalcase),aninterventionentionention experiment experiment experiment has has has to to be to be performed be performed performed basedbasedbased on on a on aconservative aconservativeconservativefamilyfamilyfamily of of targets: of targets: targets:

DefinitionDefinitionDefinition 6 6 (Conservative 6 (Conservative (Conservative family family family of of targets) of targets) targets)AfamilyoftargetsAfamilyoftargetsAfamilyoftargetsisis calledis called calledconservativeconservativeconservative I I I ifif forif for forall alla alla a[p[],p[], therep], there there is is some is some someI I I suchsuchsuch that that thata/a/a/I.I.I. Interventional∈∈∈ Markov∈∈I∈I I equivalence∈∈∈ class InIn thisIn this this paper, paper, paper, we we restrictwe restrict restrict our our our considerations considerations considerations to toconservative toconservativeconservativefamiliesfamiliesfamilies of of targets; of targets; targets; see see see Section Section Section 2.3 2.3 2.3 for for for Let be a set of intervention targets amoredetaileddiscussion.Notethateveryexperimentinwhamoredetaileddiscussion.Notethateveryexperimentinwhamoredetaileddiscussion.NotethateveryexperimentinwhI ichichich we we wealso also also measure measure measure observational observational observational datadatadata corresponds corresponds corresponds to to a to aconservative aconservative conservative family family family of of targets. of targets. targets. Ex: Perfect interventions = , 4 , 3, 5 I {∅ { } { }} 123412341234123412341234123412341234

567567567 567567567 567567567 (a) D ( 4( )4( )4 ) ( 3(,53(,)53,)5 ) (a)(a)DD (b)(b)D(b)D{ D(}{ 4}{ )} (c)(c)D(c)(D{3D,{5}{)} } (a) G ∅ (b) G { } (c) G { }

( 4( 4)( )4 ) ( 3(,53(,5)3,)5 ) FigureFigureFigure 1: 1: A 1: A DAG A DAG DAGDDandDandand the the the corresponding corresponding corresponding intervention intervention intervention graphs graphs graphsDD{D} andandandDD{D } . . . Hauser and B¨uhlmann (2012): characterized -Markov{ }{ equivalence} { { } } classes under perfect interventions: an edge isI orientable if it is 4 4 4 orientable from observational data adjacent to an intervened node

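Definition 5 is mechanical enough to state in code. A minimal sketch; the edge list below is illustrative only (the exact DAG of Figure 1 is not reproduced here):

```python
def intervention_graph(edges, target):
    """Definition 5: the intervention graph keeps exactly the edges (a, b)
    whose head b is not intervened on (b not in I)."""
    return [(a, b) for (a, b) in edges if b not in target]

# Hypothetical DAG on [5], standing in for the DAG of Figure 1.
edges = [(1, 2), (2, 3), (3, 4), (2, 5)]
print(intervention_graph(edges, {4}))      # [(1, 2), (2, 3), (2, 5)]
print(intervention_graph(edges, {3, 5}))   # [(1, 2), (3, 4)]
```

All edges pointing into an intervened vertex are cut; everything else is untouched.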
Theorem (Yang, Katcoff & Uhler, ICML 2018): The I-Markov equivalence classes under perfect and imperfect interventions are the same.

Proof: By introducing and providing a graphical criterion for the I-Markov property for I-DAGs.

Caroline Uhler Mini-course: Graphical Models Toulouse, Nov 2019 6 / 20

Algorithms for learning causal graphs

There are two main types of algorithms for learning causal graphs from observational data:

Constraint-based: treat causal search as a constraint satisfaction problem; constraints given by conditional independence relations; main example: PC algorithm [Spirtes, Glymour & Scheines, 2001]

Properties: very fast, with consistency guarantees (with prob. 1 as n → ∞), require large sample size, tend to miss edges
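The conditional independence constraints that such methods test can, for Gaussian data, be checked through partial correlations. A small sketch on a hypothetical 3-variable chain (coefficients are illustrative; this is the test primitive, not the PC algorithm itself):

```python
import numpy as np

# Structural equations of a chain 1 -> 2 -> 3 with unit-variance noise:
# X1 = e1, X2 = 2 X1 + e2, X3 = 1.5 X2 + e3 (illustrative coefficients).
B = np.array([[0.0, 0.0, 0.0],
              [2.0, 0.0, 0.0],
              [0.0, 1.5, 0.0]])
M = np.linalg.inv(np.eye(3) - B)
Sigma = M @ M.T                     # population covariance of X

def partial_corr(Sigma, i, j, S):
    """Partial correlation of X_i and X_j given X_S from the covariance matrix;
    it vanishes exactly when X_i _||_ X_j | X_S in the Gaussian case."""
    idx = [i, j] + list(S)
    P = np.linalg.inv(Sigma[np.ix_(idx, idx)])  # precision of the submatrix
    return -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])

print(abs(partial_corr(Sigma, 0, 2, [1])) < 1e-9)   # True: X1 _||_ X3 | X2
print(round(partial_corr(Sigma, 0, 2, []), 3))      # 0.857: marginally dependent
```

With finite samples one would replace Sigma by the sample covariance and apply Fisher's z-transform to calibrate the test.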

Score-based: maximize a score (e.g. BIC) of a Markov equivalence class with respect to a data set by greedy search; main example: Greedy Equivalence Search (GES) [Chickering, 2002]

Properties: higher accuracy for same sample size, huge search space, theoretical consistency guarantees
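For intuition, a sketch of the Gaussian BIC score that such a greedy search might maximize, computed through node-wise regressions. This is an illustrative scoring function under assumed linear-Gaussian mechanisms, not the GES implementation:

```python
import numpy as np

def gaussian_bic(X, parents):
    """BIC of a Gaussian DAG model: node-wise linear regressions on the parents,
    log-likelihood minus 0.5 * (#parameters) * log(n). parents[j] lists pa(j)."""
    n, p = X.shape
    score = 0.0
    for j in range(p):
        pa = parents[j]
        if pa:
            A = np.column_stack([X[:, pa], np.ones(n)])
            coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
            resid = X[:, j] - A @ coef
        else:
            resid = X[:, j] - X[:, j].mean()
        s2 = resid @ resid / n
        loglik = -0.5 * n * (np.log(2 * np.pi * s2) + 1)
        score += loglik - 0.5 * (len(pa) + 2) * np.log(n)  # coefs + intercept + variance
    return score

rng = np.random.default_rng(1)
x1 = rng.normal(size=2000)
x2 = 2.0 * x1 + rng.normal(size=2000)   # true structure: 1 -> 2
X = np.column_stack([x1, x2])
print(gaussian_bic(X, {0: [], 1: [0]}) > gaussian_bic(X, {0: [], 1: []}))  # True
```

The true structure wins because the likelihood gain from the parent far exceeds the log(n) penalty for one extra coefficient.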

Caroline Uhler Mini-course: Graphical Models Toulouse, Nov 2019 7 / 20

Limitation of score-based approaches

(Table from Gillispie & Perlman (2001): counts of DAGs versus counts of Markov equivalence classes for small numbers of nodes.)

Problem of enumerating Markov equivalence classes and their sizes leads to hard and beautiful combinatorics problems, e.g.: formula for the number of equivalence classes on p nodes? Average size of equivalence classes?

[Radhakrishnan, Solus & Uhler, UAI 2017]
[Katz-Rogozhnikov, Shanmugam, Squires & Uhler, AISTATS 2019]

Caroline Uhler Mini-course: Graphical Models Toulouse, Nov 2019 8 / 20
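The small cases can be enumerated directly. A sketch for p = 3, using the characterization of a Markov equivalence class by its skeleton and v-structures (Verma & Pearl, 1990):

```python
from collections import deque
from itertools import combinations, product

def is_acyclic(edges, p=3):
    """Kahn's algorithm: the digraph is acyclic iff every vertex gets processed."""
    indeg = {v: 0 for v in range(p)}
    for _, b in edges:
        indeg[b] += 1
    queue = deque(v for v in range(p) if indeg[v] == 0)
    seen = 0
    while queue:
        v = queue.popleft()
        seen += 1
        for a, b in edges:
            if a == v:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return seen == p

def mec_label(edges):
    """A Markov equivalence class is determined by skeleton + v-structures."""
    skeleton = frozenset(frozenset(e) for e in edges)
    v_structs = frozenset((a, b, c) for (a, c) in edges for (b, d) in edges
                          if c == d and a < b and frozenset((a, b)) not in skeleton)
    return (skeleton, v_structs)

pairs = list(combinations(range(3), 2))
dags = []
for choice in product(range(3), repeat=len(pairs)):  # 0 = no edge, 1 = a->b, 2 = b->a
    edges = tuple((a, b) if c == 1 else (b, a)
                  for (a, b), c in zip(pairs, choice) if c)
    if is_acyclic(edges):
        dags.append(edges)

print(len(dags))                               # 25 DAGs on 3 labeled nodes
print(len({mec_label(e) for e in dags}))       # 11 Markov equivalence classes
```

Already at p = 3 the average class size is 25/11; the papers cited above study how these quantities behave as p grows.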

Limitation of constraint-based approaches

Constraint-based methods require the faithfulness assumption:

(i, j) ∈ E ⇐⇒ Xi ⊥̸⊥ Xj | XS  ∀ S ⊂ V \ {i, j}

[Zhang & Spirtes, 2003]

Ex (figure): Medicine → Lung directly and Medicine → Immune System → Lung, with effects of opposite sign along the two paths. Faithfulness means that causal effects cannot cancel out!

Caroline Uhler Mini-course: Graphical Models Toulouse, Nov 2019 9 / 20

Unfaithful distributions: 3-node example

X1 = ε1
X2 = a12 X1 + ε2            ⇒  X ∼ N(0, Σ) with Σ^{-1} = (I − A)(I − A)^T, ε ∼ N(0, I)
X3 = a13 X1 + a23 X2 + ε3

Faithfulness is NOT satisfied if any of the following relations hold:

X1 ⊥⊥ X2       ⇐⇒  det((Σ^{-1})_{13,23}) = a12 = 0
X1 ⊥⊥ X3       ⇐⇒  det((Σ^{-1})_{12,23}) = a13 + a12 a23 = 0
X2 ⊥⊥ X3       ⇐⇒  det((Σ^{-1})_{12,13}) = a12^2 a23 + a12 a13 + a23 = 0
X1 ⊥⊥ X2 | X3  ⇐⇒  det((Σ^{-1})_{1,2}) = a13 a23 − a12 = 0
X1 ⊥⊥ X3 | X2  ⇐⇒  det((Σ^{-1})_{1,3}) = −a13 = 0
X2 ⊥⊥ X3 | X1  ⇐⇒  det((Σ^{-1})_{2,3}) = −a23 = 0

⇒ Faithfulness not satisfied on a collection of hypersurfaces in R^{|E|}
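The cancellation can be checked numerically. A sketch that places the parameters on the hypersurface a13 + a12 a23 = 0 (the coefficient values are illustrative):

```python
import numpy as np

# Choose a13 on the hypersurface a13 + a12*a23 = 0: then X1 _||_ X3
# even though the edge 1 -> 3 is present (a13 != 0).
a12, a23 = 2.0, 3.0
a13 = -a12 * a23

B = np.zeros((3, 3))
B[1, 0], B[2, 0], B[2, 1] = a12, a13, a23   # X2 = a12 X1 + e2; X3 = a13 X1 + a23 X2 + e3
M = np.linalg.inv(np.eye(3) - B)
Sigma = M @ M.T                              # covariance for unit-variance noise

print(abs(Sigma[0, 2]) < 1e-9)   # True: Cov(X1, X3) = a13 + a12*a23 = 0
print(Sigma[0, 1])               # = a12: X1 and X2 remain dependent
```

A constraint-based algorithm fed this distribution would wrongly delete the edge 1 → 3, since the marginal independence X1 ⊥⊥ X3 holds exactly.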

Caroline Uhler Mini-course: Graphical Models Toulouse, Nov 2019 10 / 20
3-node example continued [Uhler, Raskutti, Bühlmann & Yu, Ann. Stat. 2013]

For consistency of constraint-based algorithms the data has to be bounded away from these hypersurfaces by √(log(p)/n)

For high-dimensional consistency: pn = o(log(pn))

Caroline Uhler Mini-course: Graphical Models Toulouse, Nov 2019 11 / 20
Alternative approach: Permutation-based searches

Idea: DAG defined by ordering of vertices (permutation) and skeleton

For p = 10 the search space is of size 10! = 3,628,800 permutations versus ~10^18 DAGs

For each permutation π construct a DAG Gπ = (V , Eπ) by

(π(i), π(j)) ∈ Eπ  ⇐⇒  Xπ(i) ⊥̸⊥ Xπ(j) | X{π(1),...,π(i−1),π(i+1),...,π(j−1)}

Greedy search for sparsest permutation Gπ∗ (GSP) is consistent under strictly weaker conditions than faithfulness [Mohammadi, Uhler, Wang & Yu, SIAM J. Discr. Math., 2018] [Solus, Wang, Matejovicova & Uhler, arXiv:1702.03530]

Edges in the polytope of permutations (i.e., the permutohedron) connect permutations that differ by a neighboring transposition, e.g. (3, 1, 4, 2) − (3, 4, 1, 2)
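The construction of Gπ can be sketched directly against a CI oracle. Here the oracle is the hypothetical set of CI relations from the 4-node Greedy SP example on the following slide:

```python
from itertools import combinations

# Hypothetical CI oracle: 1 _||_ 2, 1 _||_ 4 | 3, 1 _||_ 4 | {2,3},
# 2 _||_ 4 | 3, 2 _||_ 4 | {1,3}.
CI = {(1, 2, frozenset()), (1, 4, frozenset({3})), (1, 4, frozenset({2, 3})),
      (2, 4, frozenset({3})), (2, 4, frozenset({1, 3}))}

def ci(a, b, S):
    """Symmetric lookup in the CI oracle."""
    return (a, b, frozenset(S)) in CI or (b, a, frozenset(S)) in CI

def dag_from_permutation(pi):
    """Edge pi(i) -> pi(j) for i < j unless the oracle separates the pair
    given all vertices preceding pi(j) except pi(i)."""
    edges = []
    for i, j in combinations(range(len(pi)), 2):
        S = set(pi[:j]) - {pi[i]}
        if not ci(pi[i], pi[j], S):
            edges.append((pi[i], pi[j]))
    return edges

print(dag_from_permutation((1, 2, 3, 4)))   # [(1, 3), (2, 3), (3, 4)]
```

Every permutation yields some DAG consistent with the ordering; the sparsest-permutation principle scores each permutation by the number of edges of Gπ.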

Caroline Uhler Mini-course: Graphical Models Toulouse, Nov 2019 12 / 20

Greedy SP algorithm [Mohammadi, Uhler, Wang & Yu, 2018]

CI relations: 1 ⊥⊥ 2, 1 ⊥⊥ 4 | 3, 1 ⊥⊥ 4 | {2, 3}, 2 ⊥⊥ 4 | 3, 2 ⊥⊥ 4 | {1, 3}

(Figure: greedy walk on the permutohedron over 4-node DAGs Gπ; the permutations (4,2,1,3), (2,4,3,1), (2,1,4,3), (1,2,4,3) and (2,4,1,3) are connected by the transpositions (2,4), (4,1) and (1,3).)
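A brute-force sparsest-permutation (SP) search over all 4! orderings makes the target of the greedy walk concrete; the same hypothetical CI oracle is assumed (GSP instead moves along permutohedron edges, only re-scoring neighboring permutations):

```python
from itertools import combinations, permutations

# Hypothetical CI oracle from the example above.
CI = {(1, 2, frozenset()), (1, 4, frozenset({3})), (1, 4, frozenset({2, 3})),
      (2, 4, frozenset({3})), (2, 4, frozenset({1, 3}))}

def ci(a, b, S):
    return (a, b, frozenset(S)) in CI or (b, a, frozenset(S)) in CI

def dag_from_permutation(pi):
    """DAG Gpi: edge pi(i) -> pi(j), i < j, unless separated by the oracle."""
    return [(pi[i], pi[j]) for i, j in combinations(range(len(pi)), 2)
            if not ci(pi[i], pi[j], set(pi[:j]) - {pi[i]})]

# Score every permutation by the number of edges of G_pi.
scores = {pi: len(dag_from_permutation(pi)) for pi in permutations((1, 2, 3, 4))}
best = min(scores.values())
best_perms = sorted(pi for pi, s in scores.items() if s == best)
print(best)         # 3
print(best_perms)   # [(1, 2, 3, 4), (2, 1, 3, 4)]
```

The two sparsest permutations are exactly the topological orderings of the DAG 1 → 3 ← 2, 3 → 4 that generated the CI relations.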

Caroline Uhler Mini-course: Graphical Models Toulouse, Nov 2019 13 / 20

Learning the interventional Markov equivalence class

GIES: perfect intervention adaptation of GES [Hauser & Bühlmann, 2012]

In general not consistent [Wang, Solus, Yang & Uhler, NIPS 2017]

IGSP: interventional adaptation of GSP: provably consistent algorithm that can deal with interventional data

for perfect interventions [Wang, Solus, Yang & Uhler, NIPS 2017]

for general interventions [Yang, Katcoff & Uhler, ICML 2018]

Note: While for perfect interventions it is sufficient to perform conditional independence tests, for general interventions we need to test whether a conditional distribution is invariant to the interventions

Caroline Uhler Mini-course: Graphical Models Toulouse, Nov 2019 14 / 20

Protein signaling network [Yang, Katcoff & Uhler, 2018]

Protein signaling network described by Sachs et al. (2005); 7466 measurements of the abundance of phosphoproteins and phospholipids recorded under different interventional experiments.

(Figure: signaling network over Raf, Mek, Erk, Akt, PKA, PKC, Jnk, P38, Plcγ, PIP2, PIP3.)

(a) Directed edge recovery (b) Skeleton recovery

Caroline Uhler Mini-course: Graphical Models Toulouse, Nov 2019 15 / 20

Perturb-seq data [Yang, Katcoff & Uhler, 2018]

After preprocessing: 992 observational samples and 13,435 interventional samples from 8 gene deletions; analyzed 24 genes of interest

Predicted effect of each intervention when leaving out that data

Much work remains to be done to deal with zero-inflated data, off-target intervention effects, and latent variables; see our recent work [arXiv:1906.00928, 1910.09014, 1910.09007]

Caroline Uhler Mini-course: Graphical Models Toulouse, Nov 2019 16 / 20

Tractable strategy to select interventions in batches under budget constraints for causal inference, with provable guarantees on both approximation and optimization quality based on submodularity [Agrawal, Squires, Yang, Shanmugam & Uhler, AISTATS 2019]

Causal inference and genomics

• Often interested in the difference of regulatory networks, e.g. between different cell types, normal and diseased states, etc.; learn the difference directly without estimating each network separately! [Wang, Squires, Belyaeva & Uhler, NeurIPS 2018]

(Figures) Difference network of naïve versus activated T-cells (estimated from single-cell RNA-seq); difference network of ovarian cancer cells from 2 patient cohorts with different survival rates

• If interested in learning a collection of regulatory networks, e.g. for different cell types, learn them jointly, since these networks are related! [Wang, Segarra & Uhler, arXiv:1804.00778]

Caroline Uhler Mini-course: Graphical Models Toulouse, Nov 2019 17 / 20

Statistical-computational trade-off

Open problem: Characterize the statistical-computational trade-off that is inherent to causal inference

(Figure: PC, GES, GSP and SP placed according to the strength of their statistical assumptions versus their computation time.)

What is the optimal algorithm for unlimited computation time? (Conjecture: SP algorithm)
How much weaker than faithfulness are SMR (necessary and sufficient assumption for SP) or the triangle-faithfulness assumption (only violations that are undetectable)?
What is the optimal trade-off curve?

Caroline Uhler Mini-course: Graphical Models Toulouse, Nov 2019 18 / 20

References

Peters: Causality script (http://web.math.ku.dk/~peters/jonas_files/scriptChapter1-4.pdf)
Pearl: The Book of Why (Basic Books, 2018)

Wright: Correlation and causation (J. Agricult. Res. 10, 1921)
Neyman: Sur les applications de la théorie des probabilités aux expériences agricoles: Essai des principes (Roczniki Nauk Rolniczych 10, 1923)
Spirtes, Glymour & Scheines: Causation, Prediction and Search (MIT Press, 2001)
Pearl: Causality: Models, Reasoning and Inference (Cambridge University Press, 2000)
Rubin: Causal inference using potential outcomes (J. Am. Stat. Ass. 100, 2005)
Robins: Association, causation, and marginal structural models (Synthese 121, 1999)

Verma & Pearl: Equivalence and synthesis of causal models (UAI, 1990)
Gillispie & Perlman: Enumerating Markov equivalence classes of acyclic digraph models (UAI, 2001)
Radhakrishnan, Solus & Uhler: A combinatorial perspective of Markov equivalence classes for DAG models (UAI, 2017)
Katz, Shanmugam, Squires & Uhler: Size of interventional Markov equivalence classes in random DAG models (AISTATS, 2019)

Caroline Uhler Mini-course: Graphical Models Toulouse, Nov 2019 19 / 20

References

Zhang & Spirtes: The three faces of faithfulness (Synthese 193, 2016)
Uhler, Raskutti, Bühlmann & Yu: Geometry of faithfulness assumption in causal inference (Ann. Statist. 41, 2013)

Chickering: Optimal structure identification with greedy search (J. Mach. Learn. Res. 3, 2002)
Friedman & Koller: Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks (Mach. Learn. 50, 2003)
Mohammadi, Uhler, Wang & Yu: Generalized permutohedra from probabilistic graphical models (SIAM J. Discr. Math. 32, 2018)
Agrawal, Broderick & Uhler: Minimal I-MAP MCMC for scalable structure discovery in causal DAG models (ICML, 2018)

Hauser & Bühlmann: Jointly interventional and observational data: estimation of interventional Markov equivalence classes of directed acyclic graphs (J. Royal Stat. Soc. 77, 2015)
Wang, Solus, Yang & Uhler: Permutation-based causal inference algorithms with interventions (NIPS, 2017)
Yang, Katcoff & Uhler: Characterizing and learning equivalence classes of causal DAGs under interventions (ICML, 2018)
Eberhardt, Glymour & Scheines: On the number of experiments sufficient and in the worst case necessary to identify all causal relations among n variables (UAI, 2005)
Agrawal, Squires, Yang, Shanmugam & Uhler: ABCD-Strategy: Budgeted experimental design for targeted causal structure discovery (AISTATS, 2019)

Caroline Uhler Mini-course: Graphical Models Toulouse, Nov 2019 20 / 20