
Precision estimation in Gaussian graphical models

Michał Makowski

Wydział Matematyki i Informatyki, Uniwersytet Wrocławski
[email protected]

June 4, 2020

Presentation plan

1. Overview
   - Intuition and examples
   - Background
   - Implementation
   - Results

2. Theory
   - Factorization
   - Multiple testing
   - Alternating direction method of multipliers

3. Even more theory

Graphical models

Graphical model theory generalizes and describes a broad range of statistical models:

- Markov models and hidden Markov models
- Bayesian networks
- Kalman filters
- neural networks

Intuitions

Graphical models bind together probability and graph theory:

- probability: uncertainty/randomness
- graph theory: dependence/correlation

Directed graph

[Figure: two toy directed graphical models. (a) The 'Garden' model: Rain and Sprinkler point to Wet grass, with e.g. P(sprinkler = off | rain) = 0.8. (b) The 'Car' model: Lack of fuel and Discharged battery point to Stalled engine.]

Undirected graph [1/3]

Undirected graph [2/3]

Undirected graph [3/3]

Gaussian distribution factorization

Any multivariate normal distribution N(µ, Σ) can be parameterized by the canonical parameters γ = Σ^{-1}µ and Θ = Σ^{-1}.

If X ∼ N(µ, Σ) factorizes according to some graph G, then θ_st = 0 for any pair (s, t) ∉ E.

This sets up a correspondence between the zero pattern of the matrix Θ and the edge pattern of the underlying graph. In particular, if θ_st = 0, then the variables s and t are conditionally independent given the other variables.
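A minimal numeric check of this correspondence in R (my own illustration, not from the slides): with θ_13 = 0, the conditional covariance of (X_1, X_3) given X_2, obtained via the Schur complement, has a zero off-diagonal entry.

```r
# Zero in Theta <=> conditional independence of the corresponding pair.
Theta <- matrix(c(2.0, 0.5, 0.0,
                  0.5, 2.0, 0.5,
                  0.0, 0.5, 2.0), nrow = 3, byrow = TRUE)  # theta_13 = 0
Sigma <- solve(Theta)
idx <- c(1, 3)
# Conditional covariance of (X1, X3) given X2 via the Schur complement:
cond_cov <- Sigma[idx, idx] -
  Sigma[idx, 2, drop = FALSE] %*% solve(Sigma[2, 2]) %*% Sigma[2, idx, drop = FALSE]
cond_cov[1, 2]  # ~0, so X1 and X3 are conditionally independent given X2
```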

Graph and matrix correspondence

[Figure: (c) an undirected graph G on six vertices; (d) the associated sparsity pattern of the precision matrix Θ, where white squares correspond to zero entries.]

MLE...

Let S be the sample covariance matrix.

MLE problem

\[
\widehat{\Theta}_{ML} \in \arg\max_{\Theta \in \mathcal{S}^p_+}\left\{\log\det\Theta - \mathrm{tr}(S\Theta)\right\}
\]

The MLE exists iff the matrix S is nonsingular. Then the solution to the problem above is simply given by Θ̂_ML = S^{-1}.

...and its problems

If the number of variables p is comparable to or greater than the number of observations N, then the sample covariance matrix S is singular, so the MLE does not exist.

Moreover, it is obvious that P(∃_{i,j} σ̂_ij = 0) = 0, but in many applications a sparse solution is demanded.

Regularization

The number of edges can be controlled by the ℓ0-based quantity
\[
\rho_0(\Theta) = \sum_{s \neq t} \mathbb{I}[\theta_{st} \neq 0].
\]
Note that ρ_0(Θ) = 2|E(G)| for a given graph G.

ℓ0-based problem
\[
\widehat{\Theta} \in \arg\max_{\Theta \in \mathcal{S}^p_+,\; \rho_0(\Theta) \le k}\left\{\log\det\Theta - \mathrm{tr}(S\Theta)\right\}
\]

Unfortunately, the ℓ0-based constraint defines a highly nonconvex constraint set, which makes the problem hard to solve.

Convex regularization

A convex relaxation of the ℓ0-based constraint leads to
\[
\mathcal{L}_\lambda(\Theta, X) = \log\det\Theta - \mathrm{tr}(S\Theta) - F(\Theta),
\]
where F(·) denotes any convex penalty function suitable for the considered problem.

Main motivation - FDR control

The performance of every binary classifier can be summarized in a confusion matrix:

                      Real value (+)    Real value (−)
Test outcome (+)      True positive     False positive
Test outcome (−)      False negative    True negative

(Local) False discovery rate (FDR)
\[
\mathrm{FDR} = \mathbb{E}\left[\frac{\#[\text{False positive}]}{\#[\text{False positive}] + \#[\text{True positive}]}\right],
\qquad
\mathrm{localFDR} = \mathbb{E}\left[\frac{\#[\text{False positive outside the component}]}{\#[\text{False positive}] + \#[\text{True positive}]}\right].
\]
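As a concrete illustration, the empirical FDR of an estimated sparsity pattern against the true one can be computed as follows (a hypothetical helper of my own, not from the slides):

```r
# Hypothetical helper: empirical FDR of an estimated sparsity pattern.
empirical_fdr <- function(theta_hat, theta_true) {
  est  <- (theta_hat  != 0) & upper.tri(theta_hat)   # discovered edges
  real <- (theta_true != 0) & upper.tri(theta_true)  # true edges
  fp <- sum(est & !real)
  tp <- sum(est & real)
  if (fp + tp == 0) 0 else fp / (fp + tp)
}
```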

Penalty

Lambda based on multiple testing theory:

- gLasso: Bonferroni correction (Banerjee et al.)
- gSLOPE: Holm method
- gSLOPE: Benjamini-Hochberg procedure

Lambda comparison

[Figure: comparison of the penalty (lambda) sequences for the three procedures: gLasso (Banerjee), gSLOPE (BH), and gSLOPE (Holm).]

Algorithms

To solve the Graphical SLOPE problem we used the alternating direction method of multipliers (ADMM), which can solve convex problems of the form

\[
\text{minimize } f(x) + g(y) \quad \text{subject to } Ax + By = c.
\]

To solve the Graphical Lasso problem we used the algorithm proposed by Friedman et al. in their original paper on the method. Although we also derived an ADMM-based algorithm, it was orders of magnitude slower than the original one.

Implementation overview

- Implementation in R, with the huge package for simulations.
- Various graph structures: cluster, hub, and scale-free.
- Data: p = 100, n ∈ {50, 100, 200, 400}; different magnitude ratios; different sparsity and component sizes.
- Two levels of desired FDR control: 0.05 and 0.2.
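A minimal sketch of this simulation setup using the huge package; the exact generator settings used in the study may have differed (g = 10 components is an assumption matching the cluster setup):

```r
library(huge)

set.seed(1)
sim <- huge.generator(n = 200, d = 100, graph = "cluster", g = 10)
X <- sim$data            # n x p Gaussian sample
Theta_true <- sim$omega  # true precision matrix
A_true <- sim$theta      # true adjacency (sparsity) pattern
S <- cov(X)              # sample covariance fed to the estimators
```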

Cluster results

Setup: α = 0.2, 100 variables, 10 components, cluster graph, P(x_ij ≠ 0) = 0.5.

[Figure: FDR, localFDR, and power vs. sample size n ∈ {50, 100, 200, 400}, one panel per magnitude ratio MR ∈ {−1, 0, 1.25, 1.43, 1.67, 2.5, 3.33, 5}, for gLasso (Banerjee), gSLOPE (BH), and gSLOPE (Holm).]

Cluster ROC

Setup: scaled α = 0.05, 100 variables, 10 components, cluster graph, MR = 3.33, P(x_ij ≠ 0) = 0.5.

[Figure: ROC curves (TPR vs. FPR) for gLasso (Banerjee), gSLOPE (BH), and gSLOPE (Holm).]

Hub results

Setup: α = 0.2, 100 variables, hub graph.

[Figure: FDR, localFDR, and power vs. sample size n ∈ {50, 100, 200, 400}, in panels over #components ∈ {5, 10, 20} and MR ∈ {0, 1.43, 3.33}, for gLasso (Banerjee), gSLOPE (BH), and gSLOPE (Holm).]

Non-cluster ROC

Setup (left): scaled α = 0.05, 100 variables, 10 components, hub graph, MR = 3.33.
Setup (right): scaled α = 0.05, 100 variables, scale-free graph, MR = 3.33.

[Figure: two panels of ROC curves (TPR vs. FPR), one per setup, for gLasso (Banerjee), gSLOPE (BH), and gSLOPE (Holm).]

Let's go deeper into theory...

Conditional independence

Two random variables X, Y are conditionally independent with respect to a random vector Z = (Z_1, ..., Z_n)^T iff their conditional distributions given Z are independent. This relationship is denoted by the symbol ⊥⊥.

Formally, for all triples (x, y, z),
\[
(X \perp\!\!\!\perp Y) \mid Z \iff F_{X,Y\mid Z=z}(x, y) = F_{X\mid Z=z}(x)\cdot F_{Y\mid Z=z}(y),
\]
where F_{X,Y|Z=z}(x, y) is the conditional CDF of X and Y given Z = z.

Conditional independence and graph structure

For a family of multivariate probability distributions one can construct a graph that connects the PDF factorization with the conditional independence property:

\[
\text{node } A \text{ is not connected with node } B \iff X_A \perp\!\!\!\perp X_B \mid X_{-AB},
\]
where X_{-AB} denotes all random variables except X_A and X_B.

Conditional independence and precision matrix

The canonical parameterization of the multivariate Gaussian distribution connects the precision matrix with the conditional independence property.

Let (X_1, ..., X_n) ∼ N(γ, Θ); then
\[
X_s \perp\!\!\!\perp X_t \mid X_{-st} \iff \theta_{st} = 0,
\]
where X_{-st} denotes all random variables except X_s and X_t.

Graph structure and precision matrix

There is a connection between the underlying graph and the precision matrix structure:

\[
\text{node } s \text{ is not connected with node } t \iff \theta_{st} = 0,
\]
where s and t are variable indices of the multivariate Gaussian distribution.

Multiple testing

Bonferroni correction
Reject the null hypothesis for each p_i ≤ α/m; this controls the FWER at level α, that is, FWER ≤ α. No other assumptions are required.

Holm method
Sort the p-values in ascending order and find the first k for which p_(k) > α/(m+1−k); reject every hypothesis before it. The Holm method controls the FWER at level α.

Benjamini-Hochberg method
Sort the p-values in ascending order and find the largest k for which p_(k) ≤ αk/m; reject hypotheses 1, ..., k. The BH method controls the FDR at level α.
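All three procedures are available in base R via p.adjust; a toy sketch on made-up p-values:

```r
p <- c(0.001, 0.008, 0.020, 0.041, 0.300)  # toy p-values
alpha <- 0.05
p.adjust(p, method = "bonferroni") <= alpha  # Bonferroni: FWER <= alpha
p.adjust(p, method = "holm") <= alpha        # Holm: FWER <= alpha
p.adjust(p, method = "BH") <= alpha          # Benjamini-Hochberg: FDR <= alpha
```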

Graphical Lasso

The convex relaxation of the ℓ0-based constraint leads to

\[
\mathcal{L}_\lambda(\Theta, X) = \log\det\Theta - \mathrm{tr}(S\Theta) - \lambda\|\Theta\|_1,
\]
where ‖·‖_1 denotes the entrywise off-diagonal ℓ1-norm, ‖A‖_1 = Σ_{i≠j} |a_ij|.

Graphical Lasso problem

\[
\widehat{\Theta} \in \arg\max_{\Theta \in \mathcal{S}^p_+}\left\{\log\det\Theta - \mathrm{tr}(S\Theta) - \lambda\|\Theta\|_1\right\}.
\]
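A minimal sketch of solving this problem in R with the glasso package (placeholder data; the argument rho plays the role of λ):

```r
library(glasso)

set.seed(1)
X <- matrix(rnorm(200 * 50), nrow = 200)  # placeholder data, n = 200, p = 50
S <- cov(X)
fit <- glasso(S, rho = 0.1)  # rho is the lambda penalty level
Theta_hat <- fit$wi          # estimated (sparse) precision matrix
```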

Graphical Lasso parameter choice

Banerjee lambda for Graphical Lasso
\[
\lambda^{\text{Banerjee}}(\alpha) = \max_{i<j}\sqrt{s_{ii}\,s_{jj}}\;\frac{qt_{n-2}\!\left(1-\frac{\alpha}{2p^2}\right)}{\sqrt{n-2+qt_{n-2}\!\left(1-\frac{\alpha}{2p^2}\right)^{2}}} \tag{1}
\]

The following theorem was formulated by Banerjee et al.

Theorem
Using (1) as the penalty parameter in the Graphical Lasso problem, for any fixed level α we obtain
\[
\mathbb{P}(\text{False Discovery}) \le \alpha.
\]
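A sketch of computing (1) in R; the max over sqrt(s_ii s_jj) follows my reading of the (garbled) slide formula and of Banerjee et al., so treat it as an assumption:

```r
lambda_banerjee <- function(S, n, alpha) {
  p <- ncol(S)
  q <- qt(1 - alpha / (2 * p^2), df = n - 2)
  ss <- sqrt(outer(diag(S), diag(S)))  # sqrt(s_ii * s_jj) for every pair
  max(ss[upper.tri(ss)]) * q / sqrt(n - 2 + q^2)
}
```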

SLOPE

In [Bog+15], Bogdan et al. proposed a novel approach to regularization in regression analysis: SLOPE uses the OL1 norm instead of the L1 norm for coefficient selection.

OL1 norm
The penalty based on the sorted ℓ1 norm (known as OL1, OWL, or OSCAR), for β ∈ R^p and λ ∈ R^p with λ_1 ≥ ... ≥ λ_p, is defined as
\[
J_\lambda(\beta) = \sum_{i=1}^{p} \lambda_i\,|\beta|_{(i)}.
\]
It was shown that, under some assumptions and with the λ sequence constructed from the BH procedure, the proposed method controls the FDR in multivariate regression settings.
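Evaluating J_λ is a one-liner, since it only pairs the sorted |β| with the (nonincreasing) λ sequence:

```r
J_lambda <- function(beta, lambda) {
  # lambda must be nonincreasing: lambda_1 >= ... >= lambda_p
  sum(lambda * sort(abs(beta), decreasing = TRUE))
}
J_lambda(c(0.3, -1.2, 0.5), lambda = c(1.0, 0.7, 0.4))  # 1.2*1.0 + 0.5*0.7 + 0.3*0.4
```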

Graphical SLOPE

In graphical SLOPE, the L1 norm from the graphical Lasso is replaced by the off-diagonal OL1 norm:

\[
\mathcal{L}_\lambda(\Theta, X) = \log\det\Theta - \mathrm{tr}(S\Theta) - J_\lambda(\Theta).
\]

Graphical SLOPE problem

\[
\widehat{\Theta} \in \arg\max_{\Theta \in \mathcal{S}^p_+}\left\{\log\det\Theta - \mathrm{tr}(S\Theta) - J_\lambda(\Theta)\right\},
\]

In [Sob19], P. Sobczyk showed that OL1 brings promising results in terms of FDR control.

Graphical SLOPE parameter choice (1/2)

Holm lambda for Graphical SLOPE
\[
m = \frac{p(p-1)}{2}, \qquad
\lambda_k^{\text{Holm}} = \frac{qt_{n-2}\!\left(1 - \frac{\alpha k}{m}\right)}{\sqrt{n - 2 + qt_{n-2}\!\left(1 - \frac{\alpha k}{m}\right)^{2}}}, \qquad
\lambda^{\text{Holm}} = \{\lambda_1^{\text{Holm}}, \lambda_2^{\text{Holm}}, \ldots, \lambda_m^{\text{Holm}}\}.
\]

It is based on the Holm method for multiple testing.

Graphical SLOPE parameter choice (2/2)

BH lambda for Graphical SLOPE
\[
m = \frac{p(p-1)}{2}, \qquad
\lambda_k^{\text{BH}} = \frac{qt_{n-2}\!\left(1 - \frac{\alpha}{m+1-k}\right)}{\sqrt{n - 2 + qt_{n-2}\!\left(1 - \frac{\alpha}{m+1-k}\right)^{2}}}, \qquad
\lambda^{\text{BH}} = \{\lambda_1^{\text{BH}}, \lambda_2^{\text{BH}}, \ldots, \lambda_m^{\text{BH}}\}.
\]

It is based on the Benjamini-Hochberg procedure for multiple testing.
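Both λ sequences, exactly as reconstructed from the two slides above (the labels follow the slides verbatim), in one sketch:

```r
gslope_lambda <- function(p, n, alpha, method = c("holm", "bh")) {
  method <- match.arg(method)
  m <- p * (p - 1) / 2
  k <- seq_len(m)
  q <- switch(method,
              holm = qt(1 - alpha * k / m,       df = n - 2),
              bh   = qt(1 - alpha / (m + 1 - k), df = n - 2))
  q / sqrt(n - 2 + q^2)  # nonincreasing in k, as SLOPE requires
}
```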

Precursors

The alternating direction method of multipliers (ADMM) is an algorithm that solves convex optimization problems by breaking them into smaller pieces, each of which is then easier to handle.

The precursors of ADMM:

- dual ascent and dual decomposition algorithms (decomposability properties)
- the method of multipliers (convergence properties)

ADMM

The ADMM algorithm can solve problems of the form
\[
\text{minimize } f(x) + g(y) \quad \text{subject to } Ax + By = c,
\]
where the functions f and g are assumed convex. The augmented Lagrangian with parameter ρ > 0 is defined as
\[
L_\rho(x, y, \nu) = f(x) + g(y) + \nu^T(Ax + By - c) + \frac{\rho}{2}\|Ax + By - c\|^2.
\]
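For completeness (this is the standard scheme, per Boyd et al.), the ADMM iterations alternate block minimizations of the augmented Lagrangian with a dual ascent step:
\[
x^{k+1} = \arg\min_x L_\rho(x, y^k, \nu^k), \qquad
y^{k+1} = \arg\min_y L_\rho(x^{k+1}, y, \nu^k), \qquad
\nu^{k+1} = \nu^k + \rho\,(Ax^{k+1} + By^{k+1} - c).
\]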

ADMM for gSLOPE

The optimization problem for graphical SLOPE is of the form

\[
\text{minimize } -\log\det\Theta + \mathrm{tr}(S\Theta) + \mathbb{I}[\Theta \succ 0] + J_\lambda(Y) \quad \text{subject to } Y = \Theta.
\]

The augmented Lagrangian L_ρ : R^{p×p} × R^{p×p} × R^{p×p} → R with parameter ρ > 0 is given by
\[
L_\rho(\Theta, Y, N) = -\log\det\Theta + \mathrm{tr}(S\Theta) + \mathbb{I}[\Theta \succ 0] + J_\lambda(Y) + \rho\langle N, \Theta - Y\rangle + \frac{\rho}{2}\|\Theta - Y\|_F^2.
\]

Bibliography

Onureena Banerjee, Laurent El Ghaoui, and Alexandre d'Aspremont. Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data. 2008.

[Bog+15] Małgorzata Bogdan et al. SLOPE - Adaptive variable selection via convex optimization. 2015.

Stephen Boyd et al. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. 2010.

Emmanuel Candès. Advanced Topics in Convex Optimization.

Trevor Hastie et al. Statistical Learning with Sparsity: The Lasso and Generalizations. 2015.

[Sob19] Piotr Sobczyk. Identifying low-dimensional structures through model selection in high-dimensional data. 2019.

Let's dive even deeper...

Factorization theorem

Compatibility function

Let G = (V, E) be a graph with vertex set V = {1, 2, ..., p} and let 𝒞 be its clique set. Let X = (X_1, ..., X_p) be a random vector defined on a probability space (Ω, F, P), indexed by the graph nodes.

Definition (Compatibility function)

Let C ∈ 𝒞 be a clique of the graph G and let X_C be the subvector of the vector X indexed by the elements of the clique C, that is, X_C = (X_s, s ∈ C). A real-valued function ψ_C of the vector X_C taking positive real values is called a compatibility function.

Factorization property

Definition (Factorization)


Given a collection of compatibility functions, we say that P factorizes over G if it has the decomposition
\[
\mathbb{P}(x_1, \ldots, x_n) = \frac{1}{Z}\prod_{C \in \mathcal{C}} \psi_C(x_C), \tag{2}
\]
where Z is the normalizing constant, known as the partition function. It is given by

\[
Z = \sum_{x}\prod_{C \in \mathcal{C}} \psi_C(x_C), \tag{3}
\]
where the sum goes over all possible realizations of X.

Markov property

Consider a cut set S of the given graph and recall that ⊥⊥ denotes the relation "is conditionally independent of". With this notation, we say that the random vector X is Markov with respect to G if
\[
X_A \perp\!\!\!\perp X_B \mid X_S \quad \text{for all cut sets } S \subset V, \tag{4}
\]

where X_A denotes the subvector indexed by the subgraph A, and A, B are the components separated by the cut set S.

Canonical formulation


Any nondegenerate multivariate normal distribution N(µ, Σ) can be reparameterized into canonical parameters of the form

\[
\gamma = \Sigma^{-1}\mu \quad \text{and} \quad \Theta = \Sigma^{-1}.
\]

Then the density function is given by
\[
\mathbb{P}_{\gamma,\Theta}(x) = \exp\left\{\sum_{s=1}^{p}\gamma_s x_s - \frac{1}{2}\sum_{s,t=1}^{p}\theta_{st}\,x_s x_t - A(\gamma, \Theta)\right\},
\]
where $A(\gamma, \Theta) = -\frac{1}{2}\left(\log\det[(2\pi)^{-1}\Theta] - \gamma^T\Theta^{-1}\gamma\right)$.

Canonical formula derivation

\begin{align*}
\mathbb{P}_{\mu,\Sigma}(x) &= \sqrt{\det[2\pi\Sigma]}^{\,-1} \exp\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right) \\
&= \sqrt{\det[(2\pi\Sigma)^{-1}]}\, \exp\left(-\tfrac{1}{2}x^T \Sigma^{-1} x + x^T \Sigma^{-1}\mu - \tfrac{1}{2}\mu^T \Sigma^{-1} \mu\right) \\
&= \sqrt{\det[(2\pi)^{-1}\Theta]}\, \exp\left(-\tfrac{1}{2}x^T \Theta\, x + x^T \gamma - \tfrac{1}{2}\gamma^T \Theta^{-1} \gamma\right) \\
&= \exp\left(-\tfrac{1}{2}x^T \Theta\, x + x^T \gamma - \left(-\tfrac{1}{2}\log\det[(2\pi)^{-1}\Theta] + \tfrac{1}{2}\gamma^T \Theta^{-1} \gamma\right)\right) \\
&= \exp\left(-\tfrac{1}{2}x^T \Theta\, x + x^T \gamma - A(\gamma, \Theta)\right) \\
&= \mathbb{P}_{\gamma,\Theta}(x)
\end{align*}

Log-likelihood derivation

Log-likelihood derivation (1/2)

Assuming centered data (so γ = 0 and A(Θ) = −½ log det[(2π)^{-1}Θ]),
\begin{align*}
\mathcal{L}(\Theta, X) &= \frac{1}{N}\sum_{i=1}^{N}\log \mathbb{P}_{\Theta}(x_i) \\
&= \frac{1}{N}\sum_{i=1}^{N}\left(-\frac{1}{2}x_i^T \Theta\, x_i - A(\Theta)\right) \\
&= \frac{1}{N}\sum_{i=1}^{N}\left(\frac{1}{2}\log\det[(2\pi)^{-1}\Theta] - \frac{1}{2}x_i^T \Theta\, x_i\right) \\
&= \frac{1}{2}\log\det[(2\pi)^{-1}\Theta] - \frac{1}{2N}\sum_{i=1}^{N} x_i^T \Theta\, x_i \\
&= \frac{1}{2}\log\det\Theta - \frac{p}{2}\log 2\pi - \frac{1}{2N}\sum_{i=1}^{N} x_i^T \Theta\, x_i = \ldots
\end{align*}

Log-likelihood derivation (2/2)

\begin{align*}
\ldots &= \frac{1}{2}\log\det\Theta - \frac{p}{2}\log 2\pi - \frac{1}{2N}\sum_{i=1}^{N} x_i^T \Theta\, x_i \\
&= \frac{1}{2}\log\det\Theta - \frac{p}{2}\log 2\pi - \frac{1}{2N}\sum_{i=1}^{N}\mathrm{tr}\left(x_i^T \Theta\, x_i\right) \\
&= \frac{1}{2}\log\det\Theta - \frac{p}{2}\log 2\pi - \frac{1}{2N}\,\mathrm{tr}\left(\sum_{i=1}^{N} x_i x_i^T\, \Theta\right) \\
&= \frac{1}{2}\log\det\Theta - \frac{p}{2}\log 2\pi - \frac{1}{2}\,\mathrm{tr}(S\Theta),
\end{align*}

where S is the empirical covariance matrix given by $S = \frac{1}{N}\sum_{i=1}^{N} x_i x_i^T$.

ADMM for Graphical SLOPE


Graphical SLOPE problem - ADMM formulation

\[
\text{minimize } -\log\det X + \mathrm{tr}(XS) + \mathbb{I}[X \succ 0] + J_\lambda(Y) \quad \text{subject to } X = Y.
\]

Graphical SLOPE problem - Augmented Lagrangian

\[
L_\rho(X, Y, N) = -\log\det X + \mathrm{tr}(XS) + \mathbb{I}[X \succ 0] + J_\lambda(Y) + \rho\langle N, X - Y\rangle + \frac{\rho}{2}\|X - Y\|_F^2.
\]

X-update (1/3)

We have
\[
X_k = \arg\min_{X} L_\rho(X, Y_{k-1}, N_{k-1}) = \arg\min_{X \succ 0}\left\{-\log\det X + \frac{\rho}{2}\left\|X - \tilde{S}_{k-1}\right\|_F^2\right\},
\]
where
\[
\tilde{S}_{k-1} = -N_{k-1} + Y_{k-1} - \frac{1}{\rho}S.
\]
The X-gradient of the augmented Lagrangian is given by
\[
\nabla_X L_\rho(X, Y_{k-1}, N_{k-1}) = -X^{-1} + \rho X - \rho\tilde{S}_{k-1}.
\]
As the augmented Lagrangian is convex, it is clear that for some $X^* \succ 0$
\[
\nabla_X L_\rho(X^*, Y_{k-1}, N_{k-1}) = -(X^*)^{-1} + \rho X^* - \rho\tilde{S}_{k-1} = 0.
\]

X-update (2/3)

Rewriting the equation as
\[
-(X^*)^{-1} + \rho X^* = \rho\tilde{S}_{k-1},
\]
we can find a matrix that meets this condition.

First, take the eigenvalue decomposition of the right-hand side,
\[
\rho\tilde{S}_{k-1} = \rho Q \Lambda Q^T.
\]
Then, multiplying by $Q^T$ on the left and by $Q$ on the right, we obtain
\[
-(\tilde{X}^*)^{-1} + \rho\tilde{X}^* = \rho\Lambda,
\]
where $\tilde{X}^* = Q^T X^* Q$.

X-update (3/3)

We have to find positive numbers $\tilde{x}^*_{ii}$ that satisfy
\[
(\tilde{x}^*_{ii})^2 - l_i\,\tilde{x}^*_{ii} - \frac{1}{\rho} = 0.
\]
Taking the positive root of this quadratic,
\[
\tilde{x}^*_{ii} = \frac{l_i + \sqrt{l_i^2 + 4/\rho}}{2}.
\]
Thus $X^*$ is given by $X^* = Q\tilde{X}^*Q^T$. All diagonal entries are positive since ρ > 0. Define $F_\rho(\Lambda)$ as
\[
F_\rho(\Lambda) = \mathrm{diag}\left(\frac{l_i + \sqrt{l_i^2 + 4/\rho}}{2}\right).
\]

Since
\[
X^* = Q\tilde{X}^*Q^T = Q\,F_\rho(\Lambda)\,Q^T = F_\rho\left(\tilde{S}_{k-1}\right) = F_\rho\left(-N_{k-1} + Y_{k-1} - \frac{1}{\rho}S\right),
\]

we obtain a formula for updating $X_k$ in each step.
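A sketch of this X-update in R, transcribing the derivation above directly (it assumes the scaled-dual form used on these slides):

```r
x_update <- function(S, Y, N, rho) {
  S_tilde <- -N + Y - S / rho
  e <- eigen(S_tilde, symmetric = TRUE)             # S_tilde = Q Lambda Q^T
  x <- (e$values + sqrt(e$values^2 + 4 / rho)) / 2  # positive roots
  e$vectors %*% diag(x) %*% t(e$vectors)            # X = Q F_rho(Lambda) Q^T
}
```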

Y-update

A formula for $Y_k$ is different. We have

\begin{align*}
Y_k &= \arg\min_{Y} L_\rho(X_k, Y, N_{k-1}) \\
&= \arg\min_{Y}\left\{J_\lambda(Y) + \frac{\rho}{2}\,\|Y - (X_k + N_{k-1})\|_F^2\right\}.
\end{align*}

The last line of the Y-update is a proximity operator, which has a closed-form solution for SLOPE:

\[
\arg\min_{Y}\left\{J_\lambda(Y) + \frac{\rho}{2}\,\|Y - (X_k + N_{k-1})\|_F^2\right\} = \mathrm{prox}_{J_\lambda,\rho}\left(X_k + N_{k-1}\right). \tag{5}
\]
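A sketch of the SLOPE prox in R: Bogdan et al. show it reduces to isotonic regression, which base R provides via isoreg. Note that $\mathrm{prox}_{J_\lambda,\rho}(v)$ equals the prox of $J_{\lambda/\rho}$ at $v$, so the scaling is handled by the caller:

```r
prox_sorted_l1 <- function(v, lambda) {
  # lambda must be nonincreasing; v is any numeric vector
  s <- sign(v)
  a <- abs(v)
  ord <- order(a, decreasing = TRUE)
  z <- a[ord] - lambda
  z <- rev(isoreg(rev(z))$yf)  # project onto the nonincreasing cone
  out <- numeric(length(v))
  out[ord] <- pmax(z, 0)       # clip at zero, restore order and signs
  s * out
}
```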

ADMM for Graphical SLOPE
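The algorithm listing on this slide did not survive extraction; below is a sketch of the full loop assembled from the X- and Y-updates above. It reuses x_update() and prox_sorted_l1() from the earlier sketches and takes a λ sequence such as the one from gslope_lambda(); the initialization and stopping rule are my assumptions.

```r
admm_gslope <- function(S, lambda, rho = 1, iters = 500, tol = 1e-6) {
  p <- ncol(S)
  X <- diag(p); Y <- diag(p); N <- matrix(0, p, p)
  for (k in seq_len(iters)) {
    X <- x_update(S, Y, N, rho)  # closed-form eigenvalue step
    V <- X + N
    Y <- V                       # diagonal entries are not penalized
    Y[upper.tri(Y)] <- prox_sorted_l1(V[upper.tri(V)], lambda / rho)
    Y[lower.tri(Y)] <- t(Y)[lower.tri(Y)]  # restore symmetry
    N <- N + X - Y               # scaled dual update
    if (norm(X - Y, "F") < tol) break
  }
  Y
}
```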

Thank you for your attention
