The Hammersley-Clifford Theorem and its Impact on Modern Statistics

Helge Langseth

Department of Mathematical Sciences, Norwegian University of Science and Technology

Outline

→ Historical review

→ Hammersley-Clifford’s theorem

→ Usage in

• Spatial models on a lattice

• Point processes

• Graphical models

• Markov Chain Monte Carlo

→ Conclusion

Markov chains in higher dimensions

[Figure: a one-dimensional Markov chain Xi−1 — Xi — Xi+1 and its two-dimensional lattice analogue centred on Xi,j; Paul Lévy (1948)]

→ Define the neighbouring set in the 2D-model:

N (xi,j) = {xi−1,j, xi+1,j, xi,j−1, xi,j+1}

→ Sought independence relations:

p(xi,j | x \ {xi,j}) = p(xi,j | N (xi,j))

Example: The Ising model (Ising, 1925):

→ Model for ferromagnetism

→ Xi,j ∈ {−1, 1},  Ei,j(x) = (−1/kT) · xi,j · Σ_{xℓ,m ∈ N (xi,j)} xℓ,m

→ p(x) = (1/Z) · exp(−Σ_{i,j} Ei,j(x))
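The sought independence relation can be checked directly from the Ising specification above. Below is a minimal numerical sketch (not from the original slides; it assumes numpy, a small lattice, and an arbitrary temperature) that computes p(xi,j | everything else) from the global energy; only the terms involving the four neighbours actually change, which is exactly the local Markov property the slide asks for.

```python
import numpy as np

rng = np.random.default_rng(0)
kT = 2.0                               # illustrative temperature parameter
x = rng.choice([-1, 1], size=(5, 5))   # a small spin configuration

def neighbours(i, j, shape):
    """The nearest-neighbour set N(x_{i,j}), staying inside the lattice."""
    cand = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [(k, l) for (k, l) in cand if 0 <= k < shape[0] and 0 <= l < shape[1]]

def total_energy(x):
    """Sum of E_{i,j}(x) = -(1/kT) * x_{i,j} * (sum of neighbouring spins) over all sites."""
    E = 0.0
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            s = sum(x[k, l] for (k, l) in neighbours(i, j, x.shape))
            E += -(1.0 / kT) * x[i, j] * s
    return E

def conditional(x, i, j):
    """p(x_{i,j} = +1 | all other spins), computed from the *global* energy.
    Every term not involving x_{i,j} cancels in the normalisation, so the
    result depends on the remaining spins only through the four neighbours."""
    w = {}
    for s in (+1, -1):
        y = x.copy()
        y[i, j] = s
        w[s] = np.exp(-total_energy(y))
    return w[+1] / (w[+1] + w[-1])

print(conditional(x, 2, 2))
```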

Defining the Markov models in two dimensions

[Figure: the lattice neighbourhood of Xi,j, shown once for each model formulation]

Joint model (Whittle, 1963): p(x) = Π_{i,j} Ψi,j(xi,j, N (xi,j))

Conditional model (Bartlett, 1966): p(xi,j | x \ {xi,j}) = p(xi,j | N (xi,j))

→ For nearest-neighbour systems: the class of joint models contains the class of conditional models (Brook, 1964)

→ Not known (at the time) how to define the full joint distribution from the conditional distributions

→ Severe constraints in Bartlett’s model
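One ingredient that later resolved this is Brook's (1964) factorization lemma, which lets the full conditionals determine the joint up to a constant. A sketch of the identity (not shown on the original slide), for any two configurations x and y with positive probability:

```latex
\frac{p(x)}{p(y)}
  = \prod_{i=1}^{n}
    \frac{p\!\left(x_i \mid x_1,\dots,x_{i-1},\, y_{i+1},\dots,y_n\right)}
         {p\!\left(y_i \mid x_1,\dots,x_{i-1},\, y_{i+1},\dots,y_n\right)}
```

Fixing a reference configuration y (for instance y = 0) recovers p(x) up to the normalising constant from the conditionals alone, which is also where the positivity condition enters.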

Besag (1972) on nearest neighbour systems

What is the most general form of the conditional probability functions that define a coherent joint function? And what will the joint look like?

→ Assume p(x) > 0, and define

Q(xi,j | xi−1,j, xi+1,j, xi,j−1, xi,j+1) = log [ p(xi,j | N (xi,j)) / p(0 | N (xi,j)) ]

→ Q(x | t, u, v, w) ≡ x · {ψ0(x) + t·ψ1(x, t) + u·ψ1(u, x) + v·ψ2(x, v) + w·ψ2(w, x)}

→ Let xB be the boundary, and xI = x \ xB. Then

p(xI | xB = 0) = (1/Z) · exp( Σ_{i,j} [ xi,j ψ0(xi,j) + xi−1,j ψ1(xi,j, xi−1,j) + xi,j−1 ψ2(xi,j, xi,j−1) ] )
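For binary data this general form specialises to the auto-logistic model that gives Besag's 1972 paper its title. As an illustration not on the original slide, with xi,j ∈ {0, 1} and constant parameters α and β, the conditionals take the logistic form:

```latex
p\!\left(x_{i,j}=1 \mid N(x_{i,j})\right)
  = \frac{\exp\!\big(\alpha + \beta \sum_{x_{\ell,m}\in N(x_{i,j})} x_{\ell,m}\big)}
         {1 + \exp\!\big(\alpha + \beta \sum_{x_{\ell,m}\in N(x_{i,j})} x_{\ell,m}\big)}
```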

Hammersley-Clifford's theorem - Notation

[Figure: a node Xi and its neighbourhood N (Xi) in the graph]

→ Define a graph G = (V, E), s.t. V = {X1, . . . , Xn} and

{Xi, Xj} ∈ E iff

p(xi | {x1, . . . , xn} \ {xi}) ≠ p(xi | {x1, . . . , xn} \ {xi, xj})

→ Define N (Xi) s.t. Xj ∈ N (Xi) iff {Xi, Xj} ∈ E

→ C ⊆ V is a clique iff C ⊆ {X} ∪ N (X) for all X ∈ C.
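A small sketch of this notation in code (not from the slides; it assumes the networkx package and a hand-picked edge set). It builds an undirected dependence graph, reads off a neighbourhood, and lists the maximal cliques that the factorization on the next slide runs over:

```python
import networkx as nx

# Hypothetical dependence structure: which full conditionals depend on which variables.
G = nx.Graph()
G.add_nodes_from(["X1", "X2", "X3", "X4"])
G.add_edges_from([("X1", "X2"), ("X2", "X3"), ("X2", "X4"), ("X3", "X4")])

# Neighbourhood N(Xi): all Xj with {Xi, Xj} in E.
print("N(X2) =", sorted(G.neighbors("X2")))          # ['X1', 'X3', 'X4']

# Maximal cliques; every subset of a maximal clique is also a clique,
# so these are enough to index the factors phi_C in the theorem.
print("maximal cliques:", list(nx.find_cliques(G)))
```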

Hammersley-Clifford's theorem - Result

Assume that p(x1, . . . , xn) > 0 (positivity condition). Then

p(x) = (1/Z) · Π_{C ∈ cl(G)} φC(xC)

Thus, the following are equivalent (given the positivity condition):

Local Markov property: p(xi | x \ {xi}) = p(xi | N (xi))

Factorization property: The probability factorizes according to the cliques of the graph

Global Markov property: p(xA | xB, xS) = p(xA | xS)

whenever xA and xB are separated by xS in G
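A minimal worked example of the equivalence (not on the original slides; it assumes the chain graph X1 − X2 − X3, whose cliques are {X1, X2} and {X2, X3}):

```latex
p(x_1,x_2,x_3) = \tfrac{1}{Z}\,\phi_{12}(x_1,x_2)\,\phi_{23}(x_2,x_3)
\quad\Longrightarrow\quad
p(x_1 \mid x_2, x_3)
  = \frac{\phi_{12}(x_1,x_2)}{\sum_{x_1'}\phi_{12}(x_1',x_2)}
  = p(x_1 \mid x_2)
```

So X1 ⊥ X3 | X2, which is both the local and the global Markov property for this graph.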

Hammersley-Clifford's theorem - Proof

Line of proof due to Besag (1974), who clarified the original “circuitous” proof by Hammersley & Clifford

→ Assume the positivity condition holds

→ Let Q(x) = log[p(x)/p(0)]

→ There exists a unique expansion of Q(x),

Q(x) = Σ_{1≤i≤n} xi Gi(xi) + Σ_{1≤i<j≤n} xi xj Gi,j(xi, xj) + · · · + x1 x2 · · · xn G1,2,...,n(x1, x2, . . . , xn)

→ Gi,j,...,s(xi, xj, . . . , xs) ≠ 0 only if {i, j, . . . , s} ∈ cl(G)
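The key step, sketched here as a hedged addition: writing x(i) for x with xi replaced by 0, positivity makes Q(x) − Q(x(i)) well defined, and

```latex
Q(x) - Q\big(x^{(i)}\big)
  = \log \frac{p(x_i \mid x_1,\dots,x_{i-1},x_{i+1},\dots,x_n)}
              {p(0 \mid x_1,\dots,x_{i-1},x_{i+1},\dots,x_n)}
```

which by the local Markov property depends only on xi and N (xi). Evaluating the expansion at configurations where only xi and xj are non-zero then forces Gi,j ≡ 0 whenever Xj ∉ N (Xi), and similarly for the higher-order terms, so only clique functions survive.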

Positivity condition: Historical implications

→ Hammersley & Clifford (1971) base their proof on the positivity condition:

p(x1, . . . , xn) > 0

→ They find the positivity condition unnatural, and postpone publication in the hope of relaxing it

→ They are thereby preceded by Besag (1974) in publishing the theorem

→ Moussouris (1974) shows by a counter-example involving only four variables that the positivity condition is required

Markov properties on DAGs

[Figure: an example DAG on X1, . . . , X6]

→ Define a DAG G→ = (V, E→) for a well-ordering X1 ≺ X2 ≺ · · · ≺ Xn s.t. V = {X1, . . . , Xn} (as before)

→ Assume Xj ≺ Xi. Then (Xj, Xi) ∈ E→ (i.e., Xj → Xi in G→) iff

p(xi | x1, . . . , xi−1) ≠ p(xi | x1, . . . , xj−1, xj+1, . . . , xi−1)

→ Define the parents of Xi as pa(Xi) = {Xj : (Xj, Xi) ∈ E→}

Directed factorization property: p(x) factorizes according to G→ iff p(x) = Π_i p(xi | pa(xi))
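A minimal sketch of the directed factorization (a toy example, not from the slides): a three-variable DAG X1 → X2 → X3 with hypothetical conditional tables, where the joint is simply the product of the per-node conditionals.

```python
from itertools import product

# Hypothetical CPTs for binary variables on the DAG X1 -> X2 -> X3.
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}    # p(x2 | x1)
p_x3_given_x2 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.25, 1: 0.75}}  # p(x3 | x2)

def joint(x1, x2, x3):
    """Directed factorization: p(x) = prod_i p(x_i | pa(x_i))."""
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]

# Sanity check: the factorized joint sums to one.
print(sum(joint(*cfg) for cfg in product((0, 1), repeat=3)))  # 1.0
```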

Markov properties on DAGs (cont'd)

[Figure: the moral graph of the example DAG on X1, . . . , X6]

→ Define the moral graph Gm = (V, Em) from G→ by connecting parents and dropping edge directions

→ Note that {Xi} ∪ pa(Xi) ∈ cl(Gm), i.e., the directed factorization relates to cl(Gm)

→ Local and Global Markov properties defined “as usual”

The following are equivalent even without the positivity condition (Lauritzen et al., 1990):

→ Factorization property

→ Local Markov property

→ Global Markov property
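A sketch of the moralization step (not from the slides; it assumes networkx and an arbitrary example DAG): marry all pairs of parents of each node, then drop the directions.

```python
import networkx as nx
from itertools import combinations

def moral_graph(dag: nx.DiGraph) -> nx.Graph:
    """Build G^m from a DAG: connect the parents of every node, drop directions."""
    moral = dag.to_undirected()
    for node in dag.nodes:
        for u, v in combinations(dag.predecessors(node), 2):
            moral.add_edge(u, v)       # "marry" the parents
    return moral

# Example DAG with a collider at X3, so its parents X1 and X2 get married.
dag = nx.DiGraph([("X1", "X3"), ("X2", "X3"), ("X3", "X4")])
print(sorted(moral_graph(dag).edges()))
# [('X1', 'X2'), ('X1', 'X3'), ('X2', 'X3'), ('X3', 'X4')]
```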

Spatial statistics

The theorem has had major implications in many areas of spatial statistics. Application areas include:

→ Quantitative geography (e.g., Besag, 1975)

→ Geographical analysis of the spread of diseases (e.g., Clayton & Kaldor, 1987)

→ Image analysis (e.g., Geman & Geman, 1984)

Markov Point Processes

[Figure: a point ξ and the ball of radius r that defines its neighbour set N (ξ|r)]

→ Consider a point process on, e.g., R^n

→ Let x = {x1, x2, . . . , xm} be the observed points

→ Define the neighbour set as N (ξ|r) = {xi : ||ξ − xi|| ≤ r}

→ A density function f is Markov if f(ξ | x) depends only on ξ and N (ξ) ∩ x

→ Ripley & Kelly (1977): f(x) is a Markov function iff there exist functions φC s.t. f(x) = (1/Z) · Π_{C ∈ cl(G)} φC(xC)
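A classic instance of such a Markov density is the Strauss process. The sketch below (an illustration not on the slides, assuming numpy and arbitrary parameter values) evaluates its unnormalised density and shows that the conditional for a new point ξ only looks at the points of x within distance r of ξ.

```python
import numpy as np

beta, gamma, r = 2.0, 0.5, 0.3   # hypothetical Strauss parameters

def n_close_pairs(points, r):
    """Number of neighbour pairs, i.e. pairs of points closer than r."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return int(np.sum(np.triu(d <= r, k=1)))

def strauss_unnormalised(points):
    """Unnormalised Strauss density: beta^n(x) * gamma^s(x)."""
    return beta ** len(points) * gamma ** n_close_pairs(points, r)

def conditional_intensity(xi, points):
    """f(xi | x) up to a constant: depends only on N(xi|r) ∩ x."""
    n_neigh = int(np.sum(np.linalg.norm(points - xi, axis=1) <= r))
    return beta * gamma ** n_neigh

x = np.random.default_rng(1).uniform(size=(6, 2))   # six points in the unit square
print(strauss_unnormalised(x), conditional_intensity(np.array([0.5, 0.5]), x))
```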


Log-linear models

→ The analysis of contingency tables was set into the framework of log-linear models in the 1970s

→ log p(x) = u∅ + Σ_i ui(xi) + · · · + u1...n(x1, . . . , xn)

→ Connection with Hammersley & Clifford's theorem made by Darroch et al. (1980):

• G is defined s.t. Xi and Xj are only connected if uij ≠ 0 (with consistency assumptions)

• A Hammersley & Clifford theorem can be proven for this structure

• Representational benefits follow for the class of graphical models
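A small illustration of that connection (not on the slides): for three variables with all terms involving the pair {1, 3} set to zero, the model

```latex
\log p(x) = u_{\emptyset} + u_1(x_1) + u_2(x_2) + u_3(x_3)
          + u_{12}(x_1,x_2) + u_{23}(x_2,x_3)
```

corresponds to the graph X1 − X2 − X3, and the Hammersley-Clifford factorization gives X1 ⊥ X3 | X2 directly from the missing interaction terms.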

MCMC and the Gibbs sampler

→ Metropolis-Hastings algorithm: Define a Markov chain which has a desired distribution π(·) as its unique stationary distribution

Algorithm:

1. Initialization: x(0) ← fixed value

2. For i = 1, 2, . . .:

i) Sample y from q(y | x(i−1))

ii) Define αy ← [π(y) · q(x(i−1) | y)] / [π(x(i−1)) · q(y | x(i−1))]

iii) x(i) ← y with p = min{1, αy}, and x(i) ← x(i−1) with p = max{0, 1 − αy}
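A compact sketch of the algorithm (an illustration not from the slides, assuming numpy, a standard-normal target π and a Gaussian random-walk proposal q):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_pi(x):
    """Log of the (unnormalised) target density: standard normal."""
    return -0.5 * x ** 2

def metropolis_hastings(n_steps, x0=0.0, step=1.0):
    x = x0
    samples = []
    for _ in range(n_steps):
        y = x + step * rng.normal()                 # symmetric proposal, so q cancels
        alpha = np.exp(log_pi(y) - log_pi(x))       # acceptance ratio
        if rng.uniform() < min(1.0, alpha):
            x = y                                   # accept
        samples.append(x)                           # otherwise keep x^(i-1)
    return np.array(samples)

draws = metropolis_hastings(10_000)
print(draws.mean(), draws.std())   # roughly 0 and 1
```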

MCMC and the Gibbs sampler (cont'd)

→ Geman & Geman (1984): Metropolis-Hastings for high-dimensional x

→ Problem: How to sample y and calculate αy efficiently?

→ Solution: Flip only one element xj(i) at a time:

x(i+1) = (x1(i), . . . , xj−1(i), xj(i+1), xj+1(i), . . . , xn(i))

→ q(y | x(i)) is defined by the conditional probability p(xj | x(i)):

p(xj(i+1) | x(i)) = (1/Zj) · Π_{C: Xj ∈ C} φC(xC(i))

→ αy = 1 for the Gibbs sampler

→ An algorithm of constant time complexity can be designed!
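Tying the pieces together, here is a minimal single-site Gibbs sweep for the Ising model from the start of the talk (a sketch not from the slides, assuming numpy and the same energy definition as earlier). Because of the clique factorization, the full conditional for one spin only needs the energy terms that involve that spin, i.e. its four neighbours, so each update costs constant time.

```python
import numpy as np

rng = np.random.default_rng(0)
kT = 2.0
x = rng.choice([-1, 1], size=(16, 16))

def neighbours(i, j, shape):
    cand = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [(k, l) for (k, l) in cand if 0 <= k < shape[0] and 0 <= l < shape[1]]

def local_energy(x, i, j):
    """Sum of the energy terms E_{k,l}(x) that involve x_{i,j}: the site itself
    and its neighbours. All other terms cancel in the full conditional."""
    sites = [(i, j)] + neighbours(i, j, x.shape)
    E = 0.0
    for (k, l) in sites:
        s = sum(x[a, b] for (a, b) in neighbours(k, l, x.shape))
        E += -(1.0 / kT) * x[k, l] * s
    return E

def gibbs_sweep(x):
    """One pass of single-site Gibbs updates; alpha_y = 1, nothing is rejected."""
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            w = {}
            for s in (+1, -1):
                x[i, j] = s
                w[s] = np.exp(-local_energy(x, i, j))
            p_plus = w[+1] / (w[+1] + w[-1])
            x[i, j] = +1 if rng.uniform() < p_plus else -1
    return x

for _ in range(20):
    x = gibbs_sweep(x)
print("mean magnetisation:", x.mean())
```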

Too much of a good thing?

→ Global properties from local models:

• Model error dominates (e.g. Rue and Tjelmeland, 2002)

• The critical temperature of the Ising model

“Beware — Gibbs can be dangerous!” Spiegelhalter et al. (1995): The BUGS v0.5 manual, p. 1

→ Alternative representations:

• Bayesian networks (e.g. Pearl, 1988)

• Vines (e.g. Bedford and Cooke, 2001)

Clifford's (MCMC) conclusion

“. . . from now on we can compare our data with the model we actually want to use rather than with a model which has some mathematically convenient form. This is surely a revolution.”

Dr. Peter Clifford (1993), The Royal Statistical Society meeting on the Gibbs sampler and other statistical Markov Chain Monte Carlo methods

Journal of the Royal Statistical Society, Series B, 55(1), p. 53

References

I have benefited from the opinions of Peter Clifford, A. Philip Dawid, Steffen L. Lauritzen, David J. Spiegelhalter, and Håvard Rue on these issues.

→ Adrian Baddeley and Jesper Møller (1989): Nearest-Neighbour Markov Point Processes and Random Sets. International Statistical Review, 57, pp. 89–121.

→ Tim J. Bedford and Roger M. Cooke (2001): Probability density decomposition for conditionally dependent random variables modelled by vines. Annals of Mathematics and AI, 32, pp. 245–268.

→ Julian Besag (1972): Nearest-neighbour Systems and the Auto-logistic Model for Binary Data. Journal of the Royal Statistical Society, Series B, 34, pp. 75–83.

→ Julian Besag (1974): Spatial Interaction and the Statistical Analysis of Lattice Systems. Journal of the Royal Statistical Society, Series B, 36, pp. 192–236.

→ Julian Besag (1975): Statistical Analysis of Non-lattice Data. The Statistician, 24, pp. 179–195.

→ Julian Besag (1991): Spatial Statistics in the Analysis of Agricultural Field Experiments. In: Spatial Statistics and Digital Image Analysis. Washington, D.C.: National Academy Press.

References (cont'd)

→ Peter Clifford (1990): Markov Random Fields in Statistics. In: Geoffrey Grimmett and Dominic Welsh (Eds.), Disorder in Physical Systems: A Volume in Honour of John M. Hammersley, pp. 19–32. Oxford University Press.

→ Peter Clifford (1993): Discussion on the meeting on the Gibbs sampler and other statistical Markov Chain Monte Carlo methods. Journal of the Royal Statistical Society, Series B, 55, pp. 53–102.

→ John N. Darroch, Steffen L. Lauritzen, and Terry P. Speed (1980): Markov fields and log-linear interaction models for contingency tables. Annals of Statistics, 8, pp. 522–539.

→ Stuart Geman and Donald Geman (1984): Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, pp. 721–741.

→ John M. Hammersley and Peter Clifford (1971): Markov fields on finite graphs and lattices. Unpublished.

→ S. L. Lauritzen, A. P. Dawid, B. N. Larsen, and H.-G. Leimer (1990): Independence Properties of Directed Markov Fields. Networks, 20, pp. 491–505.

→ John Moussouris (1974): Gibbs and Markov Random Systems with Constraints. Journal of Statistical Physics, 10, pp. 11–33.

→ Brian D. Ripley and Frank P. Kelly (1977): Markov point processes. Journal of the London Mathematical Society, 15, pp. 188–192.